Statistical Thinking For Managerial Decisions
This Web site is a course in statistics appreciation; i.e., acquiring a feeling for the statistical
way of thinking. It contains various useful concepts and topics at many levels of learning
statistics for decision making under uncertainties. The cardinal objective for this Web site is
to increase the extent to which statistical thinking is merged with managerial thinking for
good decision making under uncertainty.
MENU
Chapter 1: Towards Statistical Thinking for Decision Making
Chapter 2: Descriptive Sampling Data Analysis
Chapter 3: Probability as a Confidence Measuring Tool for Statistical Inference
Chapter 4: Necessary Conditions for Statistical Decision Making
Chapter 5: Estimators and Their Qualities
Chapter 6: Hypothesis Testing: Rejecting a Claim
Chapter 7: Hypotheses Testing for Means and Proportions
Chapter 8: Tests for Statistical Equality of Two or More Populations
Chapter 9: Applications of the Chi-square Statistic
Chapter 10: Regression Modeling and Analysis
Chapter 11: Unified Views of Statistical Decision Technologies
Chapter 12: Index Numbers and Ratios with Applications
Exercise Your Knowledge to Enhance What You Have Learned (PDF)
JavaScript E-labs Learning Objects
Excel for Statistical Data Analysis
A Why List: Frequently Asked Statistical Questions (Word.Doc)
Formulas Concerning the Mean(s) (PDF), Print to enlarge
After This Course Is Over: Statistical Concepts You Need For Life (Word.Doc)
What Maths Do I Need for This Course? (Word.Doc), A Sample of "How Things Can
Go Wrong?"
To search the site, try Edit | Find in page [Ctrl + F]. Enter a word or phrase in the dialogue
box, e.g., "parameter" or "probability". If the first appearance of the word/phrase is not
what you are looking for, try Find Next.
This site builds up the basic ideas of business statistics systematically and correctly. It is a
combination of lectures and computer-based practice, joining theory firmly with practice. It
introduces techniques for summarizing and presenting data, estimation, confidence intervals
and hypothesis testing. The presentation focuses more on understanding of key concepts and
statistical thinking, and less on formulas and calculations, which can now be done on small
computers through user-friendly statistical JavaScript applets. A Spanish version of this site is
available at Razonamiento Estadístico para la Toma de Decisiones Gerenciales, together with its
collection of JavaScript learning objects.
Today's good decisions are driven by data. In all aspects of our lives, and importantly in the
business context, an amazing diversity of data is available for inspection and analytical
insight. Business managers and professionals are increasingly required to justify decisions on
the basis of data. They need statistical model-based decision support systems.
Statistical skills enable them to intelligently collect, analyze and interpret data relevant to
their decision-making. Statistical concepts and statistical thinking enable them to:
This Web site is a course in statistics appreciation; i.e., acquiring a feel for the
statistical way of thinking. It hopes to make sound statistical thinking
understandable in business terms. An introductory course in statistics, it is
designed to provide you with the basic concepts and methods of statistical
analysis for processes and products. Materials in this Web site are tailored to
help you make better decisions and to get you thinking statistically. A cardinal
objective for this Web site is to embed statistical thinking into managers, who
must often decide with little information.
Just like the weather, if you cannot control something, you should learn how
to measure and analyze it in order to predict it effectively.
If you have taken statistics before and felt unable to grasp the concepts, it may
largely be due to having been taught by non-statistician instructors. Their
deficiencies lead students to develop phobias about the sweet science of
statistics. In this respect, Professor Herman Chernoff (1996) made the
following remark:
1. In general, people do not like statistics and therefore they try to avoid it.
2. There is pressure to produce scientific papers; however, one is often confronted with "I need
something quick."
3. At many institutes around the world there are only a few statisticians (often just one), if any at
all. This means that these people are extremely busy. As a result, they tend to advise
simple, easy-to-apply techniques, or else they have to do the work themselves. For my
teaching philosophy statements, you may like to visit the Web site On Learning &
Teaching.
4. Communication between a statistician and a decision-maker can be difficult. One
speaks in statistical jargon; the other thinks in terms of the monetary or utilitarian benefit of
following the statistician's recommendations.
Plugging numbers into formulas and crunching them has no value by itself. You
should continue to put effort into the concepts and concentrate on interpreting the results.
Even when you solve a small problem by hand, I would like you to use the available
computer software and Web-based computation to do the dirty work for you.
You must be able to read the logic behind any formula, not memorize it. For example,
in computing the variance, consider its formula. Instead of memorizing it, you should start
by asking why:
This example shows how to question statistical formulas rather than memorizing them. In
fact, when you try to understand a formula, you do not need to remember it; it becomes
part of your brain's connectivity. Clear thinking is always more important than the ability to do
arithmetic.
When you look at a statistical formula, the formula should talk to you, just as a musician
hears the music when looking at a piece of sheet music.
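As an illustration of this way of reading a formula, here is a minimal computational sketch (not part of the original text; the small data set is made up) that walks through the sample variance step by step, with a "why" attached to each step:

    data = [1, 2, 3, 6]                      # any small data set will do
    n = len(data)

    mean = sum(data) / n                     # why the mean? it is the balance point of the data
    deviations = [x - mean for x in data]    # why deviations? they measure spread about the mean
    # why square them? raw deviations always sum to zero, so they would cancel out
    squared = [d ** 2 for d in deviations]
    # why divide by n - 1? using the sample mean costs one degree of freedom
    variance = sum(squared) / (n - 1)
    standard_deviation = variance ** 0.5     # back to the original units of the data

    print(mean, sum(deviations), variance, standard_deviation)

Each line corresponds to a question you can ask of the formula; once the questions are answered, there is little left to memorize.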
JavaScript, once an esoteric scripting language for animating Web pages, is now a full-fledged
platform for building the JavaScript E-labs' learning objects with useful applications. Just as you
used to do experiments in physics labs to learn physics, computer-assisted learning enables
you to use any online interactive tool available on the Internet to perform experiments. The
purpose is the same; i.e., to understand statistical concepts by using statistical applets that
are entertaining and educational.
The main objective of this course is to learn statistical thinking; to emphasize
concepts more, and theory and recipes less; and finally to foster
active learning using useful and interesting Web sites. It has already
been said that "Statistical thinking will one day be as necessary for efficient
citizenship as the ability to read and write." So, let's be ahead of our time.
Further Readings:
Chernoff H., A Conversation With Herman Chernoff, Statistical Science, Vol. 11, No. 4, 335-350, 1996.
Churchman C., The Design of Inquiring Systems, Basic Books, New York, 1971. Early in the book he
stated that knowledge could be considered as a collection of information, or as an activity, or as a
potential. He also noted that knowledge resides in the user and not in the collection.
Rustagi M., et al. (eds.), Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His
Sixtieth Birthday, Academic Press, 1983.
The original idea of"statistics" was the collection of information about and for
the"state". The word statistics derives directly, not from any classical Greek or
Latin roots, but from the Italian word for state.
Probability originated from the study of games of chance and gambling during
the 16th century. Probability theory was a branch of mathematics studied by
Blaise Pascal and Pierre de Fermat in the seventeenth century. Currently, in the
21st century, probabilistic modeling is used to control the flow of traffic
through a highway system, a telephone interchange, or a computer processor;
to determine the genetic makeup of individuals or populations; and in quality control,
insurance, investment, and other sectors of business and industry.
New and ever growing diverse fields of human activities are using statistics;
however, it seems that this field itself remains obscure to the public. Professor
Bradley Efron expressed this fact nicely:
Further Readings:
Daston L., Classical Probability in the Enlightenment, Princeton University Press, 1988.
The book points out that early Enlightenment thinkers could not face uncertainty. A mechanistic,
deterministic machine, was the Enlightenment view of the world.
David H., and A. Edwards, Annotated Readings in the History of Statistics, Springer, 2001. Offers a
general historical collection of the probability and statistical literature.
Gillies D., Philosophical Theories of Probability, Routledge, 2000. Covers the classical, logical,
subjective, frequency, and propensity views.
Hacking I., The Emergence of Probability, Cambridge University Press, London, 1975. A philosophical
study of early ideas about probability, induction and statistical inference.
Hald A., A History of Probability and Statistics and Their Applications before 1750, Wiley, 2003.
Peters W., Counting for Something: Statistical Principles and Personalities, Springer, New York, 1987. It
teaches the principles of applied economic and social statistics in a historical context. Featured topics
include public opinion polls, industrial quality control, factor analysis, Bayesian methods, program
evaluation, non-parametric and robust methods, and exploratory data analysis.
Porter T., The Rise of Statistical Thinking, 1820-1900, Princeton University Press, 1986. The author
states that statistics has become known in the twentieth century as the mathematical tool for
analyzing experimental and observational data. Enshrined by public policy as the only reliable basis
for judgments as the efficacy of medical procedures or the safety of chemicals, and adopted by
business for such uses as industrial quality control, it is evidently among the products of science
whose influence on public and private life has been most pervasive. Statistical analysis has also come
to be seen in many scientific disciplines as indispensable for drawing reliable conclusions from
empirical (i.e., observed) results. This new field of mathematics found so extensive a domain of
applications.
Stigler S., The History of Statistics: The Measurement of Uncertainty Before 1900, U. of Chicago Press,
1990. It covers the people, ideas, and events underlying the birth and development of early statistics.
Tankard J., The Statistical Pioneers, Schenkman Books, New York, 1984.
This work provides the detailed lives and times of theorists whose work continues to shape much of
modern statistics.
Frequently, for example, marketing managers are faced with the question:
What sample size do I need? This is an important and common statistical
decision, which should be given due consideration, since an inadequate sample
size invariably leads to wasted resources. The sample size determination
section provides a practical solution to this risky decision.
Statistical models are currently used in various fields of business and science.
However, the terminology differs from field to field. For example, the fitting
of models to data is called calibration, history matching, or data assimilation,
all of which are synonymous with parameter estimation.
The above figure depicts the fact that as the exactness of a statistical model
increases, the level of improvement in decision-making increases. That's why
we need Business Statistics. Statistics arose from the need to place knowledge
on a systematic evidence base. This required a study of the rules of
computational probability, the development of measures of data properties and
relationships, and so on.
The notion of "wisdom" in the sense of practical wisdom has entered Western
civilization through biblical texts. In the Hellenic experience this kind of
wisdom received a more structural character in the form of philosophy. In this
sense philosophy also reflects one of the expressions of traditional wisdom.
Fortunately the probabilistic and statistical methods for analysis and decision making under
uncertainty are more numerous and powerful today than ever before. The computer makes
possible many practical applications. A few examples of business applications are the
following:
An auditor can use random sampling techniques to audit the accounts receivable for
clients.
A plant manager can use statistical quality control techniques to assure the quality of
his production with a minimum of testing or inspection.
A financial analyst may use regression and correlation to help understand the
relationship of a financial ratio to a set of other variables in business.
A market researcher may use tests of significance to accept or reject hypotheses
about a group of buyers to which the firm wishes to sell a particular product.
A sales manager may use statistical techniques to forecast sales for the coming year.
1. Objectives or Hypotheses: What are the objectives of the study or the questions to be
answered? What is the population to which the investigators intend to refer their
findings?
2. Statistical Design: Is the study a planned experiment (i.e., primary data), or an
analysis of records (i.e., secondary data)? How is the sample to be selected? Are there
possible sources of selection bias which would make the sample atypical or non-
representative? If so, what provision is to be made to deal with this bias? What is the
nature of the control group, standard of comparison, or cost? Remember that
statistical modeling means reflection before action.
3. Observations: Are there clear definitions of the variables, including the classifications,
measurements (and/or counts), and outcomes? Is the method of classification or
of measurement consistent for all the subjects and relevant to Item No. 1? Are there
possible biases in measurement (and/or counting) and, if so, what provisions must be
made to deal with them? Are the observations reliable and replicable (to defend your
findings)?
4. Analysis: Are the data sufficient and worthy of statistical analysis? If so, are the
necessary conditions of the methods of statistical analysis appropriate to the source
and nature of the data? The analysis must be correctly performed and interpreted.
5. Conclusions: Which conclusions are justifiable by the findings? Which are not? Are
the conclusions relevant to the questions posed in Item No. 1?
6. Representation of Findings: Are the findings represented clearly, objectively, and in
sufficient but non-technical terms and detail to enable the decision-maker (e.g., a
manager) to understand and judge them for himself? Are the findings internally
consistent; i.e., do the numbers add up properly? Can the different representations be
reconciled?
7. Managerial Summary: When your findings and recommendation(s) are not clearly
put, or framed in an appropriate manner understandable by the decision maker, then
the decision maker does not feel convinced of the findings and therefore will not
implement any of the recommendations. You have wasted the time, money, etc. for
nothing.
Further Readings:
Corfield D., and J. Williamson, Foundations of Bayesianism, Kluwer Academic Publishers, 2001. Contains Logic,
Mathematics, Decision Theory, and Criticisms of Bayesianism.
Lapin L., Statistics for Modern Business Decisions, Harcourt Brace Jovanovich, 1987.
Pratt J., H. Raiffa, and R. Schlaifer, Introduction to Statistical Decision Theory, The MIT Press, 1994.
The main objective of Business Statistics is to make inferences (e.g., prediction, making
decisions) about certain characteristics of a population based on information contained in a
random sample from the entire population. The condition for randomness is essential to make
sure the sample is representative of the population.
Business Statistics is the science of ‘good' decision making in the face of uncertainty and is
used in many disciplines, such as financial analysis, econometrics, auditing, production and
operations, and marketing research. It provides knowledge and skills to interpret and use
statistical techniques in a variety of business applications. A typical Business Statistics course
is intended for business majors, and covers statistical study, descriptive statistics (collection,
description, analysis, and summary of data), probability, and the binomial and normal
distributions, test of hypotheses and confidence intervals, linear regression, and correlation.
At the planning stage of a statistical investigation, the question of sample size (n) is critical.
For example, the sample size for sampling from a finite population of size N is set at N½ + 1,
rounded up to the nearest integer. Clearly, a larger sample provides more relevant
information, and as a result a more accurate estimation and better statistical judgment
regarding tests of hypotheses.
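As a small illustration of this rule of thumb (a sketch, not from the original text), the rule N½ + 1, rounded up, can be computed directly:

    import math

    def planning_sample_size(N: int) -> int:
        # rule of thumb stated above: the square root of the population size,
        # plus one, rounded up to the nearest integer
        return math.ceil(math.sqrt(N) + 1)

    for N in (100, 1000, 10000):
        print(N, planning_sample_size(N))    # 100 -> 11, 1000 -> 33, 10000 -> 101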
Under-lit Streets and the Crime Rate: It is a fact that where residential city streets are under-lit,
more major crimes take place. Suppose you work in the Mayor's office and are put
in charge of helping decide which manufacturer to buy the light bulbs from
in order to reduce the crime rate by at least a certain amount, given that there is a limited
budget.
The above figure illustrates the idea of statistical inference from a random sample about the
population. It also provides estimation for the population's parameters; namely the expected
value µx, the standard deviation σx, and the cumulative distribution function (cdf) Fx, and their
corresponding sample statistics, the sample mean x̄, the sample standard deviation Sx, and the
empirical (i.e., observed) cumulative distribution function (cdf), respectively.
The major task of statistics is the scientific methodology for collecting, analyzing, and
interpreting a random sample in order to draw inference about some particular characteristic
of a specific homogeneous population. For two major reasons, it is often impossible to study
an entire population:
Given you already have a realization set of a random sample, to compute the
descriptive statistics including those in the above figure, you may like using
Descriptive Statistics JavaScript.
Business statistics has grown with the art of constructing charts and tables! It
is a science of basing decisions on numerical data in the face of uncertainty.
While business statistics cannot replace the knowledge and experience of the
decision maker, it is a valuable tool that the manager can employ to assist in
the decision making process in order to reduce the inherent risk, measured by,
e.g., the standard deviation σ.
Among other useful questions, you may ask why we are interested in
estimating the population's expected value µ and its standard deviation σ.
Here are some applicable reasons. Business Statistics must provide justifiable
answers to the following concerns for every consumer and producer:
Like all professions, statisticians have their own keywords and phrases to
ease precise communication. However, one must interpret the results of any
decision making in a language that is easy for the decision-maker to
understand. Otherwise, he/she will not believe in what you recommend, and
therefore will not go into the implementation phase. This lack of
communication between statisticians and managers is the major roadblock
to using statistics.
Example: The population for a study of infant health might be all children
born in the U.S.A. in the 1980's. The sample might be all babies born on the 7th of
May in any of those years.
Design of experiments is a key tool for increasing the rate of acquiring new
knowledge. Knowledge in turn can be used to gain competitive advantage,
shorten the product development cycle, and produce new products and
processes which will meet and exceed your customer's expectations.
Primary data and Secondary data sets: If the data are from a planned
experiment relevant to the objective(s) of the statistical investigation, collected
by the analyst, it is called a Primary Data set. However, if some condensed
records are given to the analyst, it is called a Secondary Data set.
Probability: Probability (i.e., probing for the unknown) is the tool used for
anticipating what the distribution of data should look like under a given
model. Random phenomena are not haphazard: they display an order that
emerges only in the long run and is described by a distribution. The
mathematical description of variation is central to statistics. The probability
required for statistical inference is not primarily axiomatic or combinatorial,
but is oriented toward describing data distributions.
Within a population, a parameter is a fixed value that does not vary. Each
sample drawn from the population has its own value of any statistic that is
used to estimate this parameter. For example, the mean of the data in a sample
is used to give information about the overall mean in the population from
which that sample was drawn.
It is possible to draw more than one sample from the same population, and the
value of a statistic will in general vary from sample to sample. For example,
the average value in a sample is a statistic. The average values in more than
one sample, drawn from the same population, will not necessarily be equal.
Statistics are often assigned Roman letters (e.g., x̄ and s), whereas the
equivalent unknown values in the population (parameters) are assigned Greek
letters (e.g., µ and σ).
Example: Suppose the manager of a shop wanted to know µ, the mean
expenditure of customers in her shop in the last year. She could calculate the
average expenditure of the hundreds (or perhaps thousands) of customers who
bought goods in her shop; that is, the population mean µ. Instead she could
use an estimate of this population mean by calculating the mean of a
representative sample of customers. If this value were found to be $25, then
$25 would be her estimate.
The principal descriptive quantity derived from sample data is the mean (x̄),
which is the arithmetic average of the sample data. It serves as the most
reliable single measure of the value of a typical member of the sample. If the
sample contains a few values that are so large or so small that they have an
exaggerated effect on the value of the mean, the sample is more accurately
represented by the median -- the value where half the sample values fall below
and half above.
The quantities most commonly used to measure the dispersion of the values
about their mean are the variance s2 and its square root, the standard deviation
s. The variance is calculated by determining the mean, subtracting it from each
of the sample values (yielding the deviation of the samples), and then
averaging the squares of these deviations. The mean and standard deviation of
the sample are used as estimates of the corresponding characteristics of the
entire group from which the sample was drawn. They do not, in general,
completely describe the distribution (Fx) of values within either the sample or
the parent group; indeed, different distributions may have the same mean and
standard deviation. They do, however, provide a complete description of the
normal distribution, in which positive and negative deviations from the mean
are equally common, and small deviations are much more common than large
ones. For a normally distributed set of values, a graph showing the dependence
of the frequency of the deviations upon their magnitudes is a bell-shaped
curve. About 68 percent of the values will differ from the mean by less than
the standard deviation, and almost 100 percent will differ by less than three
times the standard deviation.
Notice that to be able to estimate the population parameters, the sample size n
must be greater than one. For example, with a sample size of one, the variation
(s²) within the sample is 0/1 = 0. An estimate for the variation (σ²) within the
population would be 0/0, which is an indeterminate quantity, meaning
estimation is impossible.
I'm glad that you're overcoming all the confusions that exist in learning
statistics.
Quantitative data sets consist of measures that take numerical values for which
descriptions such as means and standard deviations are meaningful. They can
be put into an order and further divided into two groups: discrete data or
continuous data.
Discrete data are countable data and are collected by counting, for example,
the number of defective items produced during a day's production.
Continuous data are collected by measuring and are expressed on a
continuous scale. For example, measuring the height of a person.
Data come in the forms of Nominal, Ordinal, Interval, and Ratio (remember
the French word NOIR for the color black). Data can be either continuous or
discrete.
Levels of Measurements
                          Nominal    Ordinal    Interval/Ratio
Ranking?                    no         yes           yes
Numerical difference?       no         no            yes
Both the zero point and the units of measurement are arbitrary on the Interval
scale. While the unit of measurement is arbitrary on the Ratio scale, its zero
point is a natural attribute. The categorical variable is measured on an ordinal
or nominal scale.
The following are the advantages and/or necessities for sampling in statistical
decision making:
Further Reading:
Thompson S., Sampling, Wiley, 2002.
Sampling Methods
From the food you eat to the television you watch, from political elections to
school board actions, much of your life is regulated by the results of sample
surveys.
A sample is generally selected for study because the population is too large to
study in its entirety. The sample should be representative of the general
population. This is often best achieved by random sampling. Also, before
collecting the sample, it is important that one carefully and completely defines
the population, including a description of the members to be included.
ni = n(Ni/N)
items at random from sub-population i, i = 1, 2, ..., k. The estimated variance of the stratified
sample mean is:
Σ Wt²(Nt - nt)St² / [nt(Nt - 1)],
and of the estimated population total:
Σ Nt²(Nt - nt)St² / [nt(Nt - 1)],
where Wt = Nt/N is the weight of stratum t. For a simple random sample, the variance of the
sample mean is:
Var(x̄) = S²(1 - n/N)/n,
where n/N is the sampling fraction. For a sampling fraction of less than 10%, the
finite population correction factor (N - n)/(N - 1) is almost 1. For binary data,
S² = p(1 - p)(1 - n/N)/(n - 1).
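The following sketch (not from the original text; the numbers are illustrative) shows two of the calculations above: proportional allocation of a total sample of size n across strata of sizes Ni, and the variance of the sample mean under simple random sampling with the finite population correction:

    def proportional_allocation(n, stratum_sizes):
        # ni = n * (Ni / N), rounded to whole items for illustration
        N = sum(stratum_sizes)
        return [round(n * Ni / N) for Ni in stratum_sizes]

    def var_sample_mean(S2, n, N):
        # Var(x-bar) = S^2 * (1 - n/N) / n, where n/N is the sampling fraction
        return S2 * (1 - n / N) / n

    print(proportional_allocation(100, [500, 300, 200]))   # [50, 30, 20]
    print(var_sample_mean(S2=4.0, n=50, N=1000))           # 4.0 * 0.95 / 50 = 0.076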
Determination of sample sizes (n) with regard to binary data: Smallest integer
greater than or equal to:
Further Reading:
Thompson S., Sampling, Wiley, 2002.
Statistical Summaries
Mean: The arithmetic mean (or the average, simple mean) is computed by
summing all numbers in an array of numbers (xi) and then dividing by the
number of observations (n) in the array.
The mean uses all of the observations, and each observation affects the mean.
Even though the mean is sensitive to extreme values; i.e., extremely large or
small data can cause the mean to be pulled toward the extreme data; it is still
the most widely used measure of location. This is due to the fact that the mean
has valuable mathematical properties that make it convenient for use with
inferential statistical analysis. For example, the sum of the deviations of the
numbers in a set of data from the mean is zero, and the sum of the squared
deviations of the numbers in a set of data from the mean is the minimum
value.
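These two properties are easy to verify numerically; the following sketch (not from the original text; the data are made up) checks that the deviations from the mean sum to zero and that the sum of squared deviations is smaller about the mean than about any other point:

    data = [2, 4, 7, 9, 13]
    mean = sum(data) / len(data)                 # 7.0

    print(sum(x - mean for x in data))           # 0.0 (up to rounding error)

    def sum_squared_deviations(center):
        return sum((x - center) ** 2 for x in data)

    # the sum of squared deviations is minimized at the mean:
    print(sum_squared_deviations(mean),          # 74
          sum_squared_deviations(mean + 1),      # 79
          sum_squared_deviations(mean - 2))      # 94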
Generally, the median provides a better measure of location than the mean
when there are some extremely large or small observations; i.e., when the data
are skewed to the right or to the left. For this reason, median income is used as
the measure of location for U.S. household income. Note that if the median
is less than the mean, the data set is skewed to the right. If the median is
greater than the mean, the data set is skewed to the left. For a normal
population, the sample median is distributed normally with mean µ, and the
standard error of the median is (π/2)½ times the standard error of the mean.
The mean has two distinct advantages over the median. It is more stable, and
one can compute the mean based on two samples by combining the two means.
When the mean and the median are known, it is possible to estimate the mode
for the unimodal distribution using the other two averages as follows:
Whenever more than one mode exists, the population from which the
sample came is a mixture of more than one population, as shown, for example,
in the following bimodal histogram.
A Mixture of Two Different Populations
Almost all standard statistical analyses are conditioned on the assumption that
the population is homogeneous.
Notice that Excel has very limited statistical capability. For example, it
displays only one mode, the first one found. Unfortunately, this is very misleading.
However, you may find out whether there are others by inspection, as follows:
create a frequency distribution by invoking the menu sequence Tools, Data
Analysis, Frequency and following the instructions on the screen. You will see the
frequency distribution and can then find the modes visually. Unfortunately, Excel
does not draw a Stem-and-Leaf diagram. All commercial off-the-shelf statistical
software, such as SAS and SPSS, displays a Stem-and-Leaf diagram, which is a
frequency distribution of a given data set.
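The inspection procedure can also be automated; the sketch below (not from the original text) builds a frequency count and reports every value that attains the maximum frequency, so a second mode is not silently hidden:

    from collections import Counter

    def all_modes(data):
        counts = Counter(data)                   # frequency of each distinct value
        top = max(counts.values())
        return sorted(v for v, c in counts.items() if c == top)

    sample = [2, 3, 3, 3, 5, 7, 8, 8, 8, 9]
    print(all_modes(sample))                     # [3, 8]: a hint of a mixed, bimodal population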
The first consideration is the type of data: if the variable is categorical, the
mode is the single measure that best describes the data.
The second consideration in selecting the index is to ask whether the total of
all observations is of any interest. If the answer is yes, then the mean is the
proper index of central tendency.
If the total is of no interest, then depending on whether the histogram is
symmetric or skewed, one should use either the mean or the median, respectively.
In all cases the histogram must be unimodal. Notice, however, that a
Uniform distribution has an uncountable number of modes of equal density
value; it is therefore considered a homogeneous population.
|Mean - Median|
The Main Characteristics of the Mode, the Median, and the Mean

Fact No. 1
The Mode: It is the most frequent value in the distribution; it is the point of greatest density.
The Median: It is the value of the middle point of the array (not the midpoint of the range), such that half the items are above it and half below it.
The Mean: It is the value in a given aggregate which would obtain if all the values were equal.

Fact No. 2
The Mode: The value of the mode is established by the predominant frequency, not by the values in the distribution.
The Median: The value of the median is fixed by its position in the array and does not reflect the individual values.
The Mean: The sums of the deviations on either side of the mean are equal; hence, the algebraic sum of the deviations is equal to zero.

Fact No. 3
The Mode: It is the most probable value, hence the most typical.
The Median: The aggregate distance between the median point and all the values in the array is less than from any other point.
The Mean: It reflects the magnitude of every value.

Fact No. 4
The Mode: A distribution may have two or more modes. On the other hand, there is no mode in a rectangular distribution.
The Median: Each array has one and only one median.
The Mean: An array has one and only one mean.

Fact No. 5
The Mode: The mode does not reflect the degree of modality.
The Median: It cannot be manipulated algebraically: medians of subgroups cannot be weighted and combined.
The Mean: Means may be manipulated algebraically: means of subgroups may be combined when properly weighted.

Fact No. 6
The Mode: It cannot be manipulated algebraically: modes of subgroups cannot be combined.
The Median: It is stable in that grouping procedures do not affect it appreciably.
The Mean: It may be calculated even when individual values are unknown, provided the sum of the values and the sample size n are known.

Fact No. 7
The Mode: It is unstable in that it is influenced by grouping procedures.
The Median: Values must be ordered, and may be grouped, for computation.
The Mean: Values need not be ordered or grouped for this calculation.

Fact No. 8
The Mode: Values must be ordered and grouped for its computation.
The Median: It can be computed when the ends of the distribution are open.
The Mean: It cannot be calculated from a frequency table when the ends are open.

Fact No. 9
The Mode: It can be calculated when the table ends are open.
The Median: It is stable in that grouping procedures do not seriously affect it.
The Mean: It is not applicable to qualitative data.
The Descriptive Statistics JavaScript provides a complete set of information about all
statistics that you ever need. You might like to use it to perform some numerical
experimentation for validating the above assertions for a deeper understanding.
The Geometric Mean: The geometric mean (G) of n non-negative numerical values is the nth
root of the product of the n values.
If some values are very large in magnitude and others are small, then the geometric mean is a
better representative of the data than the simple average. In a"geometric series", the most
meaningful average is the geometric mean (G). The arithmetic mean is very biased toward
the larger numbers in the series.
An Application: Suppose sales of a certain item increase to 110% in the first year and to
150% of that in the second year. For simplicity, assume you sold 100 items initially. Then the
number sold in the first year is 110 and the number sold in the second is 150% × 110 = 165.
The arithmetic average of 110% and 150% is 130%, so that we would incorrectly estimate
that the number sold in the first year is 130 and the number in the second year is 169. The
geometric mean of 110% and 150% is G = (1.10 × 1.50)½ = (1.65)½ ≈ 1.28, so that we would
correctly estimate that we would sell 100·G² = 165 items in the second year.
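A quick numerical check of this example (a sketch, not from the original text):

    growth = [1.10, 1.50]                        # 110% and 150% of the previous year

    arithmetic_mean = sum(growth) / len(growth)  # 1.30, which overstates two-year growth
    G = (growth[0] * growth[1]) ** 0.5           # geometric mean = 1.65 ** 0.5, about 1.2845

    print(100 * arithmetic_mean ** 2)            # 169.0 items: the incorrect estimate
    print(100 * G ** 2)                          # 165.0 items: matches 100 x 1.10 x 1.50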
The Harmonic Mean: The harmonic mean (H) is another specialized average, which is
useful in averaging variables expressed as a rate per unit of time, such as miles per hour or
number of units produced per day. The harmonic mean (H) of n non-zero numerical values
x(i) is: H = n/[Σ(1/x(i))].
An Application: Suppose 4 machines in a machine shop are used to produce the same part.
However, each of the four machines takes 2.5, 2.0, 1.5, and 6.0 minutes to make one part,
respectively. What is the average rate of speed?
The harmonic mean is: H = 4/[(1/2.5) + (1/2.0) + (1/1.5) + (1/6.0)] = 2.31 minutes.
If all machines work for one hour, how many parts will be produced? Since four machines
running for one hour represent 240 minutes of operating time: 240 / 2.31 ≈ 104 parts
will be produced.
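A quick check of the machine-shop arithmetic (a sketch, not from the original text):

    minutes_per_part = [2.5, 2.0, 1.5, 6.0]

    H = len(minutes_per_part) / sum(1 / t for t in minutes_per_part)
    print(round(H, 2))                           # 2.31 minutes per part, on average

    # four machines running for one hour give 240 machine-minutes of operating time:
    print(round(240 / H))                        # about 104 parts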
The Order Among the Three Means: If all the three means exist, then the Arithmetic Mean
is never less than the other two, moreover, the Harmonic Mean is never larger than the other
two.
You might like to use The Other Means JavaScript in performing some numerical
experimentation for validating the above assertions for a deeper understanding.
Further Reading:
Langley R., Practical Statistics Simply Explained, 1970, Dover Press.
A histogram is a graphical presentation of an estimate for the density (for continuous random
variables) or probability mass function (for discrete random variables) of the population.
The geometric feature of histogram enables us to find out useful information about the data,
such as:
The mode is the most frequently occurring value in a set of observations. Data may have two
modes. In this case, we say the data are bimodal, and sets of observations with more than two
modes are referred to as multimodal. Whenever, more than one mode exist, then the
population from which the sample came is a mixture of more than one population. Almost all
standard statistical analyses are conditioned on the assumption that the population is
homogeneous, meaning that its density (for continuous random variables) or probability mass
function (for discrete random variables) is unimodal. However, notice that, e.g., a Uniform
distribution has uncountable number of modes having equal density value; therefore it is
considered as a homogeneous population.
To check the unimodality of sampling data, one may use the histogramming process.
where Log is the logarithm in base 10, and n is the total number of the numerical values
which comprise the data set.
To have an"optimum" you need some measure of quality -- presumably in this case, the"best"
way to display whatever information is available in the data. The sample size contributes to
this; so the usual guidelines are to use between 5 and 15 classes, with more classes, if you
have a larger sample. You should take into account a preference for tidy class widths,
preferably a multiple of 5 or 10, because this makes it easier to understand.
Beyond this it becomes a matter of judgement. Try out a range of class widths, and choose
the one that works best. This assumes you have a computer and can generate alternative
histograms fairly readily.
There are often management issues that come into play as well. For example, if your data is
to be compared to similar data -- such as prior studies, or from other countries -- you are
restricted to the intervals used therein.
If the histogram is very skewed, then unequal classes should be considered. Use narrow
classes where the class frequencies are high, wide classes where they are low.
Let n be the sample size; then the number of class intervals could be taken as the smaller of
n½ and 10·Log(n), where Log is the logarithm in base 10. Thus for 200 observations you would
use 14 intervals, but for 2000 you would use 33.
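A sketch of this guideline (not from the original text; the rule is taken as the smaller of n½ and 10·Log(n), which reproduces the two worked numbers above):

    import math

    def number_of_classes(n: int) -> int:
        # smaller of sqrt(n) and 10 * log10(n), rounded to the nearest integer
        return round(min(math.sqrt(n), 10 * math.log10(n)))

    for n in (200, 2000):
        print(n, number_of_classes(n))           # 200 -> 14, 2000 -> 33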
A BoxPlot is a graphical display that has many characteristics. It includes the presence of
possible outliers. It illustrates the range of data. It shows a measure of dispersion such as the
upper quartile, lower quartile and interquartile range (IQR) of the data set as well as the
median as a measure of central location, which is useful for comparing sets of data. It also
gives an indication of the symmetry or skewness of the distribution. The main reason for the
popularity of boxplots is that they offer much of information in a compact way.
1. Horizontal lines are drawn at the smallest observation (A), the lower quartile (B),
the upper quartile (D), and the largest observation (E). Vertical lines joining the
lower and upper quartile lines (at points B and D) produce the box.
2. A vertical line is drawn at the median point (C), as shown in the above figure.
For a deeper understanding, you may like using graph paper, and Descriptive Sampling
Statistics JavaScript in constructing the BoxPlots for some sets of data; e.g., from your
textbook.
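The five values behind a boxplot can also be computed directly; the sketch below (not from the original text; the quartile interpolation rule is one of several common conventions) returns the smallest observation (A), lower quartile (B), median (C), upper quartile (D), and largest observation (E):

    def five_number_summary(data):
        xs = sorted(data)
        n = len(xs)

        def percentile(p):                       # linear interpolation; conventions vary
            k = p * (n - 1)
            lo, hi = int(k), min(int(k) + 1, n - 1)
            return xs[lo] + (k - lo) * (xs[hi] - xs[lo])

        return xs[0], percentile(0.25), percentile(0.50), percentile(0.75), xs[-1]

    A, B, C, D, E = five_number_summary([3, 7, 8, 5, 12, 14, 21, 13, 18])
    print(A, B, C, D, E, "IQR =", D - B)         # 3 7 12 14 21 IQR = 7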
Average by itself is not a good indication of quality. You need to know the variance to make
any educated assessment. We are reminded of the dilemma of the six-foot tall statistician who
drowned in a stream that had an average depth of three feet.
Statistical measures are often used for describing the nature and extent of differences among
the information in the distribution. A measure of variability is generally reported together
with a measure of central tendency.
Statistical measures of variation are numerical values that indicate the variability inherent in
a set of data measurements. Note that a small value for a measure of dispersion indicates that
the data are concentrated around the mean; therefore, the mean is a good representative of the
data set. On the other hand, a large measure of dispersion indicates that the mean is not a
good representative of the data set. Also, measures of dispersion can be used when we want
to compare the distributions of two or more sets of data. Quality of a data set is measured by
its variability: Larger variability indicates lower quality. That is why high variation makes
the manager very worried. Your job, as a statistician, is to measure the variation, and if it is
too high and unacceptable, then it is the job of the technical staff, such as engineers, to fix the
process.
Decision situations with complete lack of knowledge, known as the flat uncertainty, have
the largest risk. For simplicity, consider the case when there are only two outcomes, one with
probability of p. Then, the variation in the outcomes is p(1-p). This variation is the largest if
we set p = 50%. That is, equal chance for each outcome. In such a case, the quality of
information is at its lowest level.
Remember, quality of information and variation are inversely related. The larger the
variation in the data, the lower the quality of the data (i.e., information): the Devil is in the
Deviations.
The four most common measures of variation are the range, variance, standard deviation,
and coefficient of variation.
Range: The range of a set of observations is the absolute value of the difference between the
largest and smallest values in the data set. It measures the size of the smallest contiguous
interval of real numbers that encompasses all of the data values. It is not useful when extreme
values are present. It is based solely on two values, not on the entire data set. In addition, it
cannot be defined for open-ended distributions such as Normal distribution.
Notice that, when dealing with discrete random observations, some authors define the range
as:
Range = Largest value - Smallest value + 1.
A normal distribution does not have a range. As a student once put it: since the tails of a normal
density function never touch the x-axis, and since for an observation to contribute to forming
such a curve arbitrarily large positive and negative values must be possible, no finite range
exists. Such remote values are always possible, but increasingly improbable. This encapsulates
the asymptotic behavior of the normal density very well. Therefore, in spite of this behavior, the
normal distribution is useful and applicable to a wide range of decision-making situations.
Quartiles: When we order the data, for example in ascending order, we may divide the data
into quarters, Q1…Q4, known as quartiles. The first Quartile (Q1) is that value where 25% of
the values are smaller and 75% are larger. The second Quartile (Q2) is that value where 50%
of the values are smaller and 50% are larger. The third Quartile (Q3) is that value where 75%
of the values are smaller and 25% are larger.
Percentiles: Percentiles have a similar concept and therefore, are related; e.g., the 25th
percentile corresponds to the first quartile Q1, etc. The advantage of percentiles is that they
may be subdivided into 100 parts. The percentiles and quartiles are most conveniently read
from a cumulative distribution function, as depicted in the following figure.
Empirical Cumulative Distribution Function as an Informative Tool
Interquartile Range: The interquartile range (IQR) describes the extent to which the
middle 50% of the observations are scattered or dispersed. It is the distance between the first and
the third quartiles:
IQR = Q3 - Q1,
which is twice the Quartile Deviation. For data that are skewed, the relative dispersion,
similar to the coefficient of variation (C.V.) is given (provided the denominator is not zero)
by the Coefficient of Quartile Variation:
Note that almost all statistics that we have covered up to now can be obtained and understood
deeply by graphical method using Empirical (i.e., observed) Cumulative Distribution
Function (ECDF) JavaScript. However, the numerical Descriptive Statistics provides a
complete set of information about all statistics that you ever need.
The Duality between the ECDF and the Histogram: Notice that the height of the empirical (i.e.,
observed) cumulative distribution function (ECDF) at a particular point is numerically equal to
the area in the corresponding histogram to the left of that point. Therefore, either or both could
be used, depending on the intended application.
Mean Absolute Deviation (MAD): A simple measure of variability is the mean absolute
deviation:
MAD = Σ |xi - x̄| / n.
The mean absolute deviation is widely used as a performance measure to assess the quality of
modeling, such as in forecasting techniques. However, MAD does not lend itself to further use
in making inferences; moreover, even in error analysis studies, the variance is preferred,
since variances of independent (i.e., uncorrelated) errors are additive; MAD does
not have such a nice feature.
The MAD is a simple measure of variability, which unlike range and quartile deviation, takes
every item into account, and it is simpler and less affected by extreme deviations. It is
therefore often used in small samples that include extreme values.
The mean absolute deviation theoretically should be measured from the median, since it is at
its minimum; however, it is more convenient to measure the deviations from the mean.
As a numerical example, consider the price (in $) of same item at 5 different stores: $4.75,
$5.00, $4.65, $6.10, and $6.30. The mean absolute deviation from the mean is $0.67, while
from the median is $0.60, which is a better representative of deviation among the prices.
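A quick check of this example (a sketch, not from the original text):

    prices = [4.75, 5.00, 4.65, 6.10, 6.30]

    mean = sum(prices) / len(prices)                                      # 5.36
    median = sorted(prices)[len(prices) // 2]                             # 5.00 (odd sample size)

    mad_from_mean = sum(abs(p - mean) for p in prices) / len(prices)      # about 0.67
    mad_from_median = sum(abs(p - median) for p in prices) / len(prices)  # 0.60

    print(round(mad_from_mean, 2), round(mad_from_median, 2))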
The variance is a measure of spread or dispersion among values in a data set. Therefore, the
greater the variance, the lower the quality.
The variance is not expressed in the same units as the observations. In other words, the
variance is hard to understand because the deviations from the mean are squared, making it
too large for logical explanation. This problem can be solved by working with the square
root of the variance, which is called the standard deviation.
Standard Deviation: Both variance and standard deviation provide the same information;
one can always be obtained from the other. In other words, the process of computing a
standard deviation always involves computing a variance. Since standard deviation is the
square root of the variance, it is always expressed in the same units as the raw data:
For large data sets (say, more than 30), approximately 68% of the data are contained within
one standard deviation of the mean, and 95% are contained within two standard deviations.
About 99.7% (i.e., almost 100%) of the data are contained within three standard deviations (S)
of the mean.
You may use Descriptive Statistics JavaScript to compute the mean, and standard deviation.
The Mean Square Error (MSE) of an estimate is the variance of the estimate plus the
square of its bias; therefore, if an estimate is unbiased, then its MSE is equal to its variance,
as it is the case in the ANOVA table.
CV = 100 |S/x̄| %
The coefficient of variation is used to represent the relationship of the standard deviation to
the mean, telling how representative the mean is of the numbers from which it came. It
expresses the standard deviation as a percentage of the mean; i.e., it reflects the variation in a
distribution relative to the mean. However, confidence intervals for the coefficient of
variation are rarely reported. One of the reasons is that the exact confidence interval for the
coefficient of variation is computationally tedious.
Note that, for a skewed or grouped data set, the coefficient of quartile variation:
You may use Descriptive Statistics to compute the mean, standard deviation and the
coefficient of variation.
Variation Ratio for Qualitative Data: Since the mode is the most frequently used measure
of central tendency for qualitative variables, variability is measured with reference to the
mode. The statistic that describes the variability of qualitative data is the Variation Ratio
(VR):
VR = 1 - fm/n,
where fm is the frequency of the mode, and n is the total number of scores in the distribution.
Z Score: A Z score indicates how many standard deviations a given point (i.e., observation) is
above or below the mean. In other words, a Z score represents the number of standard
deviations that an observation (x) is above or below the mean. The larger the absolute value of
Z, the further away the value is from the mean. Note that values beyond three standard
deviations are very unlikely.
Note that if a Z score is negative, the observation (x) is below the mean. If the Z score is
positive, the observation (x) is above the mean. The Z score is found as:
Z = (x - x̄) / standard deviation of X
The Z score is a measure of the number of standard deviations that an observation is above or
below the mean. Since the standard deviation is never negative, a positive Z score indicates
that the observation is above the mean, a negative Z score indicates that the observation is
below the mean. Note that Z is a dimensionless value, and therefore is a useful measure by
which to compare data values from two different populations, even those measured by
different units.
One of the nice features of the z-transformation is that the resulting distribution of the
transformed data has an identical shape but with mean zero, and standard deviation equal to
1.
One can generalize this data transformation to have any desirable mean and standard
deviation other than 0 and 1, respectively. Suppose we wish the transformed data to have the
mean and standard deviation of M and D, respectively. For example, in the SAT Scores, they
are set at M = 500, and D=100. The following transformation should be applied:
Z = (standard Z) D + M
Suppose you have two data sets with very different scales (e.g., one has very low values,
another very high values). If you wish to compare these two data sets, due to differences in
scales, the statistics that you generate are not comparable. It is a good idea to use the Z-
transformation of both original data sets and then make any comparison.
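The two-step recipe above can be sketched as follows (not from the original text; the raw scores are made up), converting raw scores to standard z scores and then rescaling them to a chosen mean M and standard deviation D, such as M = 500 and D = 100:

    def z_scores(data):
        n = len(data)
        mean = sum(data) / n
        s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
        return [(x - mean) / s for x in data]        # mean 0, standard deviation 1

    def rescale(data, M, D):
        return [z * D + M for z in z_scores(data)]   # mean M, standard deviation D, same shape

    raw = [12, 15, 9, 20, 14]
    print([round(v, 1) for v in rescale(raw, M=500, D=100)])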
You have heard the terms z value, z test, z transformation, and z score. Do all of these terms
mean the same thing? Certainly not:
The z value refers to the critical value (a point on the horizontal axes) of the Normal (0, 1)
density function, for a given area to the left of that z-value.
The z test refers to the procedures for testing the equality of mean (s) of one (or two)
population(s).
Pearson coined the term"standard deviation" sometime near 1900. The idea of using squared
deviations goes back to Laplace in the early 1800's.
Finally, notice again that transforming raw scores to z scores does NOT normalize the data.
Computation of Descriptive Statistics for Grouped Data: One of the most common ways
to describe a single variable is with a frequency distribution. A histogram is a graphical
presentation of an estimate of the frequency distribution of the population. Depending upon
the particular variable, all of the data values may be represented, or you may group the values
into categories first (e.g., by age); it would usually not be sensible to determine the
frequencies for each individual value. Rather, the values are grouped into ranges, and the
frequency is then determined. Frequency distributions can be depicted in two ways: as a table, or
as a graph that is often referred to as a histogram or bar chart. The bar chart is often used to show
the relationship between two categorical variables.
Grouped data are derived from raw data, and consist of frequencies (counts of raw values)
tabulated with the classes in which they occur. The Class Limits represent the largest (Upper)
and lowest (Lower) values which the class will contain. The formulas for the descriptive
statistics become much simpler for grouped data, as shown below for the Mean, Variance,
and Standard Deviation, respectively, where f is the frequency of each class and n is the
total frequency:
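The formulas themselves are not reproduced here; the sketch below uses the standard grouped-data versions (assumed here, since the text does not spell them out), in which each class is represented by its midpoint m and weighted by its frequency f:

    classes = [((0, 10), 5), ((10, 20), 12), ((20, 30), 8)]   # (class limits, frequency)

    n = sum(f for _, f in classes)                            # total frequency
    midpoints = [(lo + hi) / 2 for (lo, hi), _ in classes]
    freqs = [f for _, f in classes]

    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
    standard_deviation = variance ** 0.5

    print(round(mean, 2), round(variance, 2), round(standard_deviation, 2))   # 16.2 52.67 7.26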
Selecting Among the Quartile Deviation, Mean Absolute Deviation, and Standard
Deviation
Fact No. 1
The Quartile Deviation: The quartile deviation is also easy to calculate and to understand. However, it is unreliable if there are gaps in the data around the quartiles.
The Mean Absolute Deviation: The mean absolute deviation has the advantage of giving equal weight to the deviation of every value from the mean or median.
The Standard Deviation: The standard deviation is usually more useful and better adapted to further analysis than the mean absolute deviation.

Fact No. 2
The Quartile Deviation: It depends on only two values, which include the middle half of the items.
The Mean Absolute Deviation: Therefore, it is a more sensitive measure of dispersion than those described above and ordinarily has a smaller sampling error.
The Standard Deviation: It is more reliable as an estimator of the population dispersion than other measures, provided the distribution is normal.

Fact No. 3
The Quartile Deviation: It is usually superior to the range as a rough measure of dispersion.
The Mean Absolute Deviation: It is also easier to compute and to understand, and is less affected by extreme values than the standard deviation.
The Standard Deviation: It is the most widely used measure of dispersion and the easiest to handle algebraically.

Fact No. 4
The Quartile Deviation: It may be determined in an open-end distribution, or one in which the data may be ranked but not measured quantitatively.
The Mean Absolute Deviation: Unfortunately, it is difficult to handle algebraically, since minus signs must be ignored in its computation.
The Standard Deviation: Compared with the others, it is harder to compute and more difficult to understand.

Fact No. 5
The Quartile Deviation: It is also useful in badly skewed distributions, or those in which other measures of dispersion would be warped by extreme values.
The Mean Absolute Deviation: Its main application is in modeling accuracy for comparative forecasting techniques.
The Standard Deviation: It is generally affected by extreme values that may be due to skewness of the data.
You might like to use the Descriptive Sampling Statistics JavaScript in performing some
numerical experimentation for validating the above assertions for a deeper understanding.
The pair of statistical measures, skewness and kurtosis, are measuring tools used in
selecting a distribution (or distributions) to fit your data. To make an inference with respect to the
population distribution, you may first compute the skewness and kurtosis of your random
sample from the entire population. Then, by locating the point with these coordinates on the
widely used skewness-kurtosis chart, guess a couple of possible distributions to fit your data.
Finally, you might use a goodness-of-fit test to rigorously come up with the best candidate
for fitting your data. Removing outliers improves the accuracy of both skewness and kurtosis.
Skewness: Skewness is a measure of the degree to which the sample population deviates
from symmetry with the mean at the center.
Skewness will take on a value of zero when the distribution is a symmetrical curve. A
positive value indicates the observations are clustered more to the left of the mean, with most
of the extreme values to the right of the mean. A negative skewness indicates clustering to the
right. In this case we have: Mean ≤ Median ≤ Mode. The reverse order holds for
observations with positive skewness.
Kurtosis: Kurtosis is a measure of the relative peakedness of the curve defined by the
distribution of the observations.
The standard normal distribution has a kurtosis of +3. A kurtosis larger than 3 indicates that the
distribution is more peaked than the standard normal distribution.
A value of less than 3 for kurtosis indicates that the distribution is flatter than the standard
normal distribution.
These inequalities hold for any probability distribution having finite skewness and kurtosis.
In the Skewness-Kurtosis Chart, you notice two useful families of distributions, namely the
beta and gamma families.
The Beta-Type Density Function: Since the beta density has both a shape and a scale
parameter, it describes many random phenomena, provided the random variable is between [0,
1]. For example, when both parameters are integers, the beta distribution is closely related to the
binomial probability function.
Applications: A basic distribution of statistics for variables bounded at both sides; for
example x between [0, 1]. The beta density is useful for both theoretical and applied
problems in many areas. Examples include distribution of proportion of population located
between lowest and highest value in sample; distribution of daily per cent yield in a
manufacturing process; description of elapsed times to task completion (PERT). There is also
a relationship between the Beta and Normal distributions. The conventional calculation is that
given a PERT Beta with highest value as b, lowest as a, and most likely as m, the equivalent
normal distribution has a mean and mode of (a + 4m + b)/6 and a standard deviation of (b -
a)/6.
Comments: Uniform, right triangular, and parabolic distributions are special cases. To
generate beta, generate two random values from a gamma, g1, g2. The ratio g1/(g1 +g2) is
distributed like a beta distribution. The beta distribution can also be thought of as the
distribution of X1 given (X1+X2), when X1 and X2 are independent gamma random
variables.
Gamma-Type Density Function: Some random variables are always non-negative. The
density function associated with these random variables often is adequately modeled as the
gamma density function. The Gamma-Type Density Function has both a shape and a scale
parameter. With both the shape and scale parameters equal to 1, the result is the exponential
density function. The chi-square density is also a special case of the gamma density function,
with scale parameter equal to 2 and shape parameter equal to half the degrees of freedom.
Applications: A basic distribution of statistics for variables bounded at one side ; for
example x greater than or equal to zero. The gamma density gives distribution of time
required for exactly k independent events to occur, assuming events take place at a constant
rate. Used frequently in queuing theory, reliability, and other industrial applications.
Examples include distribution of time between re-calibrations of instrument that needs re-
calibration after k uses; time between inventory restocking, time to failure for a system with
standby components.
Comments: Erlangian, Exponential, and Chi-square distributions are special cases. The
negative binomial is the discrete analogue of the gamma distribution.
What is the distribution of the product of sample observations from the uniform (0, 1)
distribution? Like many problems with products, this becomes a familiar problem when turned
into a problem about sums. If X is uniform (for simplicity of notation make it U(0, 1)), then
Y = -log(X) is exponentially distributed, so minus the log of the product of X1, X2, ..., Xn is the
sum of Y1, Y2, ..., Yn, which has a gamma (scaled chi-square) distribution. Thus -log of the
product follows a gamma density with shape parameter n and scale 1.
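As a quick check of this claim, the following sketch (assuming NumPy, with the illustrative choice n = 5) simulates the product of n uniforms and verifies that -log of the product behaves like a Gamma(n, 1) variate, whose mean and variance both equal n.

import numpy as np

rng = np.random.default_rng(1)
n = 5                                    # observations per product (illustrative)
u = rng.uniform(size=(200_000, n))       # many replications of n uniforms

y = -np.log(u.prod(axis=1))              # -log of the product = sum of exponentials

# Gamma(shape=n, scale=1) has mean n and variance n
print(y.mean(), y.var())                 # both should be close to 5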
Log-normal Density Function:
Applications: Model for a process arising from many small multiplicative errors. Appropriate
when the value of an observed variable is a random proportion of the previously observed
value.
The lognormal distribution is widely used in situations where values are positively skewed
(where the distribution has a long right tail; negatively skewed distributions have a long left
tail; a normal distribution has no skewness). Examples of data that "fit" a lognormal
distribution include financial security valuations and real estate property valuations. Financial
analysts have observed that stock prices are usually positively skewed rather than
normally (symmetrically) distributed: a stock price cannot fall below the lower limit of zero
but may increase to any price without limit. Similarly, healthcare costs exhibit positive
skewness since unit costs cannot be negative; for example, there cannot be a negative cost for
services in a capitation contract. The lognormal distribution often describes such cost data well.
In the case where the data are log-normally distributed, the Geometric Mean acts as a better
data descriptor than the mean. The more closely the data follow a log-normal distribution, the
closer the geometric mean is to the median, since the log re-expression produces a
symmetrical distribution.
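A small simulation makes the point concrete. The sketch below (assuming NumPy; the lognormal parameters are illustrative) compares the arithmetic mean, the median, and the geometric mean of positively skewed data.

import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # positively skewed data

geometric_mean = np.exp(np.mean(np.log(x)))
print(np.mean(x))        # arithmetic mean, pulled up by the long right tail (~1.65)
print(np.median(x))      # median (~1.0)
print(geometric_mean)    # geometric mean, close to the median (~1.0)

The geometric mean tracks the median because taking logs turns the lognormal sample into a symmetric (normal) one.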
Further Reading:
Snell J., Introduction to Probability, Random House, 1987. Read section 4.2 for a link between beta and F distributions
(with the advantage that tables are easy to find).
Tabachnick B., and L. Fidell, Using Multivariate Statistics, HarperCollins, 1996. Has a good discussion on applications and
significance tests for skewness and kurtosis.
A Numerical Example: Given the following small (n = 4) data set, compute the descriptive
statistics: x1 = 1, x2 = 2, x3 = 3, and x4 = 6.

i     xi    (xi - x̄)   (xi - x̄)²   (xi - x̄)³   (xi - x̄)⁴
1     1     -2          4            -8           16
2     2     -1          1            -1           1
3     3      0          0             0           0
4     6      3          9            27           81
Sum   12     0          14           18           98
You might like to use Descriptive Statistics to check your hand computation.
Deviations about the mean of a distribution are the basis for most of the statistical tests we
will learn. Since we are measuring how much a set of scores is dispersed about the mean μ,
we are measuring variability. We can calculate the deviations about the mean and express
them as the variance σ² or the standard deviation σ. It is very important to have a firm grasp of this
concept because it will be a central concept throughout your statistics course.
Both the variance σ² and the standard deviation σ measure variability within a distribution. The standard
deviation is a number that indicates how much, on average, each of the values in the
distribution deviates from the mean (or center) of the distribution. Keep in mind that the
variance σ² measures the same thing as the standard deviation σ (dispersion of scores in a
distribution). The variance σ², however, is the average of the squared deviations about the mean μ.
Thus, the variance σ² is the square of the standard deviation σ.
The expected value and the variance of the statistic X̄ are μ and σ²/n, respectively.
The expected value and variance of the statistic S² are σ² and 2σ⁴/(n - 1), respectively.
X̄ and S² are the best estimators for μ and σ². They are Unbiased (you may update your
estimate); Efficient (they have the smallest variation among other estimators); Consistent
(increasing the sample size provides a better estimate); and Sufficient (you do not need to have
the whole data set; what you need are Σxi and Σxi² for the estimations). Note also that the above
variance of S² is justified only in the case where the population distribution tends to be normal;
otherwise one may use bootstrapping techniques.
In general, it is believed that the pattern mode ≤ median ≤ mean holds for positively skewed
data sets, and the opposite pattern for negatively skewed data sets. However, there are
counterexamples. For example, in the following 23 numbers, mean = 2.87 and median = 3 (so the
mean is below the median), yet the data are positively skewed:
4, 2, 7, 6, 4, 3, 5, 3, 1, 3, 1, 2, 4, 3, 1, 2, 1, 1, 5, 2, 2, 3, 1
and the following 10 numbers have mean = median = mode = 4, yet the data set is left
skewed:
1, 2, 3, 4, 4, 4, 5, 5, 6, 6.
Note also that different commercial software packages do not compute skewness and kurtosis
in the same way; their formulas differ by small-sample adjustments.
There is no easy way to determine confidence intervals about a computed skewness or
kurtosis value from a small to medium sample. The literature gives tables based on
asymptotic methods for sample sets larger than 100, and for normal distributions only.
You may have noticed that, when applying the above numerical example in some computer packages
such as SPSS, the skewness and the kurtosis differ from what we have computed. For
example, the SPSS output for the skewness is 1.190. However, for a large sample size n, the
results are essentially identical.
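Below is a minimal sketch (plain Python, no package assumed) that reproduces both conventions on the n = 4 data set above: the unadjusted moment-based skewness, and the small-sample-adjusted value 1.190 of the kind reported by packages such as SPSS. The adjustment factor [n(n-1)]½/(n-2) is the usual small-sample correction.

import math

x = [1, 2, 3, 6]
n = len(x)
xbar = sum(x) / n                                   # 3.0

m2 = sum((v - xbar) ** 2 for v in x) / n            # 14/4 = 3.5
m3 = sum((v - xbar) ** 3 for v in x) / n            # 18/4 = 4.5

g1 = m3 / m2 ** 1.5                                 # unadjusted skewness, about 0.687
G1 = g1 * math.sqrt(n * (n - 1)) / (n - 2)          # adjusted skewness, about 1.190
print(round(g1, 3), round(G1, 3))

For large n the adjustment factor tends to 1, which is why the two conventions agree for large samples.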
The following figure depicts a typical relationship between the cumulative distribution
function (cdf) and the density (for continuous random variables),
All characteristics of the population are well described by either of these two functions. The
figure also illustrates their applications in determining the (lower) percentile measures
denoted by P:
among other useful information. Notice that the probability P is the area under the density
function curve to the left of x, which is numerically equal to the height of the cdf curve at the point x.
Both functions can be estimated by smoothing the empirical (i.e., observed) cumulative step-
function, and smoothing the histogram constructed from a random sample.
Note that almost all of the statistics we have covered up to now can be obtained, and
understood more deeply, by using the Empirical Distribution Function
JavaScript. You may like to use this JavaScript to perform some numerical
experimentation for a deeper understanding.
Further Readings:
Brown B., F. Spears, and L. Levy, The log F: A distribution for all seasons, Computational Statistics,
17(1), 47-58, 2002.

Introduction
Probability has an exact technical meaning -- well, in fact it has several, and
there is still debate as to which term ought to be used. However, for most
events for which probability is easily computed, e.g., the probability of getting a
four [::] when rolling a die, almost all agree on the actual value (1/6), if
not on the philosophical interpretation. A probability is always a number between
0 and 1. Zero is not "quite" the same thing as impossibility: it is possible
that, "if" a coin were flipped infinitely many times, it would never show "tails",
yet the probability of an infinite run of heads is 0. One is not "quite" the same
thing as certainty, but close enough.
Aside from their value in betting, odds allow one to specify a small probability
(near zero) or a large probability (near one) using large whole numbers (1,000
to 1 or a million to one). Odds magnify small probabilities (or large
probabilities) so as to make the relative differences visible. Consider two
probabilities: 0.01 and 0.005. They are both small. An untrained observer
might not realize that one is twice as much as the other. But if expressed as
odds (99 to 1 versus 199 to 1) it may be easier to compare the two situations
by focusing on large whole numbers (199 versus 99) rather than on small
ratios or fractions.
How to Assign Probabilities?
P(X) = (Number of times the event occurred) / (Total number of opportunities for the
event to occur).
Note that this relative-frequency approach to probability is based on the idea that what has
happened in the past will continue to hold.
Further Reading:
Delbecq, A., Group Techniques for Program Planning, Scott Foresman, 1975.
1. Addition: When two or more events will happen at the same time, and the events are
not mutually exclusive, then:
P (X or Y) = P (X) + P (Y) - P (X and Y)
Notice that the equation P(X or Y) = P(X) + P(Y) - P(X and Y) contains two special
events: the event (X and Y), which is the intersection of the events X and Y, and
the event (X or Y), which is their union (i.e., either/or). Although
this is very simple, it says relatively little about how event X influences event Y and
vice versa. If P(X and Y) is 0, indicating that events X and Y do not intersect (i.e.,
they are mutually exclusive), then we have P(X or Y) = P(X) + P(Y). On the other
hand, if P(X and Y) is not 0, then there are interactions between the two events X and
Y, often a physical interaction between them. The subtraction of the P(X and Y) term
is what prevents one from simply adding the two probabilities in the relationship
P(X or Y) = P(X) + P(Y) - P(X and Y). For three events:
P(A or B or C) =
P(A) + P(B) + P(C) - P(A and B) - P(A and C) - P(B and C) + P(A and B and C)
2. Special Case of Addition: When two or more events will happen at the same time,
and the events are mutually exclusive, then:
P(X or Y) = P(X) + P(Y)
3. General Multiplication Rule: When two or more events will happen at the same
time, and the events are dependent, then the general multiplication rule is used
to find the joint probability:
P(X and Y) = P(X) × P(Y | X) = P(Y) × P(X | Y)
4. Special Case of Multiplication Rule: When two or more events will happen at the
same time, and the events are independent, then the special multiplication rule
is used to find the joint probability:
P(X and Y) = P(X) × P(Y),
which is equivalent to the two conditions:
P(X|Y) = P(X),
and
P(Y|X) = P(Y)
Bayes' rule, P(X | Y) = P(Y | X) P(X) / P(Y), provides the posterior probability [i.e., P(X | Y)],
sharpening the prior probability [i.e., P(X)] with the availability of accurate and relevant
information expressed in probabilistic terms.
An Application: Suppose two machines, A and B, produce identical parts. Machine A has
probability 0.1 of producing a defective each time, whereas Machine B has probability 0.4 of
producing a defective. Each machine produces one part. One of these parts is selected at
random, tested, and found to be defective. What is the probability that it was produced by
Machine B?
Probability tree diagrams depict events or sequences of events as branches of a tree. Tree
diagrams are useful for visualizing the conditional probabilities:
The probabilities at the end of each branch are the probability that events leading to that end
will happen simultaneously. The above tree diagram indicates that the probability of a part
testing Good is 9/20 + 6/20 = 3/4, therefore the probability of Bad is 1/4. Thus, P(made by B |
it is bad) = (4/20) / (1/4) = 4/5.
Now using the Bayes' Rule we are able to obtain useful information such as:
P(it is bad | made by B) = P(it is bad & made by B)/P(made by B) = (4/20)/(1/2) = 2/5.
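A few lines of Python (a sketch using only the probabilities stated in the example, with each machine assumed equally likely to have produced the selected part) reproduce the tree-diagram results.

# Two machines, each assumed equally likely to have produced the tested part
p_machine = {"A": 0.5, "B": 0.5}
p_defect_given_machine = {"A": 0.1, "B": 0.4}

# Total probability of a defective part
p_defect = sum(p_machine[m] * p_defect_given_machine[m] for m in p_machine)

# Bayes' rule: P(made by B | defective)
p_B_given_defect = p_machine["B"] * p_defect_given_machine["B"] / p_defect
print(p_defect)            # 0.25, i.e., 1/4
print(p_B_given_defect)    # 0.8, i.e., 4/5 as in the tree diagram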
Venn Diagram: A diagram used, in general, to represent sets and subsets; it is a way of
displaying how different sets of objects overlap. It was devised by John Venn, an English
mathematician. A Venn diagram can be used as a computational probability tool, similar to the
probability tree diagram. The following are Venn diagram representations for two of the
above probability rules:
An Application: A survey shows that 70% of all convenience store shoppers buy milk and
55% buy bread. If 45% buy both bread and milk, what percentage buy neither?
Solution: The Venn diagram model for this problem is depicted below:
The solution is readily available from the above Venn diagram model, i.e., the regions "milk
only" (25%), "both" (45%), and "bread only" (10%) cover 80% of shoppers, so 20% buy neither.
Another approach is to use both the Complement Probability Rule and the Addition
Probability Rule, i.e., P(neither) = 1 - P(milk or bread) = 1 - (0.70 + 0.55 - 0.45) = 1 - 0.80 = 0.20.
Exercise Your Knowledge on the following probabilistic problem: An urn contains 4 red
balls (representing, say, defective items) and 8 white balls (representing, say, non-defective
items), as depicted below:
An Urn Model
Suppose 2 balls are drawn at random. Use the following tree diagram, which is a probabilistic
model for this experiment, and verify the solutions to the following questions, with the answer
given in brackets at the end of each question:
A Tree Diagram as a Probabilistic Model
Another Question for You: A fair coin is flipped twice; what is the conditional probability
that both flips land on heads, given:
Further Reading:
Ross Sh., A First Course in Probability, Prentice Hall, 2001.
Many disciplines and sciences require the answer to the question: How Many? In finite
probability theory we need to know how many outcomes there would be for a particular
event, and we need to know the total number of outcomes in the sample space.
A Fundamental Result: If an operation consists of two steps, of which the first can be done
in n1 ways and, for each of these, the second can be done in n2 ways, then the entire operation
can be done in a total of n1 × n2 ways.
This simple rule can be generalized as follows: If an operation consists of k steps, of which the
first can be done in n1 ways, for each of these the second step can be done in n2 ways, for
each of these the third step can be done in n3 ways, and so forth, then the whole operation can
be done in n1 × n2 × n3 × ... × nk ways.
Numerical Example: A quality control inspector wishes to select one part for inspection
from each of four different bins containing 4, 3, 5 and 4 parts respectively. The total number
of ways that the parts can be selected is 4×3×5×4 or 240 ways.
Factorial Notation: the notation n! (read as "n factorial") means, by definition, the product:
n! = (n)(n-1)(n-2)(n-3)...(3)(2)(1).
Notice that, by convention, 0! = 1. For example, 6! = 6×5×4×3×2×1 = 720.
Permutations Example: How many permutations (ordered arrangements) are there of the
letters a, b, and c? In this case it is easy to make a list: abc, acb, bac, bca, cab, cba -- six in all.
The number of ways of lining up k objects at a time from n distinct objects is denoted by nPk,
and by the preceding argument we have:
nPk = (n)(n-1)(n-2)(n-3)...(n-k+1)
Therefore, the number of permutations of n distinct objects taken k at a time can be written
as:
nPk = n! / (n - k)!
Combinations: There are many problems in which we are interested in determining the
number of ways in which k objects can be selected from n distinct objects without regard to
the order in which they are selected. Such selections are called combinations or k-sets. It may
help to think of combinations as a committee. The key here is without regard for order.
The number of combinations of k objects from a set with n objects is nCk. For example, the
combinations of {1,2,3,4} taken k = 2 at a time are {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, for
a total of 6 = 4! / [2!(4-2)!] subsets.
You may ask, what is the relation of combinations to permutations? Consider the subsets of
{1,2,3,4} taken k = 3 at a time: each such subset forms 3! = 6 distinct permutations, and there
are 4 such subsets, so 6 × 4 = 24, which equals 4P3. If we use the notation
4C3 to indicate the number of combinations of 4 distinct objects taken 3 at a time, then by the
above we have:
4C3 = 4P3 / 3! = 4! / [3!(4-3)!] = 4.
Notice that:
nCk = nC(n-k)
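The counting formulas above are easy to verify with Python's standard library (math.perm, math.comb, and itertools are assumed available, Python 3.8 or later); a short sketch:

import math
from itertools import combinations

n, k = 4, 3
print(math.perm(n, k))          # 24 ordered arrangements, i.e., 4P3
print(math.comb(n, k))          # 4 unordered selections, i.e., 4C3
print(math.comb(n, n - k))      # 4 as well, since nCk = nC(n-k)

# List the 2-element subsets of {1, 2, 3, 4} mentioned above
print(list(combinations([1, 2, 3, 4], 2)))   # 6 subsets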
An Application: One of the fundamental aspects of economic activity is a trade in which one
party provides another party something, in return for which the second party provides the first
something else; i.e., barter economics.
The invention of money in Europe during the 16th century was a necessary tool of trading. The
usage of money greatly simplifies the barter system of trading, thus lowering transaction costs.
If a society produces 100 different goods, there are:
100C2 = 100! / [2!(100-2)!] = 4,950 distinct pairs of goods, and hence 4,950 exchange ratios to
keep track of under barter; with money, only 100 prices are needed.
As another application, consider the following probabilistic problem. Suppose there are at
most 10 defective items in a batch of size 150. You have shipped 15 items to one of your
customers. What is the chance that the customer would find at least one defective item?
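As a sketch of one way to answer this, assume the worst case of exactly 10 defective items in the batch; then the 15 shipped items are a sample without replacement, and the chance of at least one defective is one minus the chance of none.

import math

N, D, n = 150, 10, 15    # batch size, defectives (worst case), items shipped

# P(no defective among the 15 shipped) under sampling without replacement
p_none = math.comb(N - D, n) / math.comb(N, n)
p_at_least_one = 1 - p_none
print(round(p_at_least_one, 4))   # roughly 0.66 under this assumption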
Permutation with Repetitions: How many different letter arrangements can be formed
using the letters P E P P E R? Since there are 6 letters, with P repeated 3 times and E repeated
2 times, the answer is 6! / (3! 2! 1!) = 60.
Joint Probability Function: Let us have two discrete random variables X and Y, taking
values xi, i = 1, ..., m, and yj, j = 1, ..., n, respectively. The function
P(X = xi, Y = yj) = pij
gives the joint probability of each pair of values.
As an example, consider two competitive stocks (A and B). Suppose the estimated rates of
return of stocks A and B, denoted RA and RB, are given as follows (respectively):
Joint Probability
Numerical Example: Find the marginal densities of RA and RB from the joint probability
table.
To calculate the marginal distribution of RB, simply look at the table and add the probabilities
in each column.
To obtain the marginal distribution of RA, add the probabilities in each row. The marginal
distributions of A and B are shown at the right and the bottom margins of the table below,
respectively:
                    RB
           0.9    1.0    1.1    Marginal
RA   0.8   0.1    0.1    0.1    0.3
     1.0   0.1    0.1    0.1    0.3
     1.2   0.1    0.2    0.1    0.4
Marginal   0.3    0.4    0.3    1.0
It is clear that a given joint distribution determines the marginal distributions uniquely.
However, the converse is not true; a given marginal distribution can come from many
different joint distributions. The function that links the marginal densities and the joint
density is called the copula. In practice, one picks the marginal distributions first and then
selects an appropriate copula to achieve the right amount of dependency among the
individual random variables.
Conditional Probability: The conditional probability of the event A given the event B is
defined by:
P(A | B) = P(A and B) / P(B), if P(B) > 0,
and is left undefined when P(B) = 0. The event (A and B) means "the event A occurs and the
event B occurs."
Stochastic Independence: The events A and B are stochastically independent when P(A | B)
does not depend on the event B, that is, when P(A | B) = P(A).
As an example, suppose we wish to compute the probability that the return on A is medium
or high (RA ≥ 1.0), given that the return on B is medium or high (RB ≥ 1.0).
We need to calculate:
P(RA ≥ 1.0 | RB ≥ 1.0) = P(RA ≥ 1.0 and RB ≥ 1.0) / P(RB ≥ 1.0)
Now referring to the table below:

                    RB
           0.9    1.0    1.1
RA   0.8   0.1    0.1    0.1
     1.0   0.1    0.1    0.1
     1.2   0.1    0.2    0.1

P(RA ≥ 1.0 and RB ≥ 1.0) = 0.5 and P(RB ≥ 1.0) = 0.7, and consequently:
P(RA ≥ 1.0 | RB ≥ 1.0) = 0.5 / 0.7 = 5/7 ≈ 0.71.
An Application: Determine the number of elementary outcomes and then find the probability
of the event (RA + RB)/2 < 1.0.
Note that each return takes three values and is allowed to move independently of the other
return, which means we have nine elementary outcomes. The elementary outcomes that belong
to the event (RA + RB)/2 < 1.0 are the three cells with RA = 0.8 together with the cell
(RA = 1.0, RB = 0.9), each with probability 0.1:

                    RB
           0.9    1.0    1.1
RA   0.8   0.1    0.1    0.1
     1.0   0.1    0.1    0.1
     1.2   0.1    0.2    0.1

Consequently, P[(RA + RB)/2 < 1.0] = 0.1 + 0.1 + 0.1 + 0.1 = 0.4.
For estimation of the expected values, variances, etc., you may use the Bivariate
Distributions JavaScript, or the sketch below.
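The sketch below (assuming NumPy) reproduces the marginal distributions, the conditional probability P(RA ≥ 1.0 | RB ≥ 1.0) = 5/7, and P[(RA + RB)/2 < 1.0] = 0.4 directly from the joint table.

import numpy as np

ra_values = np.array([0.8, 1.0, 1.2])        # rows
rb_values = np.array([0.9, 1.0, 1.1])        # columns
joint = np.array([[0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1],
                  [0.1, 0.2, 0.1]])

print(joint.sum(axis=1))    # marginal of RA: [0.3, 0.3, 0.4]
print(joint.sum(axis=0))    # marginal of RB: [0.3, 0.4, 0.3]

# P(RA >= 1.0 and RB >= 1.0) / P(RB >= 1.0)
num = joint[np.ix_(ra_values >= 1.0, rb_values >= 1.0)].sum()   # 0.5
den = joint[:, rb_values >= 1.0].sum()                          # 0.7
print(num / den)                                                # 0.714..., i.e., 5/7

# P[(RA + RB)/2 < 1.0]
mask = (ra_values[:, None] + rb_values[None, :]) / 2 < 1.0
print(joint[mask].sum())                                        # 0.4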
Further Reading:
Ross Sh., Introduction to Probability Models, Academic Press, 2002.
Mutually Exclusive (ME): Events A and B are ME if both cannot occur simultaneously. That
is, P[A and B] = 0.
Independence (Ind.): Events A and B are independent if having the information that B has
already occurred does not change the probability that A will occur. That is, P[A given B
occurred] = P[A].
If two events are ME, they are also dependent (assuming each has positive probability):
P[A given B] = P[A and B] / P[B], and since P[A and B] = 0 (by ME), then P[A given B] = 0 ≠ P[A].
Similarly, P[B given A] = 0 ≠ P[B].
If two events are independent (with positive probabilities), then they are not ME.
If two events are dependent, then they may or may not be ME.
If two events are not ME, then they may or may not be independent.
The following Figure contains all possibilities. The notations used in this table are as follows:
X means does not imply, question mark ? means it may or may not imply, while the check
mark means it implies.
Notice that the (probabilistic) pairwise independency and mutual independency for a
collection of events A1,..., An are two different notions.
Further Reading:
Ross Sh., A First Course in Probability, Prentice Hall, 2001.
What Is so Important About the Normal Distributions?
The term"normal" possibly arose because of the various attempts made to establish this
distribution as the underlying rule governing all continuous variables. These attempts were
based on false premises and consequently failed. Nonetheless, the normal distribution rightly
occupies a preeminent place in the field of probability. In addition to portraying the
distribution of many types of natural and physical phenomena (such as the heights of men,
diameters of machined parts, etc.), it also serves as a convenient approximation of many other
distributions which are less tractable. Most importantly, it describes the manner in which
certain estimators of population characteristics vary from sample to sample and thereby
serves as the foundation upon which much statistical inference from a random sample to the
population is made.
Normal distribution (also called Gaussian) curves, which have a bell-shaped appearance (the
distribution is sometimes even referred to as the "bell-shaped curve"), are very important in
statistical analysis. In any normal distribution, observations are distributed symmetrically around
the mean: 68% of all values under the curve lie within one standard deviation of the mean and
95% lie within two standard deviations.
There are many reasons for their popularity. The following are the most important reasons for
its applicability:
1. One reason the normal distribution is important is that a wide variety of naturally
occurring random variables such as heights and weights of all creatures are
distributed evenly around a central value, average, or norm (hence, the name normal
distribution). Although the distributions are only approximately normal, they are
usually quite close.
Whenever there are too many factors influencing the outcome of a random outcome,
then the underlying distribution is approximately normal. For example, the height of a
tree is determined by the"sum" of such factors as rain, soil quality, sunshine, disease,
etc.
As Francis Galton wrote in 1889, "Whenever a large sample of chaotic elements are
taken in hand and arranged in the order of their magnitude, an unsuspected and most
beautiful form of regularity proves to have been latent all along."
2. Almost all statistical tables are limited by the size of their parameters. However,
when these parameters are large enough one may use normal distribution for
calculating the critical values for these tables. For example, the F-statistic is related to
standard normal z-statistic as follows: F = z2, where F has (d.f.1 = 1, and d.f.2 is the
largest available in the F-table). For more, visit the Relationships among Common
Distributions.
Here is how the approximation is made. First, set μ = np and σ² = npq. To allow for
the fact that the binomial is a discrete distribution, we conventionally use a continuity
correction of 1/2 unit added to or subtracted from X, on the grounds that the
discrete value (x = a) should correspond on a continuous scale to the interval
(a - 1/2) ≤ x ≤ (a + 1/2). Then we compute the value of the standard normal variable by:
z = (x ± 1/2 - μ) / σ,
with the sign chosen according to which end of the interval is needed.
Now one may use the standard normal table for the numerical values.
3. If the mean and standard deviation of a normal distribution are known, it is easy to
convert back and forth from raw scores to percentiles.
4. It has been proven that the underlying distribution is normal if and only if the sample
mean is independent of the sample variance; this property characterizes the normal
distribution. Moreover, many effective transformations can be applied to convert
almost any shaped distribution into an approximately normal one.
5. The most important reason for popularity of normal distribution is the Central Limit
Theorem (CLT). The distribution of the sample averages of a large number of
independent random variables will be approximately normal regardless of the
distributions of the individual random variables. The Central Limit Theorem is a
useful tool when you are dealing with a population with an unknown distribution.
Often, you may analyze the mean (or the sum) of a sample of size n. For example
instead of analyzing the weights of individual items you may analyze the batch of size
n, that is, the packages each containing n items.
6. The sampling distributions of normal populations provide more information than
those of any other distribution. For example, the following standard errors (i.e., having the same
unit as the data) are readily available:
o Standard Error of the Median = (π/(2n))½ S.
o Standard Error of the Standard Deviation = S/(2n)½.
Therefore, the test statistic for the null hypothesis σ = σ0 is Z = (2n)½ (S - σ0)/σ0.
o Standard Error of the Variance = S²[2/(n-1)]½.
o Standard Error of the Interquartile Half-Range (Q) = 1.166 Q/n½.
o Standard Error of the Skewness = (6/n)½.
o Skewness of the Sample Mean = Skewness/n½.
7. The other reason the normal distributions are so important is that the normality
condition is required by almost all kinds of parametric statistical tests. Most
statistical tables, such as the T-table (except its last row), the χ²-table, and the F-tables,
require the normality condition for the population. This condition must be checked
before using these tables; otherwise the conclusion might be wrong.
The sampling distribution is the density (for a continuous statistic, such as an estimated
mean), or probability function (for discrete statistic, such as an estimated proportion).
Derivation of the sampling distribution is the first step in calculating a confidence interval or
carrying out a hypothesis testing for a parameter.
Example: Suppose that x1, ..., xn are a simple random sample from a normally distributed
population with expected value μ and known variance σ². Then the sample mean is normally
distributed with expected value μ and variance σ²/n.
The main idea of statistical inference is to take a random sample from the entire particular
population and then to use the information from the sample to make inferences about the
particular population characteristics such as the mean (measure of central tendency), the
standard deviation (measure of dispersion, spread) or the proportion of units in the
population that have a certain characteristic. Sampling saves money, time, and effort.
Additionally, a sample can provide, in some cases, as much or more accuracy than a
corresponding study that would attempt to investigate an entire population. Careful collection
of data from a sample will often provide better information than a less careful study that tries
to look at everything.
Often, one must also study the behavior of the mean of sample values taken from different
specified populations; e.g., for comparison purposes.
Because a sample examines only part of a population, the sample mean will not exactly equal
the corresponding mean of the population, μ. Thus, an important consideration for those
planning and interpreting sampling results is the degree to which sample estimates, such as
the sample mean, will agree with the corresponding population characteristic.
In practice, only one sample is usually taken. In some cases a small"pilot sample" is used to
test the data-gathering mechanisms and to get preliminary information for planning the main
sampling scheme. However, for purposes of understanding the degree to which sample means
will agree with the corresponding population mean μ, it is useful to consider what would
happen if 10, or 50, or 100 separate sampling studies, of the same type, were conducted. How
consistent would the results be across these different studies? If we could see that the results
from each of the samples would be nearly the same (and nearly correct!), then we would have
confidence in the single sample that will actually be used. On the other hand, seeing that
answers from the repeated samples were too variable for the needed accuracy would suggest
that a different sampling plan (perhaps with a larger sample size) should be used.
A sampling distribution is used to describe the distribution of outcomes that one would
observe from replication of a particular sampling plan.
Know that estimates computed from one sample will be different from estimates that would
be computed from another sample.
Understand that estimates are expected to differ from the population characteristics
(parameters) that we are trying to estimate, but that the properties of sampling distributions
allow us to quantify, based on probability, how they will differ.
Understand that different statistics have different sampling distributions with distribution
shape depending on (a) the specific statistic, (b) the sample size, and (c) the parent
distribution.
Understand the relationship between sample size and the distribution of sample estimates.
Understand that increasing the sample size can reduce the variability in a sampling
distribution.
See that in large samples, many sampling distributions can be approximated with a normal
distribution.
Sampling Distribution of the Mean and the Variance for Normal Populations: Given that the
random variable X is distributed normally with mean μ and standard deviation σ, then for a
random sample of size n: the sample mean X̄ is distributed normally with mean μ and standard
deviation σ/n½, and the quantity (n - 1)S²/σ² has a chi-square distribution with (n - 1) degrees
of freedom.
The central limit theorem (CLT) is a "limit" that is "central" to statistical practice. For
practical purposes, the main idea of the CLT is that the average (center of data) of a sample
of observations drawn from some population is approximately distributed as a normal
distribution if certain conditions are met. In theoretical statistics there are several versions of
the central limit theorem depending on how these conditions are specified. These are
concerned with the types of conditions made about the distribution of the parent population
(population from which the sample is drawn) and the actual sampling procedure.
One of the simplest versions of the central limit theorem stated by many textbooks is: if we
take a random sample of size n from the entire population, then the sample mean, which is a
random variable defined by:
X̄ = Σ xi / n,
has a histogram which converges to a normal distribution shape if n is large enough.
Equivalently, the distribution of the sample mean approaches the normal distribution as the
sample size increases.
Some students have difficulty reconciling their own understanding of the central limit
theorem with some textbook statements. Some textbooks do not emphasize the need for
independent, random samples of a fixed size n (say, more than 30).
The shape of the sampling distribution of the mean becomes increasingly normal as the
sample size n becomes larger. The increasing sample size is what causes the distribution to
become increasingly normal, and the independence condition is what provides the n½ contraction
of the standard deviation.
For proportion data, such as binary 0-1 outcomes, the CLT again applies: the sampling
distribution, while becoming increasingly "bell-shaped", remains confined to the domain [0, 1].
This domain is a dramatic difference from a normal distribution, which has an unbounded
domain. However, as n increases without bound, the "width" of the bell becomes very small, so
the CLT "still works".
It can be shown that, if the parent population has mean μ and a finite standard deviation σ,
then the sampling distribution of the mean has the same mean μ but a smaller standard deviation,
namely σ divided by n½.
You know by now that, whatever the parent population is, the standardized variable Z = (X̄ -
μ)/(σ/n½) will have a distribution with mean 0 and standard deviation 1 under random
sampling. Moreover, if the parent population is normal, then Z is distributed exactly as the
standard normal. The central limit theorem states the remarkable result that, even when the
parent population is non-normal, the standardized variable is approximately normal if the
sample size is large enough. It is generally not possible to state conditions under which the
approximation given by the central limit theorem works and what sample sizes are needed
before the approximation becomes good enough. As a general guideline, statisticians have
used the prescription that, if the parent distribution is symmetric and relatively short-tailed,
then the sample mean more closely approximates normality for smaller samples than if the
parent population is skewed or long-tailed.
Under certain conditions, in large samples, the sampling distribution of the sample mean can
be approximated by a normal distribution. The sample size needed for the approximation to
be adequate depends strongly on the shape of the parent distribution. Symmetry (or lack
thereof) is particularly important.
For a symmetric parent distribution, even if very different from the shape of a normal
distribution, an adequate approximation can be obtained with small samples (e.g., 15 or more
for the uniform distribution). For symmetric, short-tailed parent distributions, the sample
mean more closely approximates normality for smaller sample sizes than if the parent
population is skewed and long-tailed. In some extreme cases (e.g. binomial) sample sizes far
exceeding the typical guidelines (e.g., over 30) are needed for an adequate approximation.
For some distributions without first and second moments (e.g., one is known as the Cauchy
distribution), the central limit theorem does not hold.
For some distributions, extremely large (impractical) samples would be required to approach
a normal distribution. In manufacturing, for example, when defects occur at a rate of less than
100 parts per million, using a Beta distribution yields an honest confidence interval (CI) for the
total defects in the population.
A question for you: Roll two perfectly balanced dice one time; the result will sum to an
integer between 2 and 12. Which sum is most likely? (Hint: what does the CLT imply?)
An Illustration of CLT
Sampling Distribution of the Sample Means: Instead of working with individual scores,
statisticians often work with means. What happens is that several samples are taken, the mean
is computed for each sample, and then the means are used as the data, rather than the individual
scores. The resulting distribution is the sampling distribution of the sample means.
The central limit theorem explains why many distributions tend to be close to the normal
distribution. The key ingredient is that the random variable being observed should be the sum
or mean of many independent identically distributed random variables.
Sampling Distribution of Values (X): Consider the case where a single, fair die is rolled.
Here are the values that are possible and their probabilities:

Value        1     2     3     4     5     6
Probability  1/6   1/6   1/6   1/6   1/6   1/6
Sampling Distribution of the Sample Mean (Xbar): Consider the case where two fair dice
are rolled instead of one.
Here are the sums that are possible and their probabilities:

Sum          2     3     4     5     6     7     8     9     10    11    12
Probability  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

But we are not interested in the sum of the dice; we are interested in the sample mean. We
find the sample mean by dividing the sum by the sample size.

Xbar         1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0
Probability  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Now let us compute the mean and variance of this new random variable Xbar. Its mean is 3.5,
the same as the mean of a single roll, while its variance is (35/12)/2 = 35/24 ≈ 1.46, half the
variance of a single roll.
But if we take repeated samples of the same size from a population and then plot the
means of all those samples, our distribution will look a little better. We call distributions of
sample statistics Sampling Distributions.
The reason for this is that you can get the middle values in many more different ways than the
extremes.
Example: When throwing two dice: 1+6 = 2+5 = 3+4 = 7, but only 1+1 = 2 and only 6+6 =
12. That is: even though you get any of the six numbers equally likely when throwing one
die, the extremes are less probable than middle values in sums of several dice.
One way to see how the central limit theorem works is to look at the distribution of scores from
increasing numbers of dice throws, as below.
In this illustration, the number on the top of each rolled die is an independent random event.
Independent because the result of each die roll does not depend on the result of any previous
roll, and random because, assuming that the die is "fair", the value on the top of the rolled die
cannot be predicted in advance. The sum of their results is the total number of dots on the
tops of all the rolled dice. The bar chart illustrates the distribution of the sum. The
distribution of each independent die roll is flat, not bell-shaped. See for yourself. Roll one die
a bunch of times and watch the bar chart evolve. The distribution of the sum of two
independent die rolls is triangular. Try it and see. What about five dice? What about ten?
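If you prefer to roll the dice in software, the following sketch (assuming NumPy; the sample sizes are illustrative) shows the mean of n dice concentrating around 3.5, with its spread shrinking like 1/√n.

import numpy as np

rng = np.random.default_rng(3)

for n_dice in (1, 2, 5, 10):
    rolls = rng.integers(1, 7, size=(50_000, n_dice))   # 50,000 experiments
    means = rolls.mean(axis=1)
    # The spread of the sample mean shrinks like 1/sqrt(n_dice)
    print(n_dice, round(means.mean(), 3), round(means.std(ddof=0), 3))

# Theoretical single-die mean and standard deviation: 3.5 and sqrt(35/12), about 1.708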
The CLT says that no matter what the distribution of the population looks like, the sampling
distribution will be distributed normally, as long as your sample size is big enough (about
30). The distribution will have a mean equal to the population mean and a standard error
equal to the population standard deviation divided by the square root of the sample size.
The measure of spread that we use for Sampling Distributions is the standard error (SE). The
SE will always be smaller than the population standard deviation since the sampling
distribution is one of sample statistics. Each sample mean will dampen the effect of outliers,
bringing the tails of the sampling distribution in and creating a bigger "lump" in the middle,
centered on the population mean. You can interpret the SE for sampling distributions in the
same way as the standard deviation for populations.
When all of the possible sample means are computed, then the following properties are true:
The mean of the sample means will be the mean of the population
The variance of the sample means will be the variance of the population
divided by the sample size.
The standard deviation of the sample means (known as the standard error of
the mean) will be smaller than the population standard deviation and will be equal to the
standard deviation of the population divided by the square root of the sample
size.
If the population has a normal distribution, then the sample means will have a
normal distribution.
If the population is not normally distributed, but the sample size is sufficiently
large, then the sample means will have an approximately normal distribution.
Some books define sufficiently large as at least 30 and others as at least 31.
Recall that in estimating the population's variance, we used (n-1) rather than n, in the
denominator. The factor (n-1) is called"degrees of freedom."
When we do not know the population's mean, we can still estimate the population variance;
but, now we compute deviations around the sample mean. This introduces an important
constraint because the sum of the deviations around the sample mean is known to be zero. If
we know the value for the first (n-1) deviations, the last one is known. There are only n-1
independent pieces of information in this estimate of variance.
If you study a system with n parameters xi, i =1..., n, you can represent it in an n-dimension
space. Any point of this space shall represent a potential state of your system. If your n
parameters could vary independently, then your system would be fully described in a n-
dimension hyper-volume (for n over 3). Now, imagine you have one constraint between the
parameters (an equation with your n parameters), then your system would be described by a
(n-1)-dimension hyper-surface (for n over 3). For example, in three dimensional space, a
linear relationship means a plane which is 2-dimensional.
In statistics, your n parameters are your n data points. To evaluate the variance, you first need
to estimate the mean. So when you evaluate the variance, you have one constraint on your system
(the expression for the mean), and only (n-1) degrees of freedom remain in your
system.
Therefore, we divide the sum of squared deviations by n-1, rather than by n, when we have
sample data. On average, deviations around the sample mean are smaller than deviations
around the population mean. This is because our sample mean is always in the middle of our
sample scores; in fact, the minimum possible sum of squared deviations for any sample of
numbers is around the mean for that sample of numbers. Thus, if we sum the squared
deviations from the sample mean and divide by n, we have an underestimate of the variance
in the population (which is based on deviations around the population mean).
If we divide the sum of squared deviations by n-1 instead of n, our estimate is a bit larger,
and it can be shown that this adjustment gives us an unbiased estimate of the population
variance. However, for large n, say, over 30, it does not make too much difference if we
divide by n, or n-1.
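A small simulation (a sketch assuming NumPy and a standard normal population with true variance 1) makes the bias visible: dividing by n underestimates the variance by the factor (n - 1)/n, while dividing by n - 1 is unbiased.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 5, 200_000                     # small samples exaggerate the bias
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))   # true variance = 1

var_div_n = samples.var(axis=1, ddof=0).mean()             # divides by n
var_div_n_minus_1 = samples.var(axis=1, ddof=1).mean()     # divides by n - 1

print(var_div_n)             # about 0.8 = (n-1)/n, an underestimate
print(var_div_n_minus_1)     # about 1.0, unbiased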
Degrees of Freedom in ANOVA: You will see the key phrase "degrees of freedom" also
appearing in the Analysis of Variance (ANOVA) tables. If I tell you about 4 numbers, but
don't say what they are, the average could be anything. I have 4 degrees of freedom in the
data set. If I tell you 3 of those numbers, and the average, you can guess the fourth number.
The data set, given the average, has 3 degrees of freedom. If I tell you the average and the
standard deviation of the numbers, I have given you 2 pieces of information, and reduced the
degrees of freedom from 4 to 2. You only need to know 2 of the numbers' values to guess the
other 2.
In an ANOVA table, degree of freedom (df) is the divisor in (Sum of Squared deviations)/df
which will result in an unbiased estimate of the variance of a population.
In general, a degree of freedom d.f. = N - k, where N is the sample size, and k is a small
number, equal to the number of"constraints", the number of"bits of information"
already"used up". As we will see in the ANOVA section, degree of freedom is an additive
quantity; total amounts of it can be "partitioned" into various components. For example,
suppose we have a sample of size 13 and calculate its mean, and then the deviations from the
mean; only 12 of the deviations are free to vary. Once one has found 12 of the deviations, the
thirteenth one is determined.
In a one-way analysis of variance (ANOVA) with g groups, there are three ways of using the
data to estimate the population variance. If all the data are pooled, the conventional SST/(n-1)
would provide an estimate of the population variance.
If the treatment groups are considered separately, the sample means can also be considered as
estimates of the population mean, and thus SSb/(g - 1) can be used as an estimate. The
remaining ("within-group","error") variance can be estimated from SSw/(n - g). This example
demonstrates the partitioning of d.f.:
d.f. total = n - 1 = d.f.(between) + d.f.(within) = (g - 1) + (n - g).
Therefore, the simple "working definition" of d.f. is "sample size minus the number of
estimated parameters". A more complete answer would have to explain why there are
situations in which the degrees of freedom is not an integer. Having said all this, the best
explanation is mathematical: we use d.f. to obtain an unbiased estimate.
In summary, the concept of degrees of freedom is used for the following two different
purposes:
The following Figure demonstrates useful relationships among common statistical tables:
Some widely used applications of the popular statistical tables can be categorized as follows:
T - Table:
Conditions for using this table: Test for randomness of the data is needed before using this
table. Test for normality condition of the population distribution is also needed if the sample
size is small, or it may not be possible to invoke the central limit theorem.
Z - Table:
Note also that, in hypothesis testing concerning the parameter of binomial and Poisson
distributions for large sample sizes, the standard deviation is known under the null
hypotheses. That's why you may use the normal approximations for both of these
distributions.
Conditions for using this table: Test for randomness of the data is needed before using this
table. Test for normality condition of the population distribution is also needed if the sample
size is small, or it may not be possible to invoke the Central Limit Theorem.
Chi-square - Table:
Conditions for using this table: The necessary conditions for using this table for all the
above tests, except for the last one, can be found at Conditions for the Chi-square Based
Tests. The last application requires normality (condition) of the population distribution.
F - Table:
Conditions for using this table: Tests for randomness of the data and normality (condition)
of the populations are needed before using this table for ANOVA. Same conditions must be
satisfied for the residuals in regression analysis.
The following chart summarizes the application of statistical tables with respect to tests of
hypotheses and the construction of confidence intervals for the mean μ and variance σ² in one
population, or for the comparison of two or more populations.
Click on the image to enlarge it and THEN print it.
Selection of an Appropriate Statistical Table
You may like to use Online Statistical Computation in performing most of these tests. The P-
values for the Popular Distributions Web site provides P-values useful in major statistical
testing. The results are more accurate than those that can be obtained (by interpolation) from
the statistical tables of your textbook.
Further Reading:
Balakrishnan N., and V. Nevzorov, A Primer on Statistical Distributions, Wiley, 2003.
Evans M., N. Hastings, and B. Peacock, Statistical Distributions, Wiley, 2000.
Kanji G., 100 Statistical Tests, Sage Publisher, 1995.
The presentation of statistical tables is not universal. Some statistical textbook authors
give tabular values for right-tail probabilities, while others prefer left-tail probabilities.
Even within each of these groups you will find differences in how each
table is presented; there is no unified format. This lack of uniformity often
confuses students who are learning statistics.
The following presents some numerical examples of common statistical tables with some
applications. You may like using The P-values for the Popular Distributions JavaScript.
Binomial Probability
X ~ B(n, p) reads: the random variable X has a binomial distribution with parameters n trials
and probability of success p.
Example: Find the probability of at most k = 3 successes from B(n = 7, p = 0.4). Using any
binomial table, one should get:
P[X ≤ 3] = 0.7102.
Using The P-values for the Popular Distributions JavaScript, one gets the same value.
Questions for you: Which of the following two events is more likely to happen: getting
exactly 6 heads when tossing a fair coin (i.e., p = 1/2) n = 10 times, or when tossing it n = 20
times? Why?
Application: A traveling salesman has found that the probability of a sale on a single contact is
0.02. If the salesman contacts 200 prospects, find the probability that he will make at least
one sale (see the sketch below).
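Both calculations can be sketched in a few lines of Python (the helper binom_pmf below is introduced here for illustration only).

from math import comb

def binom_pmf(k, n, p):
    # Binomial probability of exactly k successes in n trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P[X <= 3] for X ~ B(n = 7, p = 0.4)
print(round(sum(binom_pmf(k, 7, 0.4) for k in range(4)), 4))    # 0.7102

# Salesman: P(at least one sale in 200 contacts with p = 0.02)
print(round(1 - (1 - 0.02) ** 200, 4))                          # about 0.9824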
Normal Density Function
X ~ N(0, 1) reads: the random variable X is distributed normally with mean 0 and variance 1.
In general, if X ~ N(μ, σ), then Z = (X - μ)/σ ~ N(0, 1).
Similarly, for X ~ N(μ = 1, σ = 2): P(X ≥ 2.1) = P(Z ≥ (2.1 - 1)/2) = P(Z ≥ 0.55) = 0.5 - 0.2088 = 0.2912.
Using The P-values for the Popular Distributions JavaScript, the 2p-value can be obtained directly.
Questions for you: Compute P(X ≥ 3), P(1 ≤ X ≤ 4), and P(X ≤ 1); find the value x such
that P(X ≥ x) = 0.4515.
Applications:
A Fact: Given X ~ N(μ, σ) and a random realization of size n: x1, x2, ..., xn, the sample mean
X̄ is distributed N(μ, σ/n½), so Z = (X̄ - μ)/(σ/n½) ~ N(0, 1).
1. Test of hypothesis concerning the mean, variance known:
Given n = 4 and x̄ = 492, test H0: μ = 500 at significance level α = 0.05, given σ = 16.
The Z-statistic is Z = [492 - 500] / [16 / 4½] = -1; however, the tabulated critical Z-value is
Z.025 = 1.96.
Conclusion: Since |Z| = 1 < 1.96, there is no reason to reject H0.
Question for you: Given the same sampling information, test H0: μ = 505 vs. Ha: μ ≠ 505.
2. Setting a confidence interval on the mean, variance known:
Given x̄ = 492, construct a 95% confidence interval for μ, given σ = 16:
492 ± 1.96 (16/4½) = 492 ± 15.68, i.e., the interval (476.32, 507.68).
Notice the duality between the test of hypothesis and the confidence interval: the hypothesized
value 500 lies inside the interval, consistent with not rejecting H0.
Question for you: Given the same sampling information, construct a 90% confidence
interval for μ (see the sketch below).
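A short sketch of the test and interval above, using the values from the example (n = 4, x̄ = 492, σ = 16, H0: μ = 500):

import math

n, xbar, sigma, mu0, z_crit = 4, 492.0, 16.0, 500.0, 1.96   # values from the example

se = sigma / math.sqrt(n)                  # standard error = 8
z = (xbar - mu0) / se                      # -1.0
print(z, abs(z) > z_crit)                  # -1.0 False -> do not reject H0

# 95% confidence interval for the mean, variance known
print(xbar - z_crit * se, xbar + z_crit * se)   # (476.32, 507.68)

Replacing 1.96 by 1.645 gives the 90% interval asked for in the question.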
As a strong result, the CLT implies that if the sample size is large enough, then one may relax
the normality condition whenever dealing with testing or constructing a confidence interval
for the population mean (μ).
T-Density Function
Applications:
Question for you: Given the same sampling information, perform the test H0: μ = 11 vs. Ha: μ ≠ 11
at α = .01.
Question for you: Construct a 90% confidence interval for the same problem, is it wider than
the other one, why or why not?
Notice that the T-density converges to the standard normal N(0, 1) as sample size gets larger.
In fact the elements in the last row of t-table are the N(0,1) probabilities.
Chi-square Density Function
Applications:
Given n = 16 and S² = 2.22, test H0: σ² = 2.0 at α = .05. The sampling statistic is
χ²0 = (n - 1)S²/σ²0 = 15(2.22)/2.0 = 16.65; however, from the table, the critical values are
χ²(15, .025) = 27.488 and χ²(15, .975) = 6.26.
Conclusion: Since 6.26 < 16.65 < 27.488, there is no reason to reject that σ² = 2.0.
Example: Given the same sampling information as above, construct a 95% confidence
interval for σ²:
P[(n - 1)S²/χ²(15, .025) ≤ σ² ≤ (n - 1)S²/χ²(15, .975)] = 0.95.
Plugging in the given information, you should get approximately:
P[1.21 ≤ σ² ≤ 5.32] = 0.95.
Again, notice the duality between the test of hypothesis and the confidence interval.
Question for you: Given the same sampling information, should we reject that σ² = 2.0 at
α = .1?
Note that χ²(15, .05) = 25.00, and χ²(15, .95) = 7.26.
F-Density Function
A Fact: Consider two independent samples from two normal populations with variances σ1²
and σ2²; then
(S1²/σ1²) / (S2²/σ2²) ~ F(n1 - 1, n2 - 1).
Example: Find F such that P[F(8, 7) ≥ F] = .05. From the F-table, the value is F = 3.79.
Notice: By now, you should have noticed that every statistical table collected at the
end of your textbook provides the critical values for the right-tail as well as the left-tail
probabilities, except the F-table, which contains the critical values for the right-tail
probabilities only. However, one might use the following nice property of the F-distribution:
F(1 - α; ν1, ν2) = 1 / F(α; ν2, ν1)
to obtain the critical values for the left-tail probabilities. Here is a numerical example:
You need both tails probabilities for test of hypothesis and construction of confidence
interval for the ratio of two independent populations' variances.
Example: Find F such that P[F(8, 7) ≥ F] = .95. We may not be able to get this critical value
directly from the table; however, one may utilize the fact that:
F(.95; 8, 7) = 1 / F(.05; 7, 8) = 1 / 3.50 ≈ 0.29.
Applications:
Example: Given n1 = n2 = 16, S1² = 34.14, and S2² = 47.32, should we reject that σ1² = σ2² at
α = 0.1?
The sampling statistic is F = S1²/S2² = 0.72, and the critical values are F(15, 15, .05) = 2.38 and
F(15, 15, .95) = 1/2.38 = 0.42. Since 0.42 < 0.72 < 2.38, there is no reason to reject the equality
of the two variances.
Question for you: Given the same sampling information, construct a 90% confidence interval
for the variance ratio σ1²/σ2² (see the sketch below).
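A sketch of the variance-ratio test and the corresponding interval, using the critical value 2.38 quoted from the F-table above and the reciprocal rule for the left tail (equal degrees of freedom are assumed here):

n1, n2 = 16, 16
s1_sq, s2_sq = 34.14, 47.32

F = s1_sq / s2_sq                      # observed ratio, about 0.72
f_upper = 2.38                         # F(15, 15) right-tail 0.05, from the table above
f_lower = 1 / f_upper                  # left-tail critical value, about 0.42

print(round(F, 3), f_lower < F < f_upper)   # True -> no reason to reject equal variances

# 90% confidence interval for sigma1^2 / sigma2^2 (valid here because the
# numerator and denominator degrees of freedom are equal)
print(F / f_upper, F * f_upper)             # about (0.30, 1.72)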
An important class of decision problems under uncertainty involves situations for which there
are only two possible random outcomes.
The binomial probability function gives probability of exact number of"successes" in n
independent trials, when probability of success p on single trial is a constant. Each single trial
is called a Bernoulli Trial satisfying the following conditions:
1. Each trial results in one of two possible, mutually exclusive, outcomes. One of
the possible outcomes is denoted (arbitrarily) as a success, and the other is
denoted a failure.
2. The probability of a success, denoted by p, remains constant from trial to trial.
The probability of a failure, 1-p, is denoted by q.
3. The trials are independent; that is, the outcome of any particular trial is not
affected by the outcome of any other trial.
The mean and variance of the random variable r are np and npq, respectively, where q = 1 - p.
The skewness and (excess) kurtosis are (2q - 1)/(npq)½ and (1 - 6pq)/(npq), respectively. From
the skewness, we notice that the distribution is symmetric for p = 1/2 and most skewed when p
is near 0 or 1.
Its mode lies within the interval [(n+1)p - 1, (n+1)p]; therefore, if (n+1)p is not an integer, then the
mode is the integer within that interval. However, if (n+1)p is an integer, then the probability
function has two adjacent modes: (n+1)p - 1 and (n+1)p.
Determination of probabilities for p over 0.5: The binomial tables in some textbooks are
limited to determining the probabilities for values of p up to 0.5. However, these tables can also be
used for values of p over 0.5: by recasting the problem in terms of 1 - p and setting r to n - r,
the probability of obtaining r successes in n trials for a given value of p is equal to the
probability of obtaining n - r successes in n trials with success probability 1 - p.
Know that the binomial distribution must satisfy the following five requirements: (1) each
trial can have only two outcomes, or its outcomes can be reduced to two categories, which are
called pass and fail; (2) there must be a fixed number of trials; (3) the outcome of each trial
must be independent; (4) the probabilities must remain constant; and (5) the outcome of
interest is the number of successes.
Normal approximation for binomial: All binomial tables are limited in their scope;
therefore it is necessary to use standard normal distribution in computing the binomial
probabilities. The following numerical example illustrates how good the approximation could
be. This provides an indication for real applications when n is beyond the given values in the
available binomial tables.
Numerical Example: A sample of 20 items is taken randomly from a manufacturing process
with defective probability p = 0.40. What is the probability of obtaining exactly 5 defectives?
The mean and standard deviation are μ = np = 8 and σ = (npq)½ = (4.8)½ ≈ 2.19, respectively;
therefore, the standardized observations for r = 5, using the continuity correction (which always
enlarges the interval), are:
z1 = (4.5 - 8)/2.19 ≈ -1.60 and z2 = (5.5 - 8)/2.19 ≈ -1.14.
Therefore, the approximate P(5 out of 20) is P(z within the interval -1.60 to -1.14). Now, by
using the standard normal table, we obtain:
P(5 out of 20) ≈ 0.4452 - 0.3729 = 0.0723, which is close to the exact binomial value of 0.0746
(see the sketch below).
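The comparison can be reproduced with a short Python sketch (the helper phi below is a stand-in for a standard normal table).

from math import comb, erf, sqrt

def phi(z):
    # Standard normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p, r = 20, 0.4, 5
mu, sigma = n * p, sqrt(n * p * (1 - p))            # 8 and about 2.19

exact = comb(n, r) * p**r * (1 - p)**(n - r)        # about 0.0746
approx = phi((r + 0.5 - mu) / sigma) - phi((r - 0.5 - mu) / sigma)   # about 0.072
print(round(exact, 4), round(approx, 4))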
Poisson approximation for binomial: Notice that, whenever you use the Poisson approximation
to the binomial distribution with parameters n and p, the goodness of the approximation
is largely determined by the smallness of p rather than by how large n is.
You might like to use Common Discrete Probability Functions to obtain probability and the
cumulative probability functions.
You might like to use the Exact Confidence Interval Construction and Test of Hypothesis for
Binomial Population , and Binomial Probability Function JavaScript in performing some
numerical experimentation for validating the above assertions for a deeper understanding.
Geometric Distribution
In a sequence of independent and identically distributed Bernoulli (p) trials, the number of
trials required to get the 1st success has a Geometric(p) distribution.
A Typical Geometric Probability Function
Click on the image to enlarge it and THEN print it
If a single event or trial has two possible outcomes, say Xi can be 0 or 1 with P(Xi=1) = p, the
probability of having to observe k trials before the first "one" appears is given by the
geometric distribution.
The probability that the first "one" would appear on the first trial is p.
The probability that the first "one" appears on the second trial is p(1-p), because the first trial
had to have been a zero followed by a one.
By generalizing this procedure, the probability that there will be k - 1 failures before the first
success (i.e., that the first success occurs on trial k) is:
P(X = k) = (1 - p)^(k-1) p
Application: A manufacturing process is monitored. As each product exits the process line, it
is tested for defective versus non-defective. On the first defect, the process is stopped for re-
adjustment. The random variable X, the number of items tested until the first defect, follows a
Geometric distribution with p = P(product is defective).
The Geometric distribution has the memoryless property. Mathematically, for any non-
negative integers s and t, this property can be written:
P(X > s + t | X > s) = P(X > t)
Application: Gives probability of requiring exactly x binomial trials before the first success
is achieved. Used in quality control, reliability, and other industrial situations.
Example: Determination of probability of requiring exactly five tests firings before first
success is achieved.
The Geometric distribution is the discrete analogue of the Exponential distribution, which
models the time needed to get a success.
The Exponential distribution is the continuous analog of the Geometric distribution. Like the
Geometric distribution, the Exponential distribution also has the memoryless property.
Mathematically, for any non-negative real numbers s and t, this property can be written:
P(X > s + t | X > s) = P(X > t).
The Exponential distribution is a special case of the Gamma distribution (shape parameter r = 1).
Furthermore, the sum of r independent and identically distributed Exponential(θ) random variables
has a Gamma distribution with shape parameter r and scale parameter θ.
In a Poisson(λ) process, the waiting times between consecutive events are distributed as
Exponential with mean 1/λ.
You might like to use Common Discrete Probability Functions to obtain probability and the
cumulative probability functions.
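The two claims above, memorylessness and the exponential-to-gamma sum, can be checked by simulation; the sketch below assumes NumPy and uses the illustrative parameter values θ = 2 and r = 3.

import numpy as np

rng = np.random.default_rng(5)
theta, r = 2.0, 3                          # illustrative mean and number of summands

x = rng.exponential(scale=theta, size=(300_000, r))
total = x.sum(axis=1)                      # sum of r exponentials

# Gamma(shape=r, scale=theta) has mean r*theta and variance r*theta^2
print(total.mean(), total.var())           # close to 6 and 12

# Memoryless check: P(X > s + t | X > s) versus P(X > t)
s, t = 1.0, 2.0
single = x[:, 0]
print((single > s + t).mean() / (single > s).mean(), (single > t).mean())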
Negative Binomial Distribution
This is an extension of the geometric distribution, describing the waiting time until r "ones"
have appeared. The probability of the rth "one" appearing on the kth trial is given by the
negative binomial distribution:
P(X = k) = C(k − 1, r − 1) p^r (1 − p)^(k−r),  k = r, r + 1, r + 2, ...
In other words, the first part is the probability of r − 1 successes in the previous k − 1 trials,
as a binomial probability; the last term, p, is the probability of success on the kth trial.
Application: Suppose we are at a rifle range with an old gun that misfires 5 out of 6 times.
Define "success" as the event that the gun fires, and let X be the number of failures before the third
success. Then X has a Negative Binomial distribution with parameters (3, 1/6). The probability that there
are 10 failures before the third success is given by:
P(X = 10) = C(12, 2) (1/6)³ (5/6)¹⁰ ≈ 0.05, i.e., about 5%.
The expected value and variance of X are: E(X) = 3(1 − 1/6) / (1/6) = 15, and Var(X) = 3(1 −
1/6) / (1/6)² = 90.
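To verify these figures numerically, here is a minimal sketch in Python (assuming scipy is available); scipy's nbinom counts the number of failures before the rth success, matching the definition of X above:

from scipy.stats import nbinom

r, p = 3, 1/6                                 # third success, P(gun fires) = 1/6
print(nbinom.pmf(10, r, p))                   # P(X = 10) ~ 0.049, i.e., about 5%
print(nbinom.mean(r, p), nbinom.var(r, p))    # 15.0 and 90.0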
In a sequence of independent and identically distributed Bernoulli (p) trials, the number of
trials required to get the rth success has a Negative Binomial (r,p) distribution.
Example: The number of oil wells that must be drilled to get r productive wells.
Application: Gives probabilities similar to the Poisson distribution when events do not occur at a
constant rate, i.e., when the occurrence rate is itself a random variable that follows a Gamma distribution.
You might like to use Common Discrete Probability Functions to obtain probability and the
cumulative probability functions.
Hypergeometric Distribution
The Hypergeometric (x; n, M, N) Distribution applies when we are sampling n items without
replacement from a population of M successes and N-M failures.
The hypergeometric distribution arises when a random selection (without repetition) is made
among objects of two distinct types. Typical examples:
Consider choosing a random sample of size n from a population of N items of which M
belong to a particular category; the hypergeometric distribution gives the probability that
exactly x = k of the selected items belong to that category.
The Binomial distribution looks at n trials "with replacement." The hypergeometric
distribution is for the case "without replacement."
Here p changes from one Bernoulli trial to the next. Specifically, we have a population of size
N with M out of the N members being "Successes" and the remaining (N-M) being
"Failures." We choose a random sample of n (equivalent to taking out n members in
succession without replacement).
P(X = x) = C(M, x) C(N − M, n − x) / C(N, n),
for all integers x between max[0, n − (N − M)] and min[n, M]. The mean and the variance of X are
nM/N and n(M/N)(1 − M/N)(N − n)/(N − 1), respectively.
In other words, there is a total of N chips in the urn, of which M are red and N − M are white, and
n chips are drawn at random without replacement. Out of these n chips, k are red, and the
remainder (n − k) are white. So, the formula is the number of ways to choose k chips from the M
red chips in the urn, multiplied by the number of ways to choose n − k chips from the N − M white
chips. This is divided by the sample space, i.e., the number of ways to select n chips from the total
of N chips in the urn.
Application: Gives probability of picking exactly x good units in a sample of n units from a
population of N units when there are k bad units in the population. Used in quality control
and related applications.
Example: Given a lot with 21 good units and four defective. What is the probability that a
sample of five will yield not more than one defective?
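A minimal sketch in Python (assuming scipy is available) answers this lot-sampling question; note that scipy's hypergeom takes the population size, the number of "successes" (here, defectives) in the population, and the sample size:

from scipy.stats import hypergeom

rv = hypergeom(25, 4, 5)                # population of 25 (21 good, 4 defective), sample of 5
p_at_most_one = rv.pmf(0) + rv.pmf(1)
print(p_at_most_one)                    # ~ 0.83, the probability of not more than one defective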
Example: The number of defective items in a sample of size n from a box containing N items
of which k are defective.
Application: A manufacturing process is monitored. As each product exits the process line, it
is tested for defective versus non-defective. On the fifth defect, the process is stopped for re-
adjustment. The random variable X (the number of items tested until the process is stopped)
follows a Negative Binomial distribution with r = 5 and p = P(product is defective).
By extension, since the Binomial can be approximated by the Poisson, we can also
approximate the Hypergeometric by a Poisson if the Binomial approximation is appropriate
and n is reasonably large with k/N small.
You might like to use Common Discrete Probability Functions to obtain probability and the
cumulative probability functions.
Exponential Density Function
An important class of decision problems under uncertainty concerns the random durations
between events. For example, the length of time between breakdowns of a machine not
exceeding a certain time interval, such as the copying machine in your office not breaking
down during this week. The Exponential density function is:
f(t) = λ exp(−λt),  t ≥ 0,
where λ is the average number of events per unit of time, which is a positive number.
The mean and the variance of the random variable t (time between events) are 1/λ, and 1/λ²,
respectively.
Applications include probabilistic assessment of the time between arrivals of patients to the
emergency room of a hospital, and time between arrivals of ships at a particular port.
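For instance, here is a minimal sketch in Python; the breakdown rate λ = 0.5 per week (one breakdown every two weeks, on average) is an assumed value for illustration only:

import math

lam = 0.5                    # assumed rate: breakdowns per week
t = 1.0                      # one week
print(math.exp(-lam * t))    # P(no breakdown this week) = e^(-lambda*t) ~ 0.61
print(1 / lam, 1 / lam**2)   # mean and variance of the time between events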
You might like to use Exponential Density to perform your computations, and Lilliefors Test
for Exponentiality to perform the goodness-of-fit test.
F-Density Function
The F-distribution is the distribution of the ratio of two independent estimates of variance,
based on samples of sizes n1 and n2, respectively, from normal distributions. Equivalently, it is
formed by the ratio of two independent Chi-square variables, each divided by its respective
degrees of freedom.
Its main applications are in testing equality of two independent population variances based on
two independent random samples, ANOVA, and regression analysis.
By now, you should have noticed that every statistical table collected at the end of
your textbook provides the critical values for the right-tail as well as the left-tail
probabilities, except the F-table, which contains the critical values for the right-tail
probabilities only. However, one might use the following nice property of the F-distribution:
F(ν1, ν2, 1 − α) = 1 / F(ν2, ν1, α)
to obtain the critical values for the left-tail probabilities. Here is a numerical example:
F(2, 3, 0.90) = 1 / F(3, 2, 0.10) = 1 / 9.16 = 0.109
You need both tails probabilities for test of hypothesis and construction of confidence
interval for the ratio of two independent populations' variances.
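A minimal sketch in Python (assuming scipy is available) confirms the reciprocal property used above:

from scipy.stats import f

upper = f.ppf(0.90, 3, 2)        # right-tail 10% critical value with (3, 2) d.f. ~ 9.16
lower = f.ppf(0.10, 2, 3)        # left-tail 10% critical value with (2, 3) d.f.
print(upper, lower, 1 / upper)   # lower ~ 1/upper ~ 0.109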
The expected value of Chi-square statistic is its d.f., its variance is twice of its d.f., and its
mode is equal to (d.f.- 2).
Similar to Normal random variables, the Chi-square has the additive property: for two
independent Chi-square variables, their sum is also Chi-square, with degrees of
freedom equal to the sum of the individual degrees of freedom. Thus (n − 1)S², the sum of
squared deviations used in the unbiased sample variance of a sample of size n from N(0, 1), is a
sum of n − 1 independent Chi-squares, each with d.f. = 1; hence it is Chi-square with d.f. = n − 1.
The Chi-square Test for Association is a non-parametric test; therefore, it can be
used for nominal data too. It is a test of statistical significance widely used in bivariate tabular
association analysis. Typically, the hypothesis is whether or not two populations are different
in some characteristic or aspect of their behavior based on two random samples. This test
procedure is also known as the Pearson Chi-square test.
Expected value is another name for the mean and (arithmetic) average.
2, 3, 2, 2, 0, 3
The average is obtained by summing up all the numbers and dividing by their count:
(2 + 3 + 2 + 2 + 0 + 3) / 6 = 2,
which is also the sum of each distinct observation times its probability (its relative
frequency): 0(1/6) + 2(3/6) + 3(2/6) = 2. Right?
Expected value is known also as the First Moment, borrowed from Physics,
because it is the point of balance where the data and the probabilities are the
distances and the weights, respectively.
We simplify this using the above rules. First, because the expectation of a sum
equals the sum of expectations:
Var(X) = E[(X − μ)²] = E[X²] − 2μE[X] + μ² = E[X²] − (E[X])².
Finally, notice that E[X²] can be written as E[g(X)] where g(X) = X². From the
final fact about expectations, we can calculate this as:
E[X²] = Σ x² P(X = x), over all x.
For example, suppose we toss two fair coins and we are interested in
determining the expected value and the variance of the number of heads, X:
E[X] = 0(1/4) + 1(1/2) + 2(1/4) = 1, and
E[X²] = (0)²P(X=0) + (1)²P(X=1) + (2)²P(X=2) = 0(1/4) + 1(1/2) + 4(1/4) = 3/2.
Therefore, Var(X) = 3/2 − 1² = 1/2.
Application: Notice that the above two examples are among the tools
well suited for reducing, or even preventing, round-off errors in statistical
computation, as well as computer over/underflows.
X: 220, 220, 260, 280, 270, 250, 300, 290, 240.
We wish to estimate the mean and the variance of the population based on this
sample.
Let a = 10, and b = 22; then, dividing each value in the observational data set by
a = 10 and subtracting b = 22 from the result, we obtain a new data set Y:
Y: 0, 0, 4, 6, 5, 3, 8, 7, 2.
Computing the mean and the variance of set Y, we obtain:
Σyi = 35, Σyi² = 203.
Hence, the estimated mean and variance using the Y data set are 35/9 = 3.89, and
[203 − 9(35/9)²] / 8 = 8.36, respectively. However, notice that X = 10(Y + 22) = 10Y + 220;
therefore, the estimated mean and variance for the population are E(X) = 10[E(Y) + 22]
= 10(3.89 + 22) = 258.9, and Var(X) = 10² Var(Y) = 836, respectively.
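A minimal sketch in Python (assuming numpy is available) confirms that the coded data set Y gives back the same estimates as working with X directly:

import numpy as np

x = np.array([220, 220, 260, 280, 270, 250, 300, 290, 240])
y = x / 10 - 22                                   # coded data: 0, 0, 4, 6, 5, 3, 8, 7, 2
print(10 * (y.mean() + 22), 100 * y.var(ddof=1))  # ~ 258.9 and ~ 836, recovered from Y
print(x.mean(), x.var(ddof=1))                    # the same values computed directly from X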
Notice that the variance is not expressed in the same units as the expected
value. So, the variance is hard to understand and to explain, as a result of the
squared term in its computation. This can be alleviated by working with the
square root of the variance, which is called the Standard Deviation (standard in
the sense of having the same unit as the data):
For the dynamic process, the Volatility as a measure for risk includes the time
period over which the standard deviation is computed. The Volatility
measure is defined as standard deviation divided by the square root of the
time duration.
CV = 100 |σ / μ| %
Notice that the CV is independent of the unit of measurement. The
coefficient of variation demonstrates the relationship between standard
deviation and expected value, by expressing the risk as a percentage of the
expected value. The inverse of CV (namely 1/CV) is called the Signal-to-
Noise Ratio.
You might like to use Multinomial Applet for checking your computation and
performing computer-assisted experimentation.
- Two Investments -
Investment I Investment II
Payoff % Prob. Payoff % Prob.
1 0.25 3 0.33
7 0.50 5 0.33
12 0.25 8 0.34
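To see what this table is meant to convey, here is a minimal sketch in Python that computes the expected payoff, the standard deviation, and the coefficient of variation of each investment; Investment II has the smaller expected payoff but carries less risk per unit of expected return (lower CV):

def summarize(payoffs, probs):
    m = sum(x * p for x, p in zip(payoffs, probs))                    # expected payoff
    sd = sum((x - m) ** 2 * p for x, p in zip(payoffs, probs)) ** 0.5
    return m, sd, 100 * sd / m                                        # CV in %

print(summarize([1, 7, 12], [0.25, 0.50, 0.25]))   # Investment I:  ~ (6.75, 3.90, 57.7)
print(summarize([3, 5, 8], [0.33, 0.33, 0.34]))    # Investment II: ~ (5.36, 2.06, 38.5)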
Hence, the expected amount of money spent in the store is (50)(80) = $4000.
In the Descriptive Statistic Section of this Web site, we have been concerned
with how empirical scores are distributed and how best to describe their
distribution. We have discussed several different measures, but the mean
will be the measure that we use to describe the center of the distribution, and
the standard deviation will be the measure we use to describe the spread of
the distribution. Knowing these two facts gives us ample information to make
statements about the probability of observing a certain value within that
distribution. If I know, for example, that the average Intelligence Quotient
(I.Q.) score is 100 with a standard deviation of σ = 20, then I know that
someone with an I.Q. of 140 is very smart. I know this because 140 deviates
from the mean by twice the average amount as the rest of the scores in the
distribution. Thus, it is unlikely to see a score as extreme as 140 because most
of the I.Q. scores are clustered around 100 and only deviate 20 points from the
mean .
Many applications arise from the central limit theorem (CLT). The CLT states
that, under quite general conditions, the distribution of the average of n observations
approaches a normal distribution as n grows, irrespective of the form of the original distribution.
Consequently, normal distribution is an appropriate model for many, but not
all, physical phenomena, such as distribution of physical measurements on
living organisms, intelligence test scores, product dimensions, average
temperatures, and so on.
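A minimal simulation sketch in Python (assuming numpy is available) illustrates the CLT: averages of 30 draws from a strongly skewed exponential population are already close to normally distributed:

import numpy as np

rng = np.random.default_rng(1)
means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)
print(means.mean(), means.std())   # close to 1 and 1/30**0.5 ~ 0.18
# A histogram of `means` is approximately bell-shaped despite the skewed parent population.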
Know that the Normal distribution satisfies the following requirements: (1) its
graph is a bell-shaped curve; (2) the mean, median and mode are all equal;
(3) the mean, median and mode are located at the center of the distribution; (4) it
has only one mode; (5) it is symmetric about the mean; (6) it is a continuous
function; (7) it never touches the x-axis; and (8) the area under the curve equals one.
When we know the mean and variance of a Normal then it allows us to find
probabilities. So, if, for example, you knew some things about the average
height of women in the nation, including the fact that heights are distributed
normally, you could measure all the women in your extended family and find
the average height. This enables you to determine the probability of getting
your result, given your knowledge of women nationwide. If that probability is
high, then your family's female height cannot be said to be different from
average. If that probability is low, then your result is rare (given the knowledge
about women nationwide), and you can say your family is different. You have
just completed a test of the hypothesis that the average height of women in
your family is different from the overall average.
Let X denote the random portfolio loss, distributed as X ~ N(0, 10,000²). The
value at risk v at the 5% level is, by definition, a number such that P(X > v) = 0.05;
hence v = 1.645 × 10,000 = $16,450.
Life is good for only two things, discovering mathematics and teaching
mathematics.
-- Simeon Poisson
An important class of decision problems under uncertainty is characterized by
the small chance of the occurrence of a particular event, such as an accident.
The Poisson probability function computes the probability of exactly x
independent occurrences during a given period of time, if events take place
independently and at a constant rate. The Poisson probability function may also
represent the number of occurrences over constant areas or volumes:
P(X = x) = λ^x e^(−λ) / x!,  x = 0, 1, 2, ...
Poisson probabilities are often used; for example in quality control, software
and hardware reliability, insurance claim, number of incoming telephone calls,
and queuing theory.
A process that creates fabric is monitored. If the number of defects (X) per
meter of fabric exceeds 5 then the process is stopped for diagnosis. The
random variable X follows a Poisson distribution with rate λ = number of
defects per meter of fabric.
The mean and variance of the random variable X are both λ. However, if the mean
and variance of a random variable have equal numerical values, it is not
necessarily the case that its distribution is Poisson. Its mode is within the interval [λ − 1, λ].
Applications:
P(0 arrivals) = e^(−λ)
P(1 arrival) = λ e^(−λ) / 1!
P(2 arrivals) = λ² e^(−λ) / 2!
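A minimal sketch in Python; the rate λ = 2 defects per meter is an assumed value for illustration only:

from math import exp, factorial

lam = 2.0                                            # assumed average defects per meter
pmf = [lam**x * exp(-lam) / factorial(x) for x in range(6)]
print(pmf)                                           # P(X = 0), ..., P(X = 5)
print(1 - sum(pmf))                                  # P(X > 5), the chance the process is stopped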
Normal approximation for Poisson: All Poisson tables are limited in their
scope; therefore, it is necessary to use standard normal distribution in
computing the Poisson probabilities. The following numerical example
illustrates how good the approximation could be.
μ = λ = 1, and σ = λ½ = 1,
Notice that by taking the square root of a Poisson random variable, the
transformed variable is more symmetric. This is a useful transformation in
regression analysis of Poisson observations.
Poisson approximation for binomial: Notice that, whenever you use the Poisson
approximation to the binomial distribution with parameters n and p, the
goodness of the approximation is largely determined by the smallness of the p
parameter rather than by how large n is.
You might like to use Common Discrete Probability Functions to obtain
probability and the cumulative probability functions.
You might like to use Poisson Probability Function JavaScript to perform your
computation, and Testing Poisson to perform the goodness-of-fit test.
Further Reading:
Barbour et al., Poisson Approximation, Oxford University Press, 1992.
If Z is a standard normal variable and χ² is an independent Chi-square variable
with (n − 1) d.f., then the random variable
t = Z / [χ² / (n − 1)]½
has a t-distribution with (n − 1) d.f. For large sample size (say, n over 30), this
random variable has an expected value equal to zero, and its variance is
(n − 1)/(n − 3), which is close to one.
Notice that the t-statistic is related to the F-statistic as follows: F = t², where F has
d.f.1 = 1 and d.f.2 equal to the d.f. of the t-statistic.
The parameters for the Triangular distribution are the Minimum (a), the Maximum (b),
and the Likeliest (c). The three conditions underlying the triangular distribution are that
the minimum a and the maximum b are fixed, and that the most likely value c lies between
them (a ≤ c ≤ b).
The following are the general Triangular density function, together with the
expected value and the variance, for a Triangular random variable X(a, c, b):
f(x) = 2(x − a) / [(b − a)(c − a)] for a ≤ x ≤ c, and f(x) = 2(b − x) / [(b − a)(b − c)] for c < x ≤ b;
E(X) = (a + b + c)/3, and Var(X) = (a² + b² + c² − ab − ac − bc)/18.
Further Reading:
Evans M., Hastings N., and B., Peacock, Triangular Distribution, Ch. 40 in Statistical Distributions,
Wiley, pp. 187-188, 2000.
You might like to use the Goodness-of-Fit Test for Uniform in performing some
numerical experimentation for a deeper understanding of the concepts.
Notice that any Uniform distribution has an uncountable number of modes, all having
equal density value; therefore, it is considered a homogeneous population.
Further Reading:
Balakrishnan N., and V. Nevzorov, A Primer on Statistical Distributions, Wiley, 2003.
1. Any undetected outliers may have major impact and may influence the results of
almost all statistical estimation and testing procedures.
2. Homogeneous population. That is, there is not more than one mode. Perform Test for
Homogeneity of a Population
3. The sample must be random. Perform Test for Randomness.
4. In addition to the Homogeneity requirement, each population has a normal
distribution. Perform the Lilliefors' Test for Normality.
5. Homogeneity of variances. Variation in each population is almost the same as in the
other(s). Perform The Bartlett's Test.
For two populations, use the F-test. For 3 or more populations, there is a practical rule
known as the "Rule of 2": divide the highest sample variance by the lowest sample
variance. Provided that the sample sizes are almost the same and the value of this ratio
is less than 2, the variations of the populations may be taken to be almost the same.
Notice: This important condition in analysis of variance (ANOVA and the t-test for
mean differences) is commonly tested by the Levene test or its modified test known as
the Brown-Forsythe test. Interestingly, both tests rely on the homogeneity of
variances condition!
These conditions are crucial, not for the method of computation, but for the testing using the
resultant statistic. Otherwise, we can do ANOVA and regression without any assumptions,
and the numbers come out the same. Simple computations give us least-square fits, partitions
of variance, regression coefficients, and so on. We do need the above conditions when tests of
hypotheses are our main concern.
Further Readings:
Good Ph., and J. Hardin, Common Errors in Statistics, Wiley, 2003.
Wang H., Improved confidence estimators for the usual one-sided confidence intervals for the ratio of two normal
variances, Statistics & Probability Letters, Vol. 59, No.3, 307-315, 2002.
Robust statistical techniques are needed to cope with any undetected outliers; otherwise they
are more likely to invalidate the conditions underlying statistical techniques, and they may
seriously distort estimates and produce misleading conclusions in test of hypotheses. A
common approach consists of assuming that contaminating models, different from the one
generating the rest of the data, generate the (possible) outliers.
Because of a potentially large variance, outliers could be the outcome of sampling errors or of
clerical errors made while recording the data. Therefore, you must be very careful and cautious.
Before declaring an observation"an outlier," find out why and how such observation
occurred. It could even be an error at the data entering stage while using any computer
package.
In practice, any observation with a standardized value greater than 2.5 in absolute value is a
candidate for being an outlier. In such a case, one must first investigate the source of the
datum. If there is no doubt about the accuracy or veracity of the observation, then it should be
removed, and the model should be refitted.
1. Compute the mean (x̄) and standard deviation (S) of the whole sample.
2. Set limits around the mean: x̄ − kS and x̄ + kS (with, e.g., k = 2.5).
3. Any observation falling outside these limits is a candidate outlier and should be investigated.
An Application: Suppose you ask ten of your classmates to measure a given length X. The
results (in mm) are:
46, 48, 38, 45, 47, 58, 44, 45, 43, 44
Is 58 an outlier? Computing the mean and the standard deviation of the ten measurements using the
Descriptive Sampling Statistics JavaScript gives 45.8 and 5.1 (after the needed adjustment),
respectively. The Z-value for 58 is Z(58) = (58 − 45.8)/5.1 = 2.4. Since the measurements, in general,
follow a normal distribution, we have
Probability [X as large as 2.4 times standard deviation] = 0.008,
obtained by using the Standard Normal P-value JavaScript, or from the normal table in your
textbook.
According to this probability, one expects only about 0.08 of the ten measurements (10 × 0.008)
to be as extreme as this one. This is a very rare event; however, in spite of such a small
probability, it has occurred. Therefore, it might be an outlier.
The next most suspected measurement is 38; is it an outlier? It is a question for you.
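A minimal sketch in Python to check your answer for both candidate observations; the value 38 standardizes to only about −1.5, so it is far less suspect than 58:

data = [46, 48, 38, 45, 47, 58, 44, 45, 43, 44]
n = len(data)
mean = sum(data) / n
s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5   # ~ 5.1
for x in (58, 38):
    print(x, round((x - mean) / s, 2))                      # z(58) ~ 2.4, z(38) ~ -1.5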
A Notice: Outlier detection in the single population setting is not too difficult. Quite often,
however, one can argue that the detected outliers are not really outliers, but form a second
population. If this is the case, a data separation approach needs to be taken.
You might like to use the Identification of Outliers JavaScript in performing some numerical
experimentation for validating and for a deeper understanding of the concepts.
Further Reading:
Barnett V., and T. Lewis, Outliers in Statistical Data, Wiley, 1994.
Homogeneous Population
Notice that, e.g., a Uniform distribution has uncountable number of modes having equal
density value; therefore it is considered as a homogeneous population.
A basic condition in almost all inferential statistics is that a set of data constitutes a random
sample from a given homogeneous population. The condition of randomness is essential to
make sure the sample is truly representative of the population. The widely used test for
randomness is the Runs test.
Consider the following sequence (D for Defective items, N for Non-defective items) from a
production line: DDDNNDNDNDDD. The number of runs is R = 7, with n1 = 8 and n2 = 4,
which are the numbers of D's and N's, respectively.
The Runs Test, which is also known as the Wald-Wolfowitz Test, is designed to test the
randomness of a given sample at the 100(1 − α)% confidence level. To conduct a runs test on a
sample, perform the following steps:
Step 1: Compute the mean of the sample.
Step 2: Going through the sample sequence, replace any observation with + or −, depending
on whether it is above or below the mean. Discard any ties.
Step 3: Compute R, the number of runs, and n1 and n2, the numbers of + signs and − signs, respectively.
Step 4: Compute the expected number of runs:
μR = 1 + 2n1n2/(n1 + n2).
Step 5: Compute the standard deviation of the number of runs,
σR = {2n1n2(2n1n2 − n1 − n2) / [(n1 + n2)²(n1 + n2 − 1)]}½,
and the test statistic z = (R − μR)/σR.
Step 6: Conclusion: At significance level α, reject the hypothesis of randomness if |z| exceeds the critical value Zα/2.
Note: This test is valid for cases for which both n1 and n2 are large, say greater than 10. For
small sample sizes, special tables must be used.
For example, suppose for a given sample of size 50, we have R = 24, n1 = 14 and n2 = 36.
Test for randomness at = 0.05.
Plugging these into the above formulas, we have μR = 1 + 2(14)(36)/50 = 21.16, σR = 2.81, and
z = (24 − 21.16)/2.81 = 1.01. From the Z-table, the critical value is Z0.025 = 1.96. Since |z| is less
than the critical value, there is no evidence against randomness; the sample may be regarded as random.
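A minimal sketch in Python reproduces this runs-test computation from the counts above:

from math import sqrt

R, n1, n2 = 24, 14, 36
n = n1 + n2
mu_R = 1 + 2 * n1 * n2 / n                                           # ~ 21.16
sigma_R = sqrt(2 * n1 * n2 * (2 * n1 * n2 - n) / (n**2 * (n - 1)))   # ~ 2.81
z = (R - mu_R) / sigma_R                                             # ~ 1.01
print(mu_R, sigma_R, z)       # |z| < 1.96: randomness is not rejected at the 5% level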
The standard test for normality is the Lilliefors' statistic. A histogram and a normal probability
plot will also help you detect a systematic departure from normality, which shows up as
curvature in the probability plot.
Lilliefors' Test for Normality: This test is a special case of the Kolmogorov-Smirnov
goodness-of-fit test, developed for testing the normality of population's distribution. When
applying the Lilliefors test, a comparison is made between the standard normal cumulative
distribution function, and a sample cumulative distribution function with standardized
random variable. If there is a close agreement between the two cumulative distributions, the
hypothesis that the sample was drawn from population with a normal distribution function is
supported. If, however, there is a discrepancy between the two cumulative distribution
functions too great to be attributed to chance alone, then the hypothesis is rejected.
The difference between the two cumulative distribution functions is measured by the statistic
D, which is the greatest vertical distance between the two functions.
You might like to use the well-known Lilliefors' Test for Normality to assess the goodness-
of-fit.
Further Readings
Thode T., Testing for Normality, Marcel Dekker, Inc., 2001. Contains the major tests for normality.
Introduction to Estimation
To estimate means to esteem (to give value to). An estimator is any quantity calculated from
the sample data which is used to give information about an unknown quantity in the
population. For example, the sample mean x̄ is an estimator of the population mean μ.
Estimators of population parameters are sometimes distinguished from the true value by
using the symbol 'hat'. For example, the true population standard deviation σ is estimated by
the sample standard deviation, denoted σ̂ (or S).
Again, the usual estimator of the population mean is x̄ = Σxi / n, where n is the size of the
sample and x1, x2, x3, ..., xn are the values of the sample. If the value of the estimator in a
particular sample is found to be 5, then 5 is the estimate of the population mean µ.
A"Good" estimator is the one which provides an estimate with the following qualities:
Consistency: The standard deviation of an estimate is called the standard error of that
estimate. The larger the standard error the more error in your estimate. The standard deviation
of an estimate is a commonly used index of the error entailed in estimating a population
parameter based on the information in a random sample of size n from the entire population.
An estimator is said to be"consistent" if increasing the sample size produces an estimate with
smaller standard error. Therefore, your estimate is"consistent" with the sample size. That is,
spending more money to obtain a larger sample produces a better estimate.
Efficiency: An efficient estimate is one which has the smallest standard error among all
unbiased estimators.
The"best" estimator is the one which is the closest to the population parameter being
estimated:
The above figure illustrates the concept of closeness by means of aiming at the center for
unbiased with minimum variance. Each dart board has several samples:
The first one has all its shots clustered tightly together, but none of them hit the center. The
second one has a large spread, but around the center. The third one is worse than the first two.
Only the last one has a tight cluster around the center, therefore has good efficiency.
If an estimator is unbiased, then its variability will determine its reliability. If an estimator is
extremely variable, then the estimates it produces may not on average be as close to the
population parameter as a biased estimator with small variance.
The following chart depicts the quality of a few popular estimators for the population mean
µ:
The widely used estimator of the population mean is x̄ = Σxi/n, where n is the size of the
sample and x1, x2, x3, ..., xn are the values of the sample. This estimator has all of the above
good properties; therefore, it is a "good" estimator.
If you want an estimate of central tendency as a parameter for a test or for comparison, then
small sample sizes are unlikely to yield any stable estimate. The mean is sensible in a
symmetrical distribution as a measure of central tendency; but, e.g., with ten cases, you will
not be able to judge whether you have a symmetrical distribution. However, the mean
estimate is useful if you are trying to estimate the population sum, or some other function of
the expected value of the distribution. Would the median be a better measure? In some
distributions (e.g., shirt size) the mode may be better. BoxPlot will indicate outliers in the
data set. If there are outliers, the median is better than the mean as a measure of central
tendency.
You might like to use Descriptive Statistics JavaScript for obtaining"good" estimates.
Further Readings
Casella G., and R. Berger, Statistical Inference, Wadsworth Pub. Co., 2001.
Lehmann E., and G. Casella, Theory of Point Estimation, Springer Verlag, New York, 1998.
In most studies, investigators are usually interested in determining the size of difference of a
measured outcome between groups, rather than a simple indication of whether or not it is
statistically significant. Confidence intervals present a range of values, on the basis of the
sample data, in which the value of such a difference may lie.
Know that a confidence interval computed from one sample will be different from a
confidence interval computed from another sample.
Understand the relationship between sample size and width of confidence interval, moreover,
know that sometimes the computed confidence interval does not contain the true value.
Let's say you compute a 95% confidence interval for a mean . The way to interpret this is to
imagine an infinite number of samples from the same population; at least 95% of the
computed intervals will contain the population mean , and at most 5% will not. However, it
is wrong to state,"I am 95% confident that the population mean falls within the interval."
Is the probability of occurrence of the population mean greater in the confidence interval (CI)
center and lowest at the boundaries? Does the probability of occurrence of the population
mean in a confidence interval vary in a measurable way from the center to the boundaries? In
a general sense, normality condition is assumed, and then the interval between CI limits is
represented by a bell shaped t distribution. The expectation (E) of another value is highest at
the calculated mean value, and decreases as the values approach the CI limits.
Tolerance Interval and CI: A good approximation for the single measurement tolerance
interval is n½ times confidence interval of the mean.
You may use Estimations With Confidence, and Confidence Intervals for Two Populations to
check your hand computations.
You need to use Sample Size Determination JavaScript at the design stage of your statistical
investigation in decision making with specific subjective requirements.
A Note on Multiple Comparison via Individual Intervals: Notice that, if the confidence
intervals from two samples do not overlap, there is a statistically significant difference, say at
5%. However, the other way is not true; two confidence intervals can overlap even when
there is a significant difference between them.
As a numerical example, consider the means of two independent samples. Suppose their
values are 10 and 22 with equal standard error of 4. The 95% confidence interval for the two
statistics (using the critical value of 1.96) are: [2.2, 17.8] and [14.2, 29.8], respectively. As
you see they display considerable overlap. However, the z-statistic for the two-population
mean is: |22 -10|/(16 + 16)½ = 2.12 which is clearly significant under the same conditions as
applied for constructing the confidence intervals.
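A minimal sketch in Python reproduces this comparison, showing overlapping individual intervals alongside a significant two-sample z-statistic:

from math import sqrt

m1, m2, se = 10, 22, 4
ci1 = (m1 - 1.96 * se, m1 + 1.96 * se)     # ~ ( 2.2, 17.8)
ci2 = (m2 - 1.96 * se, m2 + 1.96 * se)     # ~ (14.2, 29.8) -- overlaps ci1
z = abs(m2 - m1) / sqrt(se**2 + se**2)     # ~ 2.12 > 1.96, a significant difference
print(ci1, ci2, z)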
One should examine the confidence interval for the difference explicitly. Even if the
confidence intervals are overlapping, it is hard to find the exact overall confidence level.
However, the sum of the individual error levels can serve as an upper limit for the overall
error level. This is evident from the fact that P(A or B) ≤ P(A) + P(B).
Numerical examples for construction of confidence intervals are given in The Statistical
Tables section.
Further Reading:
Cohen J., Statistical Power Analysis for the Behavioral Sciences, L. Erlbaum Associates, 1988.
Kraemer H., and S. Thiemann, How Many Subjects? Provides basic sample size tables, explanations, and power analysis.
Murphy K., and B. Myors, Statistical Power Analysis, L. Erlbaum Associates, 1998. Provides a simple and general sample
size determination for hypothesis tests.
Newcombe R., Interval estimation for the difference between independent proportions: Comparison of eleven methods,
Statistics in Medicine, 17, 873-890, 1998.
Hahn G. and W. Meeker, Statistical Intervals: A Guide for Practitioners, Wiley, 1991.
Schenker N., and J. Gentleman, On judging the significance of differences by examining the overlap between confidence
intervals, The American Statistician, 55(2), 135-139, 2001.
Estimation is the process by which sample data are used to indicate the value of an unknown
quantity in a population.
Whenever we use point estimation, we calculate the margin of error associated with that point
estimate. For example, for the estimation of the population proportion, by the means of
sample proportion (p), the margin of error is calculated often as follows:
±1.96 [p(1 − p)/n]½
In newspapers and television reports on public opinion polls, the margin of error often
appears in a small font at the bottom of a table or screen. However, reporting the amount of
error only, is not informative enough by itself, what is missing is the degree of the
confidence in the findings. The more important missing piece of information is the sample
size n; that is, how many people participated in the survey, 100 or 100000? By now, you
know well that the larger the sample size the more accurate is the finding, right?
The reported margin of error is the margin of"sampling error". There are many non-sampling
errors that can and do affect the accuracy of polls. Here we talk about sampling error. The
fact that sub-groups might have sampling error larger than the group, one must include the
following statement in the report:
"Other sources of error include, but are not limited to, individuals refusing to participate in
the interview and inability to connect with the selected number. Every feasible effort was
made to obtain a response and reduce the error, but the reader (or the viewer) should be aware
that some error is inherent in all research."
If you have a yes/no question in a survey, you probably want to estimate the proportion P of
Yes's (or No's). In a simple random sample survey, the variance of p is p(1 − p)/n, ignoring the
finite population correction, for large n, say over 30. Now a 95% confidence interval is
p ± 1.96 [p(1 − p)/n]½.
For continuous random variables, such as the estimation of the population mean μ, the
margin of error is calculated often as follows:
±1.96 S/n½.
The margin of error can be reduced by one or a combination of the following strategies:
increasing the sample size n, reducing the variability in the data (for example, by improving
the measurement or the sampling design), or lowering the confidence level.
You might like to use Descriptive Statistics JavaScript to check your computations, and
Sample Size Determination JavaScript at the design stage of your statistical investigation in
decision making with specific subjective requirements.
Further Reading
Levy P., and S. Lemeshow, Sampling of Populations: Methods and Applications, Wiley, 1999.
Some inferential statistical techniques do not require distributional assumptions about the
statistics involved. These modern non-parametric methods use large amounts of computation
to explore the empirical variability of a statistic, rather than making a priori assumptions
about this variability, as is done in the traditional parametric t- and z-tests.
Bootstrapping: The bootstrapping method obtains an estimate by combining the estimators
computed from many sub-samples of a data set. Often, M samples of T observations each are
randomly drawn, with replacement, from the original data set of size n, where T is less than n.
Jackknife Estimator: A jackknife estimator creates a series of estimates from a single data set
by computing the statistic repeatedly on the data set, leaving one data value out each time.
This produces a mean estimate of the parameter and a standard deviation of the estimates of
the parameter.
Monte Carlo simulation allows for the evaluation of the behavior of a statistic when its
mathematical analysis is intractable. Bootstrapping and jackknifing allow inferences to be
made from a sample when traditional parametric inference fails. These techniques are
especially useful to deal with statistical problems, such as small sample size, statistics with no
well-developed distributional theory, and parametric inference condition violations. Both are
computer intensive. Bootstrapping means you take repeated samples from a sample and then
make statements about a population. Bootstrapping entails sampling-with-replacement from a
sample. Jackknifing involves systematically doing n steps, of omitting 1 case from a sample
at a time, or, more generally, n/k steps of omitting k cases; computations that
compare"included" vs."omitted" can be used (especially) to reduce the bias of estimation.
Both have applications in reducing bias in estimations.
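A minimal sketch in Python (assuming numpy is available) shows both ideas on a small illustrative sample; the data values are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
sample = np.array([46, 48, 38, 45, 47, 58, 44, 45, 43, 44])   # illustrative data

# Bootstrap: resample with replacement and recompute the mean many times
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(5000)]
print(np.std(boot_means))      # bootstrap estimate of the standard error of the mean

# Jackknife: recompute the mean leaving one observation out at a time
jack_means = [np.delete(sample, i).mean() for i in range(sample.size)]
print(np.mean(jack_means), np.std(jack_means))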
Following the first publication of the general technique (and the bootstrap) in 1969 by Julian
Simon and subsequent independent development by Bradley Efron, resampling has become
an alternative approach for testing hypotheses.
There are other findings: "The bootstrap started out as a good notion in that it presented, in
theory, an elegant statistical procedure that was free of distributional conditions. In practice
the bootstrap technique doesn't work very well, and the attempts to modify it make it more
complicated and more confusing than the parametric procedures that it was meant to replace."
While resampling techniques may reduce the bias, they achieve this at the expense of
increase in variance. The two major concerns are:
1. The loss in accuracy of the estimate as measured by variance can be very large.
2. The dimension of the data affects drastically the quality of the samples and therefore
the estimates.
Further Readings:
Young G., Bootstrap: More than a Stab in the Dark?, Statistical Science, l9, 382-395, 1994. Provides the pros and cons on
the bootstrap methods.
Yatracos Y., Assessing the quality of bootstrap samples and of the bootstrap estimates obtained with finite resampling,
Statistics and Probability Letters, 59, 281-292, 2002.
Prediction Intervals
In many application of business statistics, such as forecasting, we are interested in
construction of a statistical interval for random variable, rather than a parameter of a
population distribution.
Tchebysheff's inequality is often used to put bounds on the probability that a
random variable X will be within k ≥ 1 standard deviations of its mean,
for any probability distribution. In other words:
P(|X − μ| ≤ kσ) ≥ 1 − 1/k².
The above bound can be improved (i.e., becomes tighter) if we have some
knowledge about the population distribution. For example, if the population is
homogeneous, that is, its distribution is unimodal, then:
P(|X − μ| ≤ kσ) ≥ 1 − 4/(9k²), for k > (8/3)½.
Now, let X be a random variable distributed normally with estimated mean x̄ and
standard deviation S; then a prediction interval for a single future observation, with
100(1 − α)% confidence level, is:
x̄ ± tα/2 S (1 + 1/n)½.
This is the range of a future value of the random variable with 100(1 − α)% confidence, using the t-table.
Relaxing the normality condition for this prediction interval requires a large
sample size, say n over 30.
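A minimal sketch in Python (assuming scipy is available); the sample summary figures n, x̄ and S below are assumed values for illustration only:

from scipy.stats import t

n, xbar, S, alpha = 25, 100.0, 15.0, 0.05        # assumed sample summary
half = t.ppf(1 - alpha / 2, n - 1) * S * (1 + 1 / n) ** 0.5
print(xbar - half, xbar + half)                  # 95% prediction interval for one future value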
Further Readings:
Grant E., and R. Leavenworth, Statistical Quality Control, McGraw-Hill, 1996.
Ryan T., Statistical Methods for Quality Improvement, John Wiley & Sons, 2000. A very good book for a starter.
For statistical inference, namely statistical testing and estimation, one needs to estimate the
population's parameter(s). Estimation involves the determination, with a possible error due to
sampling, of the unknown value of a population parameter, such as the proportion having a
specific attribute or the average value of some numerical measurement. To express the
accuracy of the estimates of population characteristics, one must also compute the standard
errors of the estimates. These are measures of accuracy that determine the possible errors
arising from the fact that the estimates are based on random samples from the entire
population, and not on a complete population census.
Standard error is a statistic indicating the accuracy of an estimate. That is, it tells us how
different the estimate (such as x̄) is likely to be from the population parameter (such as μ). It is,
therefore, the standard deviation of the sampling distribution of the estimator (such as x̄). The
following is a collection of standard errors for widely used statistics:
As one expects, the standard error decreases as the sample size increases.
However, the standard error of the estimate decreases by a factor of n½, not
n. For example, if you wish to reduce the error by 50%, the sample size must
be 4 times n, which is expensive. Therefore, as an alternative to increasing
sample size, one may reduce the error by obtaining"quality" data that provide
a more accurate estimate.
For a finite population of size N, the standard error of the sample mean of size
n, is:
S [(N -n)/(nN)]½.
For the difference between two independent sample means (with sample variances S1², S2²
and sizes n1, n2): [S1²/n1 + S2²/n2]½.
For a sample proportion: [P(1 − P)/n]½.
For a sample proportion from a finite population of size N: [P(1 − P)(N − n)/(nN)]½.
The last two formulas for finite population are frequently used when we wish
to compare a sub-sample of size n with a larger sample of size N, which
contains the sub-sample. In such a comparison, it would be wrong to treat the
two samples"as if" there were two independent samples. For example, in
comparing the two means one may use the t-statistic but with the standard
error:
SN [(N -n)/(nN)]½
as its denominator. Similar treatment is needed for proportions.
Two further entries concern simple linear regression: the standard error of the estimated
intercept is Sres[(Sxx + n x̄²) / (n Sxx)]½, and the standard error of the estimate (the standard
deviation of Y for a given X) is approximately Sy(1 − r²)½.
Note that if r = 0, then this standard error reaches its maximum possible value,
which is the standard deviation of Y.
Stability of an estimator: An estimator is stable if, by taking two different samples of the
same size, they produce two estimates having"small" absolute difference. The stability of an
estimator is measured by its reliability:
The larger the standard error, the less reliable is the estimate. Reliability of estimators is often
used to select the"best" estimator among all unbiased estimators.
At the planning stage of a statistical investigation, the question of sample size (n) is critical.
This is an important question; therefore, it should not be taken lightly. To take a larger sample
than is needed to achieve the desired results is wasteful of resources, whereas very small
samples often lead to results of no practical use for making good decisions. The main objective
is to obtain both a desirable accuracy and a desirable confidence level at minimum cost.
Students sometimes ask me, what fraction of the population do you need for good estimation?
I answer,"It's irrelevant; accuracy is determined by sample size." This answer has to be
modified if the sample is a sizable fraction of the population.
The confidence level of conclusions drawn from a set of data depends on the size of the data
set. The larger the sample, the higher is the associated confidence. However, larger samples
also require more effort and resources. Thus, your goal must be to find the smallest sample
size that will provide the desirable confidence.
For an item scored 0 or 1, for no or yes, the standard error (SE) of the estimated proportion p,
based on your random sample observations, is given by:
SE = [p(1 − p)/n]½
where p is the proportion obtaining a score of 1, and n is the sample size. This SE is the
standard deviation of the range of possible estimate values.
The SE is at its maximum when p = 0.5, therefore the worst case scenario occurs when 50%
are yes, and 50% are no.
Under this extreme condition, the sample size, n, can then be expressed as the largest integer
less than or equal to:
n = 0.25/SE²
To have some notion of the sample size, for example for SE to be 0.01 (i.e. 1%), a sample
size of 2500 will be needed; 2%, 625; 3%, 278; 4%, 156, 5%, 100.
Note, incidentally, that as long as the sample is a small fraction of the total population, the
actual size of the population is entirely irrelevant for the purposes of this calculation.
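A minimal sketch in Python reproduces the worst-case (p = 0.5) sample sizes quoted above:

for se in (0.01, 0.02, 0.03, 0.04, 0.05):
    print(se, round(0.25 / se**2))    # 2500, 625, 278, 156, 100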
Pilot Studies: When the needed estimates for sample size calculation is not available from an
existing database, a pilot study is needed for adequate estimation with a given precision. A
pilot, or preliminary, sample must be drawn from the population, and the statistics computed
from this sample are used in determination of the sample size. Observations used in the pilot
sample may be counted as part of the final sample, so that the computed sample size minus
the pilot sample size is the number of observations needed to satisfy the total sample size
requirement.
Sample Size with Acceptable Absolute Precision: The following present the widely used
method for determining the sample size required for estimating a population mean and
proportion.
Let us suppose we want an interval that extends δ units on either side of the estimator. We can
write
δ = Zα/2 σ/n½, and solving for n gives n = (Zα/2 σ/δ)².
Suppose, based on a pilot sample of size n, the estimated proportion is p; then the required
sample size with the absolute error size not exceeding δ, with 1 − α confidence, is:
n ≥ t² p(1 − p) / δ²,
where t = tα/2 is the value taken from the t-table with parameter d.f. = ν = n − 1,
corresponding to the desired 100(1 − α)% confidence level.
For large pilot sample sizes (n), say over 30, the simplest sample size formula is:
n = (Zα/2 S / δ)²,
where δ is the desirable margin of error (i.e., the absolute error), which is the half-length of
the 100(1 − α)% confidence interval.
Sample Size with Acceptable Type I and Type II Errors: One may use the following sample
size formula, which is based on the sizes of the Type I and Type II errors:
n = 2(Zα + Zβ)² S² / Δ²,
where α and β are the desirable Type I and Type II errors, respectively, S² is the variance
obtained from the pilot run, and Δ is the difference between the null and alternative values (μ0 − μa).
Sample Size with Acceptable Relative Precision: You may use the following sample size
formula for a desirable relative error δ in %, which requires an estimate of the
coefficient of variation (CV in %) from a pilot sample with size over 30:
n = (Zα/2 CV / δ)².
Sample Size Based on the Null and an Alternative: One may use power of the test to
determine the sample size. The functional relation of the power and the sample size is known
as the operating characteristic curve. On this curve, as sample size increases, the power
function increases rapidly. Let Δ be such that:
μa = μ0 + Δ
is an alternative representing a departure from the null hypothesis. We wish to be reasonably
confident of finding evidence against the null if in fact this particular alternative holds. That is,
the Type II error β is the probability of failing to find evidence significant at least at level α, when the
alternative holds. This implies
Required sample size = (z1 + z2)² S² / Δ²,
where z1 = |mean − μ0| / SE, z2 = |mean − μa| / SE, the mean is the current estimate for μ, and S
is the current estimate for σ.
All of the above sample size formulas could also be used for estimating the mean of any
unimodal population, with discrete or continuous random variables, provided the pilot run
size (n) is larger than (say) 30.
In estimating the sample size, when the standard deviation is not known, one may use
one-quarter of the range as a "good" estimate for the standard deviation (for samples of size
over 30) in place of S. It is good practice to compare this value with IQR/1.349.
One may extend the sample size determination to other useful statistics, such as the correlation
coefficient (r), based on acceptable Type I and Type II errors:
n = 3 + [(Zα/2 + Zβ) / C(r)]², where C(r) = ½ Ln[(1 + r)/(1 − r)] is the Fisher transformation
of the anticipated correlation r.
The aim of applying any one of the above sample size formulas is to improve your pilot
estimates at feasible costs.
You might like to use Sample Size Determination JavaScript to check your computations.
Further Reading:
Kish L., Survey Sampling, Wiley, 1995.
Murphy K., and B. Myors, Statistical Power Analysis, L. Erlbaum Associates, 1998. Provides a simple and general sample
size determination for hypothesis tests.
However, what is the variance of all k groups combined? The answer must consider the
sample size ni of the ith group:
σ²combined = [Σ ni(Si² + (x̄i − x̄)²)] / Σ ni,  where x̄ = Σ ni x̄i / Σ ni and the Si² are the group variances.
Notice that the above formula allows us to split up the total variance into its two component
parts. This splitting process permits us to determine the extent to which the overall variation
is inflated by the difference between group means. What the variation would be if all groups
had the same mean? ANOVA is a well-known application of this concept where the equality
of several means is tested.
Subjective Mean and Variance: In many applications we saw how to make decisions based
on objective data; however, an informed decision-maker might be able to combine his/her
subjective input with the objective data as two sources of information.
Application: Suppose the following information is available from two independent sources:
You might like to use Revising the Mean and Variance JavaScript in performing some
numerical experimentation. You may apply it for validating the above example and for a
deeper understanding of the concept where more than two sources of information are to be
combined.
Further Reading:
Tsao H. and T. Wright, On the maximum ratio: A tool for assisting inaccuracy assessment, The
American Statistician, 37(4), 1983.
μposterior = (W1 x̄ + W2 μprior) / (W1 + W2),
where W1 is the precision (the inverse of the variance) of the sample mean and W2 is the
precision of the prior mean. Also, the precision of the posterior distribution of μ
is W1 + W2, that is, the sum of the sample precision and the prior precision.
The posterior mean will lie between the sample mean and the prior mean. The
posterior variance will be less than both the sample and prior variances.
In this Web site we do not discuss Bayesian inference, because that would take us
into a lot more detail than we intend to cover. However, the basic notion of
combining the sample mean and the prior mean in inverse proportion to their
variances is both interesting and useful.
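A minimal sketch in Python of this precision-weighted combination; the sample and prior figures below are assumed values for illustration only:

xbar, se_xbar = 12.0, 2.0            # sample mean and its standard error (assumed)
prior_mean, prior_sd = 10.0, 3.0     # prior mean and its standard deviation (assumed)

w1, w2 = 1 / se_xbar**2, 1 / prior_sd**2                   # sample and prior precisions
posterior_mean = (w1 * xbar + w2 * prior_mean) / (w1 + w2)
posterior_var = 1 / (w1 + w2)
print(posterior_mean, posterior_var)   # mean lies between 10 and 12; variance below both inputs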
You may like using the Bayesian Statistical Inference JavaScript for checking
your computation and performing some experiment.
Further Reading:
Ghosh M., and G. Meeden, Bayesian Methods for Finite Population Sampling, Chapman & Hall/CRC,
1997.
As indicated in the above matrix a Type-I error occurs when, based on your
data, you reject the null hypothesis when in fact it is true. The probability of a
type-I error is the level of significance of the test of hypothesis and is denoted
by α.
Type-I error is often called the producer's risk: the risk that consumers reject a good
product or service indicated by the null hypothesis. That is, when a producer
introduces a good product, he or she takes the risk that the consumer will
reject it.
A Type-II error occurs when you do not reject the null hypothesis when it is in
fact false. The probability of a Type-II error is denoted by β. The quantity 1 − β
is known as the Power of a Test. A Type-II error can be evaluated for any
specific alternative hypothesis stated in the "not equal to" form, as a
competing hypothesis.
Type-II error is often called the consumer's risk: the risk of not rejecting a possibly
worthless product or service indicated by the null hypothesis.
Students often raise questions, such as what are the 'right' confidence intervals,
and why do most people use the 95% level? The answer is that the decision-
maker must consider both the Type I and II errors and work out the best
tradeoff. Ideally one wishes to reduce the probability of making these types of
error; however, for a fixed sample size, we cannot reduce one type of error
without at the same time increasing the probability of another type of error.
Nevertheless, the only way to reduce the probabilities of both types of error simultaneously
is to increase the sample size. That is, by having more information one makes
a better decision.
Rule 1: Accept lots with one or fewer defectives; therefore, a lot has
either 0 defective or 1 defective.
Rule 2: Accept lots with two or fewer defectives; therefore, a lot has
either 0,1, or 2 defective(s).
On the basis of the binomial distribution, P(0 or 1 defective) is 0.7367. This means
that, with a defective rate of 0.10, Big Y will accept about 74% of the tested lots and
will reject about 26% of the lots even though they are good lots. The 26% is the
producer's risk, or the α level. This level is analogous to a Type I error --
rejecting a true null, or, in other words, rejecting a good lot. In this example,
for illustration purposes, the lot represents a null hypothesis. The rejected lot
goes back to the producer; hence, producer's risk. If Big Y adopts Rule 2,
then the producer's risk decreases. P(0, 1, or 2 defectives) is 0.9298; therefore,
Big Y will accept about 93% of all tested lots, and 7% will be rejected even though
the lots are acceptable. The primary reason for this is that, although the
probability of a defective is 0.10, under Rule 2 Big Y allows for a higher
defective acceptance rate. Big Y thereby increases its own risk (the consumer's risk), as
stated previously.
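A minimal sketch in Python (assuming scipy is available) reproduces these acceptance probabilities; the test sample of 10 items per lot is inferred from the figures above rather than stated here:

from scipy.stats import binom

n, p = 10, 0.10              # assumed: 10 items tested per lot, 10% defective rate
print(binom.cdf(1, n, p))    # Rule 1: accept with at most 1 defective ~ 0.737
print(binom.cdf(2, n, p))    # Rule 2: accept with at most 2 defectives ~ 0.930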
Making a Good Decision: Given that there is a relevant profit (which could be
negative) for each outcome of your decision, and a prior probability (before
testing) for the null hypothesis to be true, the objective is to make a good
decision. Let us denote the profits for each cell in the decision table as $a, $b,
$c and $d (column-wise), respectively. The expected profit is [αa + (1 − α)b] if the
null is true, and [(1 − β)c + βd] if the null is false.
Now, having a prior (i.e., before testing) subjective probability of p that the
null is true, the expected profit of your decision is:
Expected profit = p[αa + (1 − α)b] + (1 − p)[(1 − β)c + βd].
A good decision makes this profit as large as possible. To this end, we must
suitably choose the sample size and all the other factors in the above profit
function.
Note that, since we are using a subjective probability expressing the strength
of belief assessment of the truthfulness of the null hypothesis, it is called a
Bayesian Approach to statistical decision making, which is a standard
approach in decision theory.
Further Reading:
Cochran W., Planning and Analysis of Observational Studies, Wiley, 1983.
Hypothesis Testing: Rejecting a Claim
To perform a hypothesis test, one must be very specific about the test one
wishes to perform. The null hypothesis must be clearly stated, and the data
must be collected in a repeatable manner. If there is any subjectivity, the
results are technically not valid. All of the analyses, including the sample size,
significance level, the time, and the budget, must be planned in advance, or
else the user runs the risk of"data diving".
The real question in statistics is not whether a null hypothesis is correct, but
whether it is close enough to be used as an approximation.
Test of Hypotheses
In most statistical tests concerning μ, we start by assuming that the σ²'s, and the
higher moments, such as skewness and kurtosis, are equal. Then, we
hypothesize that the μ's are equal, which is the null hypothesis.
Then we test with a calculated t-value. For simplicity, suppose we have a two-
sided test. If the calculated t is close to 0, we say"it is good", as we expected.
If the calculated t is far from 0, we say,"the chance of getting this value of t,
given my assumption that the populations are statistically the same, is so small
that I will not believe the assumption. We will say that the populations are not
equal; specifically the means are not equal."
In this test, we need (among others) the condition that the population variances
(i.e., treatment impacts on central tendency but not variability) are equal.
However, this test is robust to violations of that condition if n's are large and
almost the same size. A counter example would be to try a t-test between (11,
12, 13) and (20, 30, 40). The pooled and unpooled tests both give t statistics of
3.10, but the degrees of freedom are different: d.f. = 4 (for pooled) or d.f.
about 2 (for unpooled). Consequently the pooled test gives p = 0.036 and the
unpooled p = 0.088. We could go down to n = 2 and get something still more
extreme.
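A minimal sketch in Python (assuming scipy is available) reproduces this counterexample:

from scipy.stats import ttest_ind

a, b = [11, 12, 13], [20, 30, 40]
print(ttest_ind(a, b, equal_var=True))    # pooled:   |t| ~ 3.10, p ~ 0.036 (d.f. = 4)
print(ttest_ind(a, b, equal_var=False))   # unpooled: |t| ~ 3.10, p ~ 0.09  (Welch, d.f. ~ 2)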
More numerical examples with applications are given in The Statistical Tables
section.
You might like to use Testing the Mean, and Testing the Variance in
performing more of these tests
You might need to use Sample Size Determination JavaScript at the design
stage of your statistical investigation in decision making with specific
subjective requirements.
In this treatment there are two parties: One party (or a person) proposes the
null hypothesis (the claim). Another party proposes an alternative hypothesis.
A significance level α and a sample size n are agreed upon by both parties.
The next step is to compute the relevant statistic based on the null hypothesis
and the random sample of size n. Finally, one determines the rejection region.
The conclusion based on this approach is as follows:
If the computed statistic falls within the rejection region, then Reject the null
hypothesis; otherwise Do Not Reject the null hypothesis (the claim).
You may ask: How do you determine the critical value (such as the z-value) for
the rejection region for one- and two-tailed hypotheses? What is the rule?
First, you have to choose a significance level α. Knowing that the null
hypothesis is always in "equality" form, the alternative hypothesis has one
of three possible forms: "greater-than", "less-than", or "not equal to". The
first two forms correspond to a one-tailed hypothesis, while the last one
corresponds to a two-tailed hypothesis.
If your alternative is in the form of "greater-than", then z is the value that gives you an
area to the right tail of the distribution that is equal to .
If your alternative is in the form of "less-than", then z is the value that gives you an
area to the left tail of the distribution that is equal to .
If your alternative is in the form of "not equal to", then there are two z values, one
positive and the other negative. The positive z is the value that gives you an /2 area
to the right tail of the distribution. While, the negative z is the value that gives you an
/2 area to the left tail of the distribution.
The above rule can be generalized and implemented for determining the critical value for any
test of hypothesis, you must first master reading the statistical tables, because, as you see, not
all tables in your textbook are presented in the same format.
The p-value, which directly depends on a given sample attempts to provide a measure of the
strength of the results of a test for the null hypothesis, in contrast to a simple reject or do not
reject in the classical approach to the test of hypotheses. If the null hypothesis is true, and if
the chance of random variation is the only reason for sample differences, then the p-value is a
quantitative measure to feed into the decision-making process as evidence. The following
table provides a reasonable interpretation of p-values:
P-value Interpretation
P 0.01 very strong evidence against H0
0.01 P 0.05 moderate evidence against H0
0.05 P 0.10 suggestive evidence against H0
0.10 P little or no real evidences against H0
This interpretation is widely accepted, and many scientific journals routinely publish papers
using this interpretation for the result of a test of hypothesis.
For the fixed-sample size, when the number of realizations is decided in advance, the
distribution of p is uniform, assuming the null hypothesis is true. We would express this as
P(p x) = x. That means the criterion of p 0.05 achieves of 0.05.
Understand that the distribution of p-values under null hypothesis H0 is uniform, and thus
does not depend on a particular form of the statistical test. In a statistical hypothesis test, the
P value is the probability of observing a test statistic at least as extreme as the value actually
observed, assuming that the null hypothesis is true. The value of p is defined with respect to a
distribution. Therefore, we could call it"model-distribution hypothesis" rather than"the null
hypothesis".
In short, it simply means that, if the null had been true, the p-value is the probability against
the null in that case. The p-value is determined by the observed value; however, this makes it
difficult to even state the inverse of p.
Finally, since the p-values are random variables, one cannot compare several p-values for any
statistical conclusions (nor order them). This is a common mistake many people do, therefore,
the above table is not intended for such a comparison.
You might like to use The P-values for the Popular Distributions JavaScript.
Further Readings:
Arsham H., Kuiper's P-value as a Measuring Tool and Decision Procedure for the Goodness-of-fit Test, Journal of Applied
Statistics, Vol. 15, No.3, 131-135, 1988.
Good Ph.., Resembling Methods: A Practical Guide to Data Analysis, Springer Verlag, 1999.
Blending the Classical and the P-value Based Approaches in Test of Hypotheses
A p-value is a measure of how much evidence you have against the null hypothesis. Notice
that the null hypothesis is always in = form, and does not contain any forms of inequalities.
The smaller the p-value, the more evidence you have. In this setting, the p-value is based on
the hull hypothesis and has nothing to do with an alternative hypothesis and therefore with
the rejection region. In recent years, some authors try to use the mixture of the classical and
the p-value approaches. It is based on the critical value obtained from given , the computed
statistics and the p-value. This is a blend of two different schools of thought. In this setting,
some textbooks compare the p-value with the significance level to make decisions on a given
test of hypothesis. The larger the p-value is when compared with (in one-sided alternative
hypothesis, and /2 for the two sided alternative hypotheses), the less evidence we have for
rejecting the null hypothesis. In such a comparison, if the p-value is less than some threshold
(usually 0.05, sometimes a bit larger like 0.1 or a bit smaller like 0.01) then you reject the
null hypothesis. The following deal with such a combined approach.
Use of P-value and : In this setting, we must also consider the alternative hypothesis in
drawing the rejection region. There is only one p-value to compare with (or /2). Know
that, for any test of hypothesis, there is only one p-value. The following outlines the
computation of the p-value and the decision process involved in a given test of hypothesis:
1. P-value for One-sided Alternative Hypotheses: The p-value is defined as the area
under the right tail of distribution, if the rejection region in on the right tail; if the
rejection region is on the left tail, then the p-value is the area under the left tail (in
one-sided alternative hypotheses).
2. P-value for Two-sided Alternative Hypotheses: If the alternative hypothesis is two-
sided (that is, rejection regions are both on the left and on the right tails), then the p-
value is the area under the right tail or to the left tail of the distribution, depending on
whether the computed statistic is closer to the right rejection region or left rejection
region.
For symmetric densities (such as t-density), the left and right tails p-values are the
same. However, for non-symmetric densities (such as Chi-square) use the smaller of
the two. This makes the test more conservative. Notice that, for a two sided-test
alternative hypotheses, the p-value is never greater than 0.5.
3. After finding the p-value as defined here, you compare it with a pre-set value for
one-sided tests, and with /2 for two sided-test. The larger the p-value is when
compared with (in one-sided alternative hypothesis, and /2 for the two sided
alternative hypotheses), the less evidence we have for rejecting the null hypothesis.
To avoid looking-up the p-values from the limited statistical tables given in your textbook,
most professional statistical packages such as SAS and SPSS provide the two-tailed p-value.
Based on where the rejection region is, you must find out what p-value to use.
Some textbooks have many misleading statements about p-value and its applications. For
example, in many textbooks you find the authors double the p-value to compare it with
when dealing with the two-sided test of hypotheses. One wonders how they do it in the case
when"their" p-value exceeds 0.5? Notice that, while it is correct to compare the p-value with
for a one sided tests of hypotheses , for two-sided hypotheses, one must compare the p-
value with /2, NOT with 2 times p-value, as some textbooks advise. While the decision is
the same, there is a clear distinction here and an important difference, which the careful
reader will note.
How to set the appropriate value? You may have wondered why = 0.05 is so popular
in a test of hypothesis. = 0.05 is traditional for tests, but is arbitrary in its origins suggested
by R.A. Fisher, who suggested it in the spirit of 0.05 being the biggest p-value at which one
would think maybe the null hypothesis in a statistical experiment was to be considered false.
This was also a tradeoff between"type I error" and "type II error"; that we do not want to
accept the wrong null hypothesis, but we do not want to fail to reject the false null
hypothesis, either. As a final note, the average of these two p-values is often called the mid-p
value.
Conversions from two-sided to one-sided probabilities: Let C be the probability for a two-
sided confidence interval (CI) constructed for an estimate. The probability (C1) that either the
estimate is greater than the lower limit or that it is less than the upper limit can be computed
by using:
You might need to use Sample Size Determination JavaScript at the design stage of your
statistical investigation in decision making with specific, subjective requirements.
One way to make the Bonferroni t-test less conservative is to use the estimate of the
population variance computed from within the groups in the analysis of variance.
t = ( 1 - 2 )/ ( 2 / n1 + 2 / n2 )1/2,
Suppose we have n number of P-values: p(i), i =1, .., n, in ascending order corresponding to
independent tests. Let j be the largest integer, such as:
If no such j exists, reject all hypotheses; otherwise, reject all hypotheses with p(i) / j. This
provides a strong control of the family-wise error rate at level.
There are other improvements on the Bonferroni adjustment when multiple tests are
independent or positively dependent. However, the Hommel's method is the most powerful
compared with other methods.
Further Readings:
Hommel G., Bonferroni procedures for logically related hypotheses, Journal of Statistical Planning and Inference, 82, 119-
128, 1999.
Kost J., and M. McDermott, Combining dependent P-values, Statistics and Probability Letters, 60, 183-190, 2002.
Wasteful P., and S. Young, Resembling-Based Multiple Testing: Examples and Methods for P-Value Adjustment, Wiley, 1992.
Wright S., Adjusted P-values for simultaneous inference, Biometrics, 48, 1005-1013, 1992.
The power of a test plays the same role in hypothesis testing that Standard Error played in
estimation. It is a measuring tool for assessing the accuracy of a test or in comparing two
competing test procedures.
The power of a test is the probability of rejecting a false null hypothesis when the null
hypothesis is false. This probability is inversely related to the probability of making a Type II
error, not rejecting the null hypothesis when it is false. Recall that we choose the probability
of making a Type I error when we set . If we decrease the probability of making a Type I
error, then we increase the probability of making a Type II error. Therefore, there are
basically two errors possible when conducting a statistical analysis; type I error and and type
II error:
Power and Alpha (): Thus, the probability of not rejecting a true null has the same
relationship to Type I errors as the probability of correctly rejecting an untrue null does to
Type II error. Yet, as I mentioned if we decrease the odds of making one type of error we
increase the odds of making the other type of error. What is the relationship between Type I
and Type II errors? For a fixed sample size, decreasing one type of error increases the size of
the other one.
Power and the Size Effect: Anytime we test whether a sample differs from a population, or
whether two samples come from 2 separate populations, there is the condition that each of the
populations we are comparing has its own mean and standard deviation (even if we do not
know it). The distance between the two population means will affect the power of our test.
This is known as the size of treatment, also known as the effect size, as shown in the
following table with the three popular values for :
Power and the Size of Variance 2: The greater the variance S2, the lower the power 1-.
Anything that effects the extent to which the two distributions share common values will
increase (the likelihood of making a Type II error)
Power and the Sample Size: The smaller the sample sizes n, the lower the power. Very
small n produces power so low that false hypotheses are accepted.
In practice, the first three factors are often fixed. Only the sample size can be controlled by
the statistician and that only within budget constraint. There exists a tradeoff between budget
and achievement of desirable accuracy in any analysis.
A Numerical Example: The power of a test is most easily understood by viewing it in the
context of a composite test. A composite test requires the specification of a population mean
as the alternative hypothesis. For example, using Z-test of hypothesis in the following Figure.
The power is developed from specification of an alternative hypothesis such as = 2.5, and
= 3. The resultant distribution under this alternative shifts to the right 2.5 units with the
shaded area representing the power of the test, correctly rejecting a false null.
Power of a Test
Click on the image to enlarge it
Not rejecting the null hypothesis when it is false is defined as a Type II error, and is denoted
by the region. In the above Figure this region lies to the left of the critical value. In the
configuration shown in this Figure, falls to the left of the critical value (and below the
statistic's density (or probability) function under the alternative hypothesis Ha). The is also
defined as the probability of not-rejecting a false null hypothesis when it is false, also called a
miss. Related to the value of is the power of a test. The power is defined as the probability
of rejecting the null hypothesis given that a specific alternative is true, and is computed as (1-
).
A Short Discussion: Consider testing a simple null versus simple alternative. In the Neyman-
Pearson setup, an upper bound is set for the probability of a given Type I error (), and then
it is desirable to find tests with low probability of type II error () given this. The usual
justification for this is that"we are more concerned about a Type I error, so we set an upper
limit on the that we can tolerate." I have seen this sort of reasoning in elementary texts and
also in some advanced ones. It doesn't seem to make any sense. When the sample size is
large, for most standard tests, the ratio / tends to 0. If we care more about Type I error than
Type II error, why should this concern dissipate with increasing sample size?
This is indeed a drawback of the classical theory of testing statistical hypotheses. A second
drawback is that the choice lies between only two test decisions: reject the null or accept the
null. It is worth considering approaches that overcome these deficiencies. This can be done,
for example, by the concept of profile-tests at a 'level' . Neither the Type I nor Type II error
rates are considered separately, but they are the ratio of a correct decision. For example, we
accept the alternative hypothesis Ha and reject the null H0, if an event is observed which is at
least a-times greater under Ha than under H0. Conversely, we accept H0 and reject Ha, if an
event is observed which is at least a-times greater under H0 than under Ha. This is a
symmetric concept which is formulated within the classical approach.
Power of Parametric versus Non-parametric Tests: As a general rule, for a given sample
size n, the parametric tests are more powerful than their non-parametric counterparts. The
primarily reason for this is that we have emphasized parametric tests. Moreover, among the
parametric tests, those which use correlation are more powerful, such as the before-and-after
test. This is known as a Variance Reduction Technique used in system simulation to increase
the accuracy (i.e., reduce variation) without increasing the sample size.
Correlation Coefficient as a Measuring Tool and Decision Criterion for the Effect Size:
The correlation coefficient could be obtained and used as a measuring tool and decision
criteron for the strength of the effect size based on the computed test-statistic for major
hypothesis testing.
The correlation coefficient r stands as a very useful and accessible index of the magnitude of
effect. It is commonly accepted that the small, medium, and large effect sizes correspond to r-
values over 0.1, 0.3, and 0.5, respectively. The following are needed transformation of some
major inferential statistics to the r-value:
You might like to use Sample Size Determination JavaScript at the design stage of your
statistical investigation in decision making with specific subjective requirements.
Further Reading:
Murphy K., and B. Myors, Statistical Power Analysis, L. Erlbaum Associates, 1998.
One must use a statistical technique called non-parametric if it satisfies at least one of the
following five types of criteria:
1. The data entering the analysis are enumerative; that is, counted data represent the
number of observations in each category or cross-category.
2. The data are measured and/or analyzed using a nominal scale of measurement.
3. The data are measured and/or analyzed using an ordinal scale of measurement.
4. The inference does not concern a parameter in the population distribution; for
example, the hypothesis that a time-ordered set of observations exhibits a random
pattern.
5. The probability distribution of the statistic upon which the analysis is based is not
dependent upon specific information or conditions (i.e., assumptions) about the
population(s) from which the sample(s) are drawn, but only upon general
assumptions, such as a continuous and/or symmetric population distribution.
According to these creteria, the distinction of non-parametric is accorded either because of
the level of measurement used or required for the analysis, as in types 1 through 3; the type of
inference, as in type 4, or the generality of the assumptions made about the population
distribution, as in type 5.
For example, one may use the Mann-Whitney Rank Test as a non-parametric alternative to
Students T-test when one does not have normally distributed data.
Non-parametric tests are those used when some specific conditions for the ordinary tests are
violated.
Distribution-free tests are those for which the procedure is valid for all different shape of the
population distribution.
For example, the Chi-square test concerning the variance of a given population is parametric
since this test requires that the population distribution be normal. The Chi-square test of
independence does not assume normality condition, or even that the data are numerical. The
Kolmogorov-Smirnov test is a distribution-free test, which is applicable to comparing two
populations with any distribution of continuous random variable.
The following section is an interesting non-parametric procedure with various and useful
applications.
R = Pr (X Y)
One may use:
The estimator RS = U/(r s),
where U is the number of pairs (xi, yj) such that xi yj, for all i = 1, 2, ,r, and j = 1, 2,..,s.
This estimator is an unbiased one with the minimum variance for R. It is important to know
that the estimate has an upper limit, non-negative delta value for its accuracy:
You might like to use the Kolmogorov-Smirnov Test for Two Populations and Comparing
Two Random Variables in checking your computations and performing some numerical
experiment for a deeper understanding of these concepts.
Further Readings:
Arsham H., A generalized confidence region for stress-strength reliability, IEEE Transactions on Reliability, 35(4), 586-589,
1986.
Conover W., Practical Nonparametric Statistics, Wiley, 1998.
Hollander M., and D. Wolfe, Nonparametric Statistical Methods, Wiley, 1999.
Kotz S., Y. Lumelskii, and M. Pensky, The Stress-Strength Model and Its Generalizations: Theory and Applications,
Imperial College Press, London, UK, 2003, distributed by World Scientific Publishing.
Hypotheses Testing
Let us consider a simple problem of inference about population mean. We have a large
population with known mean. We take a sample and wish to know whether the sample mean
is significantly different from the population mean. Our null hypothesis is that it is not.
The theory of probability is only capable of dealing with random variables which generate a
frequency distribution "in the long run". We have one fixed population and one fixed sample.
There is nothing random about this problem and the experiment is conducted once, so there is
no "long run".
We pretend that the experiment was not conducted once, but an infinite number of times, that
is, we consider all possible samples of the same size. We assume that each sample mean
includes an "error", which is independently and normally distributed about zero. The sample
mean now becomes our random variable, which we call our "statistic". We can now apply the
t-test or z-test interpretation of probability.
We are now able to determine the probability of a randomly chosen sample mean having a
value at least as extreme as our original sample mean. Note that we are implicitly assuming
that the null hypothesis is true. This probability is our p-value which we apply to the original
problem.
Remember that, in the t-tests for differences in means, there is a condition of equal
population variances that must be examined. One way to test for possible differences in
variances is to do an F test. However, the F test is very sensitive to violations of the normality
condition; i.e., if populations appear not to be normal, then the F test will tend to reject too
often the null of no differences in population variances.
You might like to use the following JavaScript to check your computations and to perform
some statistical experiments for deeper understanding of these concepts:
The purpose is to compare the sample mean with the given population mean. The aim is to
judge the claimed mean value, based on a set of random observations of size n. A necessary
condition for validity of the result is that the population distribution is normal, if the sample
size n is small (say less than 30).
H0 = = 0
T = [( - 0) n1/2] / S
2
Where is the estimated mean and S is the estimated variance based on n random
observations.
The above statistic is distributed as a t-distribution with parameter d.f. = = (n-1). If the
absolute value of the computed T-statistic is"too large" compared with the critical value of
the t-table, then one rejects the claimed value for the population's mean.
This test could also be used for testing similar claims for other unimodal populations
including those with discrete random variables, such as proportion, provided there are
sufficient observations (say, over 30).
You might like to use Testing the Mean JavaScript in checking your computations. and
Sample Size Determination JavaScript at the design stage of your statistical investigation in
decision making with specific subjective requirements.
If an estimate is an unbiased such as sample mean, then it is a good idea to pool the estimates
to get a single estimate from several relatively small samples. The pooled estimate is a
“good” estimate when compared with each individual estimates.
Pooled Mean: Supposed we have m number of estimates (i), of sample size n(i), for the
population expected value , the pooled estimate is:
[ n(i) (i)] / [n(i)], both sums are over all values of i = 1, 2,. . ., m.
Pooled Variance: Since the sample variance is also unbiased estimate of population variance
2, therefore, it is a good idea to pool the estimates to get a single estimate from m number of
estimates S(i)2, of sample size n(i), the pooled estimate is:
{[ [n(i) – 1] S(i)2 ] } / {[ n(i)] – m}, both sums are over all values of i = 1, 2,…, m.
We pool variance estimates for other good reasons. Depending on a particular reason, then
the conclusion might have to be made explicitly conditional on e.g., the validity of the equal-
variance model. There are several different good reasons for pooling:
to get a single stable estimate from several relatively small samples, where
variance fluctuations seem not to be systematic; or
for convenience, when all the variance estimates are near enough to equality;
or
when there is no choice but to model variance, as in simple linear regression
with no replicated X values.
You might like to use JavaScript Pooling the Means, and Variances.
Pooled Standard Deviation: Both the sample mean, and variance are unbiased estimates for
the population parameters, , and 2, respectively, however the sample standard deviation in
NOT an unbiased estimate of population standard deviation . This is so, because of an
equality known as the Jensen's inequality when applied to a concave function, i.e., the square
root of the unbiased variance estimate. Therefore, pooling standard deviation directly is
meaningless; the best one can do to take the square root of the pooled variance
Notice that, when sample sizes are large and nearly equal, so that there is essentially no
difference between the pooled and unpooled estimates of standard errors of paired-data
samples, and degrees of freedom are nearly asymptotic. This rationale can fall apart for any
other cases. One must pool variance rather than merely taking a shortcut in the computation
of standard errors.
If you calculate the test without the assumption, you have to determine the degrees of
freedom (d.f.). The formula works in such a way that d.f. will be less if the larger sample
variance is in the group with the smaller number of observations. This is the case in which the
two tests will differ considerably. A study of the formula for the d.f. is most enlightening, and
one must understand the correspondence between the unfortunate design, having the most
observations in the group with little variance, and the low d.f. and accompanying large t-
value.
Applications: When doing t tests for differences in means of populations, for independent
samples case:
1. For differences in means that do not make any assumption about equality of
population variances, use the standard error formula:
[S21/n1 + S22/n2]½,
with d.f. = = n1 or n2 whichever is smaller.
with parameter d.f. = = (n1 + n2- 2), for n1, and n2 greater than to 1, where the pooled
variance is:
3. If total N is less than 50 and one sample is 1/2 the size of the other (or less), and if the
smaller sample has a standard deviation at least twice as large as the other sample,
then apply the procedure given in item no. 1, but adjust d.f. parameter of the t-test to
the largest integer less than or equal to:
where:
A = [S21/n1 + S22/n2]2,
Otherwise, do not worry about the problem of having an actual level that is much
different than what you have set it to be.
The last approach, which is very general with conservative results, can be implemented using
Testing Two Populations JavaScript.
You might like to use JavaScript Testing the Mean for One Population
Duncan's multiple-range test: This is one of the many multiple comparison procedures. It is
based on the standardized range statistic by comparing all pairs of means while controlling
the overall Type I error at a desirable level. While it does not provide interval estimates of the
difference between each pair of means, it does indicate which means are significantly
different from the others. For determining the significant differences between a single control
group mean and the other means, one may use the Dunnett's multiple-comparison test.
Two random variables X and Y having distribution FX(x) and FY(y) respectively, are said to
be equivalent, or equal in rule, or equal in distribution, if and only if they have the same
distribution function. That is,
FX(z) = FY(z), for all z,
There are different tests depending on the intended applications. The widely used tests for
statistical equality of populations are as follow:
1. Equality of Two Normal Populations: One may use the Z-test and F-test to
check the equality of the means, and the equality of variances, respectively.
2. Testing a Shift in Normal Populations: Often we are interested in testing for a
given shift in a given population distribution, that is testing if a random
variable Y is equal in distribution to another X + c for some constant c. In
other words, the distribution of Y is the distribution of X shifted. In testing any
shift in distribution one needs to test for normality first, and then testing the
difference in expected values by applying the two-sided Z-test with the null
hypothesis of:
H0: Y - X = c.
The normal or Gaussian distribution is a continuous symmetric distribution that follows the
familiar bell-shaped curve. One of its nice features is that, the mean and variance uniquely
and independently determines the distribution.
Therefore, for testing the statistical equality of two independent normal populations, one must
first perform the Lilliefors' Test for Normality to assess this condition. Given that both
populations are normally distributed, then one must performing two more tests, namely the
test for equality of the two means and the test for equality of the two variances. Both of these
tests can be carried out by using the Test of Hypotheses for Two Populations JavaScript.
The tests we have learned up to this point allow us to test hypotheses that examine the
difference between only two means. Analysis of Variance or ANOVA will allow us to test
the difference between two or more means. ANOVA does this by examining the ratio of
variability between two conditions and variability within each condition. For example, say we
give a drug that we believe will improve memory to a group of people and give a placebo to
another group of people. We might measure memory performance by the number of words
recalled from a list we ask everyone to memorize. A t-test would compare the likelihood of
observing the difference in the mean number of words recalled for each group. An ANOVA
test, on the other hand, would compare the variability that we observe between the two
conditions to the variability observed within each condition. Recall that we measure
variability as the sum of the difference of each score from the mean. When we actually
calculate an ANOVA we will use a short-cut formula
Thus, when the variability that we predict between the two groups is much greater than the
variability we don't predict within each group, then we will conclude that our treatments
produce different results.
Consider the following (small integers, indeed for illustration while saving space) random
samples from three different populations.
With the null hypothesis:
H0: µ1 = µ2 = µ3,
and the alternative:
Ha: at least two of the means are not equal.
At the significance level = 0.05, the critical value from F-table is
F 0.05, 2, 12 = 3.89.
Sum Mean
Sample P1 2 3 1 3 1 10
Sample P2 3 4 3 5 0 15
Sample P3 5 5 5 3 2 20
Computation of sample SST: With the grand mean = 3, first, start with taking the difference
between each observation and the grand mean, and then square it for each data point.
Sum
Sample P1 1 0 4 0 4 9
Sample P2 0 1 0 4 9 14
Sample P3 4 4 4 0 1 13
Second, let all the data in each sample have the same value as the mean in that sample. This
removes any variation WITHIN. Compute SS differences from the grand mean.
Sum
Sample P1 1 1 1 1 1 5
Sample P2 0 0 0 0 0 0
Sample P3 1 1 1 1 1 5
Therefore SSB = 10, with d.f = (m-1)= 3-1 = 2 for m=3 groups.
Third, compute the SS difference within each sample using their own sample means. This
provides SS deviation WITHIN all samples.
Sum
Sample P1 0 1 1 1 1 4
Sample P2 0 1 0 4 9 14
Sample P3 1 1 1 1 4 8
SSW = 26 with d.f = 3(5-1) = 12. That is, 3 groups times (5 observations in each -1)
Results are: SST = SSB + SSW, and d.fSST = d.fSSB + d.fSSW, as expected.
Now, construct the ANOVA table for this numerical example by plugging the results of your
computation in the ANOVA Table. Note that, the Mean Squares are the Sum of squares
divided by their Degrees of Freedom. F-statistics is the ratio of the two Mean Squares.
Conclusion: There is not enough evidence to reject the null hypothesis H0.
The Logic behind ANOVA: First, let us try to explain the logic and then illustrate it with a
simple example. In performing the ANOVA test, we are trying to determine if a certain
number of population means are equal. To do that, we measure the difference of the sample
means and compare that to the variability within the sample observations. That is why the test
statistic is the ratio of the between-sample variation (MSB) and the within-sample variation
(MSW). If this ratio is close to 1, there is evidence that the population means are equal.
Here is a good application for you: Many people believe that men get paid more in the
business world, in a specific profession at specific level, than women, simply because they
are male. To justify or reject such a claim, you could look at the variation within each group
(one group being women's salaries and the other group being men's salaries) and compare
that to the variation between the means of randomly selected samples of each population. If
the variation in the women's salaries is much larger than the variation between the men's and
women's mean salaries, one could say that because the variation is so large within the
women's group that this may not be a gender-related problem.
Now, getting back to our numerical example of the drug treatment to improve memory vs the
placebo. We notice that: given the test conclusion and the ANOVA test's conditions, we may
conclude that these three populations are in fact the same population. Therefore, the ANOVA
technique could be used as a measuring tool and statistical routine for quality control as
described below using our numerical example.
Construction of the Control Chart for the Sample Means: Under the null hypothesis, the
ANOVA concludes that µ1 = µ2 = µ3; that is, we have a "hypothetical parent population."
The question is, what is its variance? The estimated variance (i.e., the total mean squares) is
36 / 14 = 2.57. Thus, estimated standard deviation is = 1.60 and estimated standard deviation
for the means is 1.6 / 5½ 0.71. Under the conditions of ANOVA, we can construct a control
chart with the warning limits = 3 ± 2(0.71); the action limits = 3 ± 3(0.71). The following
figure depicts the control chart.
Click on the image to enlarge it and THEN print it.
ANOVA and Quality Control
Conditions for Using ANOVA Test: The following conditions must be tested prior to using
ANOVA test, otherwise the results are not valid: Randomness of the samples, Normality of
populations, and Equality of Variances for all populations.
You May Ask Why Not Using Pair-wise T-test Instead ANOVA? Here are two reasons:
Performing pair-wise t-test for K populations, you will need to perform, K(K-1)/2 pair-wise t-
test. Now suppose the significance level for each test is set at 5% level, then the overall
significance level would be approximately equal to 0.05K(K-1)/2. For example, for K = 5
populations, you have to performing 10 pair-wise t-tests, moreover, the overall significance
level is equal to 50%, which is too high type-I error for any statistical decision making.
You might like to use ANOVA: Testing Equality of Means for your computations, and then
to interpret the results in managerial (not technical) terms.
You might need to use Sample Size Determination JavaScript at the design stage of your
statistical investigation in decision making with specific subjective requirements.
In testing the equality of several means, often the raw data are not available. In such a case,
one must perform the needed analysis based on secondary data using the data summaries;
namely, the triple-set: The sample sizes, the sample means, and the sample variances.
Suppose one of the samples is of size n having the sample mean , and the sample variance
S2. Let:
The JavaScript Subjective Assessment of Estimates tests the claim that at least the ratio of
one estimate to the largest estimate is as large as a given claimed value.
Further Reading:
Larson D., Analysis of variance with just summary statistics as input, The American Statistician, 46(2), 151-152, 1992.
Mean
0 oz 2 3 1 3 1 4 1 3 2 1 2.1
2 oz 3 2 1 4 2 3 1 5 1 2 2.4
4 oz 3 1 2 4 2 5 2 4 3 2 3.1
H0: µ1 = µ2 = µ3,
Using the ANOVA for Dependent Populations JavaScripts, we obtain the needed information
in constructing the following ANOVA table:
A"block design sampling" implies studying more than two dependent populations. For testing
the equality of means of more than two populations based on block design sampling, you may
use Two-Way ANOVA Test JavaScript. In the case of having block design data with
replications, use Two-Way ANOVA with Replications JavaScript to obtain the needed
information for constructing the ANOVA tables.
H0: P1 = P2 = ..... = Pk
That is, all three population proportions are almost identical. The sample data
from each of the three populations are given in the following table:
The Chi-square statistic is 8.95 with d.f. = (3-1)(3-1) = 4. The p-value is equal
to 0.062, indicating that there is moderate evidence against the null hypothesis
that the three populations are statistically identical.
For statistical equality of two populations, one may use the Kolmogorov-
Smirnov Test (K-S Test) for two populations. The K-S test seeks differences
between the two population's distribution function based on their two
independent random samples. The test rejects the null hypothesis of no
difference between the two populations if the difference between the two
empirical distribution functions is "large".
Prior to applying the K-S test it is necessary to arrange each of the two sample
observations in a frequency table. The frequency table must have a common
classification. Therefore the test is based on the frequency table, which
belongs to the family of distribution-free tests.
The manager of the first branch is claiming that"since the daily sales are
random phenomena, my overall performance is as good as the other manager's
performance." In other words:
H0: The daily sales at the two stores are almost the same.
Ha: The performance of the managers is significantly different.
Following the above process for this test, the K-S statistic is 0.421 with the p-
value of 0.0009, indicating a strong evidence against the null hypothesis.
There is enough evidence that the performance of the manager of the second
branch is better.
The variance is not the only thing for which you use a Chi-square test for.
One of the disadvantages of some of the Chi-square tests is that they do not
permit the calculation of confidence intervals; therefore, determination of the
sample size is not readily available.
By compiling the number of people in each category, you can ultimately test
whether drug usage is independent of cigarette smoking by using the Chi-
square distribution (this is approximate, but works well). Again, the
methodology for this is in your textbook. The degrees of freedom is equal to
(number of rows-1)(number of columns -1). That is, these many numbers
needed to fill in the entire body of the crosstable, the rest will be determined
by using the given row sums and the column sums values.
Do not forget the conditions for the validity of Chi-square test and related
expected values greater than 5 in 80% or more of the cells. Otherwise, one
could use an"exact" test, using either a permutation or resampling approach.
Under the hypothesis that there is no relation, the expected (E) frequency
would be:
Ei, j = (ri)(cj)/N
The Observed (O) and Expected (E) frequencies are recorded in the following
table:
The quantity
2 = [(O - E )2 / E]
with d.f. = (2-1)(3-1) = 2, that has the p-value of 0.14, suggesting little or no
real evidences against the null hypothesis.
The main question is how large is this measure. The maximum value of this
measure is:
2max = N(A-1),
The coefficient of determination which has a range of [0, 1], provides relative
strength of relationship, computed as
This statistic ranges between 0 and 1 and can be interpreted like the
correlation coefficient. This measure also indicates that the curriculum chosen
by students is related to the occupation of their parents.
You might like to use Chi-square Test for Crosstable Relationship in
performing this test, and he P-values for the Popular Distributions JavaScript
to findout the p-values of Chi-square statistic.
Further Readings:
Agresti A., Categorical Data Analysis, Wiley, 2002.
Fleiss J., Statistical Methods for Rates and Proportions, Wiley, 1981.
2 by 2 Crosstable Analysis
Using Chi-square in a 2x2 table requires the Yates's correction. One first
subtracts 0.5 from the absolute differences between observed and expected
frequencies for each of the three genotypes before squaring, dividing by the
expected frequency, and summing. The formula for the Chi-square value in a
2x2 table can be derived from the Normal Theory comparison of the two
proportions in the table using the total incidence to produce the standard
errors. The rationale of the correction is a better equivalence of the area under
the normal curve and the probabilities obtained from the discrete frequencies.
In other words, the simplest correction is to move the cut-off point for the
continuous distribution from the observed value of the discrete distribution to
midway between that and the next value in the direction of the null hypothesis
expectation. Therefore, the correction essentially only applied to one d.f. tests
where the"square root" of the Chi-square looks like a"normal/t-test" and where
a direction can be attached to the 0.5 addition.
Given the following 2x2 table, one may compute some relative risk measures:
a b
c d
The rate difference and rate ratio are appropriate when you are contrasting two
groups whose sizes (a+c and b+d) are given. The odds ratio is for when the
issue is association rather than difference.
The risk-ratio (RR) is the ratio of the proportion (a/(a+b)) to the proportion (c/
(c+d)):
RR = (a / (a + b)) / (c / (c + d))
RR is thus a measure of how much larger the proportion in the first row is
compared to the second. RR value of 1.00 indicating a 'negative' association
[a/(a+b) c/(c+d)], 1.00 indicating no association [a/(a+b) = c/(c+d)], and
1.00 indicating a 'positive' association [a/(a+b) c/(c+d)]. The further from
1.00 the RR is, the stronger the association.
Notice that the odds ratio (OR) is equal to the simple crossproduct ratio of a
2×2 table.
The OR can be written as: (a/b)/(c/d) which is the ratio of these two odds --
hence its name, the odds ratio. Both the numerator and denominator are odds.
For example, the numerator, a/b, gives the odds of a positive versus negative
rating by Rater 2 given that Rater 1's rating is positive. The denominator c/d
gives the odds of a positive versus negative rating by Rater 2 given that Rater
1's rating is negative.
Since the odds ratio is skewed, so we cannot easily compute a standard error
for the odds ratio itself. We can, however, find a standard error for the natural
logarithm of the odds ratio. It is simply:
Notice that, you need to compute the confidence interval on the log scale
and then transform the results back to the original scale of measurement.
We see that as any or all of the counts in the two by two table increase, the
confidence interval for the log odds ratio shrinks. Also, it turns out that the
smallest count in the 2 by 2 table plays the largest role in determining the size
of the standard error.
Test of homogeneity is much like the Test for Crosstable Relationship in that
both deal with the cross-classification of nominal data; that is, r c tables. The
method of computing Chi-square statistic is the same for both tests, with the
same d.f.
The two tests differ, however, in the following respect. The Test for
Crosstable Relationship is made on data drawn from a single population (with
fixed total) where one is concerned with whether one set of attributes is
independent of another set. The test for homogeneity, on the other hand, is
designed to test the null hypothesis that two or more random samples are
drawn from the same population or from different populations, according to
some criterion of classification applied to the samples.
The homogeneity test is concerned with the question: Are the samples drawn
form populations that are homogeneous (i.e., the same) with respect to some
criterion of classification?
In the crosstable for this test, either the row or the column categories may
represent the populations from which the samples are drawn.
14 9
7
5 4
The problem is not to determine whether or not the union members are in
favor of the change. The question is to test if there is a significant difference in
the proportions of opinion of the three populations' members concerning the
proposed change.
The Chi-square statistic is 9.58 with d.f. = (3-1)(3-1) = 4. The p-value is equal
to 0.048, indicating that there is moderate evidence against the null hypothesis
that the three union locals are the same.
You might like to use Populations Homogeneity Test to perfor this test.
Further Readings:
Agresti A., Categorical Data Analysis, Wiley, 2002.
Clark Ch., and L. Schkade, Statistical Analysis for Administrative Decisions, South-Western Pub., 1979.
Generally, the median provides a better measure of location than the mean
when there are some extremely large or small observations; i.e., when the data
are skewed to the right or to the left. For this reason, median income is used as
the measure of location for the U.S. household income.
H0: The public and private school teachers' salaries are almost the same.
The median of all data (i.e., combined) is 33.5. Now determine in each group
the number of observations falling above and below the common median of
33.5. The resulting frequencies are shown in the following table:
The Chi-square statistic based on this table is 2.33. The p-value for the
computed test statistic with d.f. = (2-1)(2-1) = 1 is 0.127, therefore, we are
unable to reject the null hypothesis.
There are other tests that might use the Chi-square, such as goodness-of-fit test
for discrete random variables. Therefore, Chi-square is a statistical test that
measures"goodness-of-fit". In other words, it measures how much the
observed or actual frequencies differ from the expected or predicted
frequencies. Using a Chi-square table will enable you to discover how
significant the difference is. A null hypothesis in the context of the Chi-square
test is the model that you use to calculate your expected or predicted values. If
the value you get from calculating the Chi-square statistic is sufficiently high
(as compared to the values in the Chi-square table), it tells you that your null
hypothesis is probably wrong.
p1 = P(Yi D1)
p2 = P(Yi D2)
pm = P(Yi Dm)
Since the union of the mutually exclusive intervals D1, D2,...., Dm is the set of
all possible values for the Yi's, (p1 + p2 + .... + pm) = 1. Define the set of
discrete random variables X1, X2, ...., Xm, where
:
:
and (X1+ X2+ .... + Xm) = n. Then the set of discrete random variables X1,
X2, ...., Xmwill have a multinomial probability distribution with parameters n
and the set of probabilities {p1, p2, ..., pm}. If the intervals D1, D2, ...., Dm are
chosen such that npi 5 for i = 1, 2, ..., m, then;
H0 : fY(y) = fo(y)
Ha : fY(y) fo(y)
is distributed as 2 m-1-r.
For this goodness-of-fit test, we formulate the null and alternative hypothesis
as
An Application: A die is thrown 300 times and the following frequencies are
observed. Test the hypothesis that the die is fair at level 0.05. Under the null
hypothesis that the die is fair, the expected frequencies are all equal to 300/6 =
50. Both the Observed (O) and Expected (E) frequencies are recorded in the
following table together with the random variable Y that represents the
number on each sides of the die:
The quantity
2 = [(O - E )2 / E] = 22.04
S (Ni - N)2/N
has a Chi-square distribution with d.f. = k-1. Where i is the count's number, Ni
is its counts, and N = Ni/k.
One may extend this useful test to where the duration of obtaining the ith count
is ti. Then the above test statistic becomes:
and has a Chi-square distribution with d.f. = k-1, where i is the count's
number, Ni is its counts, and N = Ni/ti.
Necessary conditions for the Chi-square based tests for crosstable data are:
However, since 2 n-1, 0.95 = 5.99, there is no reason to reject that there is no
difference, which is a very strange conclusion. What is wrong with this
application?
½ = [(n-1)S2] / 2
You might like to use Testing the Variance JavaScript to check your
computations.
Testing the Equality of Multi-Variances
The Bartlett test statistic is designed to test for equality of variances across
groups against the alternative that variances are unequal for at least two
groups. Formally,
In the above, Si2 is the variance of the ith group, ni is the sample size of the ith
group, k is the number of groups, and S2 is the pooled variance. The pooled
variance is a weighted average of the group variances and is defined as:
and
With an F = 4.38 and a p-value of 0.023, we reject the null at = 0.05. This is
not good news, since ANOVA, like the two-sample t-test, can go wrong when
the equality of variances condition is not met.
Further Readings:
Hand D., and C. Taylor, Multivariate Analysis of Variance and Repeated Measures, Chapman and Hall,
1987.
Miller R. Jr, Beyond ANOVA: Basics of Applied Statistics, Wiley, 1986.
2 = [(ni - 3).Zi2] - [(ni - 3).Zi]2 / [(ni - 3)], the sums are over all i = 1,
2, .., k.
The test statistic 2 has (k-1) degrees of freedom, where k is the number of
populations.
Using the above formula 2-statistic = 19.916, that has a p-value of 0.02.
Therefore, there is moderate evidence against the null hypothesis.
In such a case, one may omit a few outliers from the group, then use the Test
for Equality of Several Correlation Coefficients JavaScript. Repeat this
process until a possible homogeneous sub-group may emerge.
You might need to use Sample Size Determination JavaScript at the design
stage of your statistical investigation in decision making with specific
subjective requirements.
= x /n
This is just the mean of the x values.
= y /n
This is just the mean of the y values.
Sxx = SSxx = (x(i) - )2 = x2 - (x)2 / n
Syy = SSyy = (y(i) - )2 = y2 - (y) 2 / n
Sxy = SSxy = (x(i) - )(y(i) - ) = (x y) – (x) (y) / n
Slope m = SSxy / SSxx
Intercept, b = - m .
y-predicted = yhat(i) = mx(i) + b
Residual(i) = Error(i) = y – yhat(i)
SSE = Sres = SSres = SSerrors = [y(i) – yhat(i)]2 = SSyy – m SSxy
Standard deviation of residuals = s = Sres = Serrors = [SSres / (n-2)]1/2
Standard error of the slope (m) = Sres / SSxx1/2
Standard error of the intercept (b) = Sres[(SSxx + n. 2) /(n SSxx] 1/2
R2 = (SSyy - SSE) / SSyy
Based on our practical knowledge and the scattered diagram of the data, we
hypothesize a linear relationship between predictor X, and the cost Y.
Least Square Method: The best fit line results when there is the smallest value
for the sum of the squares of the deviations between y and yhat. Notice that if
you used regression of Y against X to estimate the slope, and the intercept the
estimated values would be very different to if using a regression of X against
Y.
Now the question is how we can best (i.e., least square) use the sample
information to estimate the unknown slope (m) and the intercept (b)? The first
step in finding the least square line is to construct a sum of squares table to
find the sums of x values (x), y values (y), the squares of the x values (x2),
the squares of the x values (y2), and the cross-product of the corresponding x
and y values (xy), as shown in the following table:
x y x2 xy y2
2 2 4 4 4
3 5 9 15 25
4 7 16 28 49
5 10 25 50 100
6 11 36 66 121
The second step is to substitute the values of x, y, x2, xy, and y2 into the
following formulas:
To estimate the intercept of the least square line, use the fact that the graph of
the least square line always pass through ( , ) point, therefore,
The intercept = b = – (m)( ) = (y)/ 5 – (2.3) (x/5) = 35/5 – (2.3)(20/5) =
-2.2
After estimating the slope and the intercept the question is how we determine
statistically if the model is good enough, say for prediction. The standard error
of slope is:
tslope = m / Sm.
which is large enough, indication that the fitted model is a"good" one.
You may ask, in what sense is the least squares line the"best-fitting" straight
line to 5 data points. The least squares criterion chooses the line that
minimizes the sum of square vertical deviations, i.e., residual = error = y -
yhat:
Notice that this value of SSE agrees with the value directly computed from the
above table. The numerical value of SSE gives the estimate of variation of the
errors s2:
s2 = SSE / (n -2) = 1.1 / (5 - 2) = 0.36667
The estimate the value of the error variance is a measure of variability of the y
values about the estimated line. Clearly, we could also compute the estimated
standard deviation s of the residuals by taking the square roots of the variance
s2.
As the last step in the model building, the following Analysis of Variance
(ANOVA) table is then constructed to assess the overall goodness-of-fit using
the F-statistics:
Sum of Mean
Source DF F Value Prob > F
Squares Square
For practical proposes, the fit is considered acceptable if the F-statistic is more
than five-times the F-value from the F distribution tables at the back of your
textbook. Note that, the criterion that the F-statistic must be more than five-
times the F-value from the F distribution tables is independent of the sample
size.
Notice also that there is a relationship between the two statistics that assess the
quality of the fitted line, namely the T-statistics of the slope and the F-
statistics in the ANOVA table. The relationship is:
t2slope = F
If sample size is large enough, say over 30 pairs of (x, y), then R2 has stronger
and more useful meaning. That is, the value of the R2 is the percentage of
variation in y that can be attributed to the variation in predictor x to predict y
by using the constructed linear model.
Confidence Region the Regression Line as the Whole: When the entire line
is of interest, a confidence region permits one to simultaneously make
confidence statements about estimates of Y for a number of values of the
predictor variable X. In order that region adequately covers the range of
interest of the predictor variable X; usually, data size must be more than 10
pairs of observations.
Yp Se { (2 F2, n-2, ) . [1/n + (X0 – )2/ Sx]}1/2
In all cases the JavaScript provides the results for the nominal (x) values. For
other values of X one may use computational methods directly, graphical
method, or using linear interpolations to obtain approximated results. These
approximation are in the safe directions i.e., they are slightly wider that the
exact values.
Many problems in analyzing data involve describing how variables are related.
The simplest of all models describing the relationship between two variables is
a linear, or straight-line, model. Linear regression is always linear in the
coefficients being estimated, not necessarily linear in the variables.
Know that a single summary statistic, like a correlation coefficient, does not
tell the whole story. A scatterplot is an essential complement to examining the
relationship between the two variables.
Again, the regression line is a group of estimates for the variable plotted on
the Y-axis. It has a form of y = b + mx, where m is the slope of the line. The
slope is the rise over run. If a line goes up 2 for each 1 it goes over, then its
slope is 2.
If you plug each x in the regression equation, then you obtain a predicted
value for y. The difference between the predicted y and the observed y is
called a residual, or an error term. Some errors are positive and some are
negative. The sum of squares of the errors plus the sum of squares of the
estimates add up to the sum of squares of Y:
Partitioning the Three Sum of Squares
Click on the image to enlarge it and THEN print it
The regression line is the line that minimizes the variance of the errors. The
mean error is zero; so, this means that it minimizes the sum of the squares
errors.
The reason for finding the best fitting line is so that you can make a reasonable
prediction of what y will be if x is known (not vise-versa).
In the usual regression modeling the estimated slope and intercept are
correlated; therefore, any error in estimating the slope influences the estimate
of the intercept. One of the main advantages of using the standardized data is
that the intercept is always equal to zero.
R2 = m yx . mxy
where
c = { (xi - xbar)2yi - n[(xi - xbar) 2yi]} / {n(xi - xbar) 4 - [(xi - xbar)2]
2
}
b = [(xi- xbar) yi]/[(xi - xbar)2] - 2cxbar
a = {yi - [c(x i - xbar) 2)}/n - (cxbarxbar + bxbar),
You might like to use Quadratic Regression JavaScript to check your hand
computation. For higher degrees than quadratic, you may like to use the
Polynomial Regressions JavaScript.
For small sample size, you may like to use the Multiple Linear Regression
JavaScript.
Structural Changes: When a regression model has been estimated using the
available data set, an additional data set may sometimes become available. To
test if previous model is still valid or the two separate models are equivalent or
not, one may use the analysis of covariance testing described on this site.
You might like to use the Regression Analysis JavaScript to check your
computations and to perform some numerical experimentation for a deeper
understanding of the concepts.
Further Reading:
Chatterjee S., B. Price, and A. Hadi, Regression Analysis by Example, Wiley, 1999.
When you have more than one regression equation based on data, to select
the "best model" you should compare:
1. R-squares: that is, the percentage of variation [in fact, the sum of squares] in Y
accounted for by the variation in X captured by the model.
2. When you want to compare models of different sizes (different numbers of
independent variables p and/or different sample sizes n), you must use the Adjusted
R-square, because the usual R-square tends to grow with the number of independent
variables (see the computational sketch below):
R²adj = 1 - (n - 1)(1 - R²)/(n - p - 1)
3. Standard deviation of error terms, i.e., observed y-value - predicted y-value for each
x.
4. Trends in errors as a function of control variable x. Systematic trends are not
uncommon.
5. The T-statistic of individual parameters.
6. The values of the parameters and their subject-matter (contextual) underpinnings.
7. Fdf1 df2 value for overall assessment. Where df1 (numerator degrees of freedom) is the
number of linearly independent predictors in the assumed model minus the number of
linearly independent predictors in the restricted model; i.e., the number of linearly
independent restrictions imposed on the assumed model, and df2 (denominator
degrees of freedom) is the number of observations minus the number of linearly
independent predictors in the assumed model.
The observed F-statistic should exceed not merely the selected critical value of the F-
table, but ideally at least four times that critical value.
Finally, in statistics for business there exists an opinion that with more than 4 parameters one
can fit an elephant, so that if one attempts to fit a regression function that depends on many
parameters, the result should not be regarded as very reliable.
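To make the adjusted R-square comparison concrete, here is a minimal sketch in Python; the R-square values, sample size, and numbers of predictors are hypothetical, for illustration only.

# Sketch: comparing candidate regression models by adjusted R-square.
# The values of r2, n and p below are hypothetical.

def adjusted_r2(r2, n, p):
    """Adjusted R-square: 1 - (n - 1)(1 - R^2)/(n - p - 1)."""
    return 1.0 - (n - 1) * (1.0 - r2) / (n - p - 1)

# Model A: 3 predictors; Model B: 5 predictors; both fitted to n = 30 cases.
print(adjusted_r2(0.78, 30, 3))   # about 0.755
print(adjusted_r2(0.81, 30, 5))   # about 0.770
# The plain R-square automatically favours the larger model; the adjusted
# values show whether the extra predictors are worth their degrees of freedom.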
Further Reading:
Draper N., and H. Smith, Applied Regression Analysis, Wiley, 1998.
Suppose that X and Y are two random variables for the outcome of a random experiment.
The covariance of X and Y is defined by
Cov (X, Y) = E{[X - E(X)][Y - E(Y)]}
and, given that the variances are strictly positive, the correlation of X and Y is defined by
ρ(X, Y) = Cov(X, Y) / [sd(X) · sd(Y)]
Correlation is a scaled version of covariance; note that the two parameters always have the
same sign (positive, negative, or 0). When the sign is positive, the variables are said to be
positively correlated; when the sign is negative, the variables are said to be negatively
correlated; and when it is 0, the variables are said to be uncorrelated.
Notice that the correlation between two random variables is often due only to the fact that
both variables are correlated with the same third variable.
As these terms suggest, covariance and correlation measure a certain kind of joint behavior in the
two variables. Correlation is very similar to the derivative of a function that you may have studied
in high school.
Properties: Some basic properties of covariance and correlation follow from the fact that
expected value is a linear operation. You might like to use this Applet to perform some
numerical experimentation with these properties.
There are measures that describe the degree to which two variables are linearly related. For
the majority of these measures, the correlation is expressed as a coefficient that ranges from
-1.00 to +1.00. A value of +1 indicates a perfect linear relationship, such that knowing the
value of one variable allows perfect prediction of the value of the related variable. A value
of 0 indicates no predictability by a linear model. Negative values indicate that,
when the value of one variable is higher than average, the other tends to be lower than average
(and vice versa); positive values indicate that, when the value of one variable is high, so is
the other (and vice versa).
Correlation is similar to the derivative you have learned in calculus (a deterministic course).
The Pearson's product correlation is an index of the linear relationship between two variables.
A positive relationship indicates that if an individual value of x is above the mean of x's, then
this individual x is likely to have a y value that is above the mean of y's, and vice versa. A
negative relationship would be an x score above the mean of x and a y score below the mean
of y. It is a measure of the relationship between variables and an index of the proportion of
individual differences in one variable that can be associated with the individual differences in
another variable.
Notice that the correlation coefficient is the mean of the cross-products of the standardized scores.
Therefore, if you have three values for r of 0.40, 0.60, and 0.80, you cannot say that the difference
between r = 0.40 and r = 0.60 is the same as the difference between r = 0.60 and r = 0.80, or
that r = 0.80 is twice as large as r = 0.40, because the scale of values for the correlation
coefficient is not interval or ratio, but ordinal. Therefore, all you can say is that, for example,
a correlation coefficient of +0.80 indicates a high positive linear relationship and a correlation
coefficient of +0.40 indicates a somewhat lower positive linear relationship.
The square of the correlation coefficient equals the proportion of the total variance in Y that
can be associated with the variance in x. It can tell us how much of the total variance of one
variable can be associated with the variance of another variable.
Note also that the correlation coefficient measures only linear association. If the data form a
parabola (symmetric about the mean of x), then the linear correlation of x and y will produce an
r-value near zero. So one must be careful and look at the data.
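The following minimal Python sketch illustrates the warning above: the data are constructed (hypothetically) to lie exactly on a parabola symmetric about the mean of x, and the Pearson r comes out essentially zero even though y is completely determined by x.

import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2                     # an exact, but non-linear, relationship

r = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient
print(round(r, 6))             # essentially 0 -- so always look at the scatterplot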
The standard statistic for the hypothesis test H0: ρ = 0 is based on Fisher's normal
transformation, z = (1/2)Ln[(1 + r)/(1 - r)], which under H0 is approximately Normal with
standard deviation 1/(n - 3)½.
Alternatively, one may use t = r(n - 2)½/(1 - r²)½, referred to the t-table with n - 2 degrees of freedom.
You might like to use this calculator for your needed computation. You may perform Testing
the Population Correlation Coefficient .
In the Spearman case, the X(i)'s and Y(i)'s are ranks, and so the sums of the ranks, and the
sums of the squared ranks, are entirely determined by the number of cases (when there are no ties).
The Pearson formula applied to the ranks, where P is the sum of the products of each pair of
ranks X(i)Y(i), reduces to:
rs = 1 - 6Σd²/[n(n² - 1)],
where d is the difference in rank between each x(i) and y(i) pair.
An important consequence of this is that if you enter ranks into a Pearson formula, you get
precisely the same numerical value as that obtained by entering the ranks into the Spearman
formula. This comes as a bit of a shock to those who like to adopt simplistic slogans, such
as "Pearson is for interval data, Spearman is for ranked data". Spearman does not work too well
if there are many tied ranks, because the formula for calculating the sums of squared
ranks no longer holds true. If one has many tied ranks, use the Pearson formula.
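As a quick check of the claim that Pearson applied to ranks equals Spearman (when there are no ties), here is a small Python sketch using scipy; the data values are made up for illustration.

import numpy as np
from scipy import stats

x = np.array([12.0, 7.5, 9.1, 15.2, 3.3, 8.8])
y = np.array([2.1, 1.4, 1.9, 3.0, 0.7, 1.2])

rx = stats.rankdata(x)           # ranks of x (no ties in this example)
ry = stats.rankdata(y)           # ranks of y

pearson_on_ranks = stats.pearsonr(rx, ry)[0]
spearman = stats.spearmanr(x, y)[0]
print(pearson_on_ranks, spearman)   # the two values coincide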
This interpretation is widely accepted, and many scientific journals routinely publish papers
using this interpretation for the estimated result, and even for the test of hypothesis.
Point-Biserial Correlation is used when one random variable is binary (0, 1) and the other is
a continuous random variable; the strength of relationship is measured by the point-biserial
correlation:
r = (X1 - X0)[pq/S²]½,
where X1 and X0 are the means of the scores having the values 1 and 0, and p and q are their
proportions, respectively. S² is the sample variance of the continuous random variable. This
is a simplified version of the Pearson correlation for the case when one of the two random
variables is a (0, 1) nominal random variable.
Note also that r is invariant under positive linear transformations: ax + c and by + d
have the same r as x and y, for any positive a and b.
It is intuitive that with very few data points, a high correlation may not be statistically
significant. You may see statements such as, "correlation is significant between x and y at the
α = 0.005 level" and "correlation is significant at the α = 0.05 level." The question is: how do
you determine these numbers?
The test statistic may be written as F = r²(n - 2)/(1 - r²), with 1 and n - 2 degrees of freedom; as
you can see, this F-statistic is a monotonic function of both r² and the sample size n.
Notice that the test for the statistical significance of a correlation coefficient requires that the
two variables be distributed as a bivariate normal.
In the sense that it is used in statistics; i.e., as an assumption in applying a statistical test; a
random sample from the entire population provides a set of random variables X1,...., Xn, that
are identically distributed and mutually independent. Mutually independent is stronger than
pairwise independence. The random variables are mutually independent if their joint
distribution is equal to the product of their marginal distributions.
In the case of joint normality, independence is equivalent to zero correlation, but not in
general. Independence will imply zero correlation but not conversely. Note that not all random
variables have a first moment, let alone a second moment, and hence there may not be a
correlation coefficient.
However; if the correlation coefficient of two random variables is not zero then the random
variables are not independent.
Given that two populations have normal distributions, we wish to test the following null
hypothesis regarding the equality of their correlation coefficients, H0: ρ1 = ρ2, using the statistic
Z = (z1 - z2) / [1/(n1 - 3) + 1/(n2 - 3)]½,
where zi = (1/2)Ln[(1 + ri)/(1 - ri)] is the Fisher transformation of ri, n1 = sample size
associated with r1, and n2 = sample size associated with r2.
The distribution of the Z-statistic is the standard Normal(0, 1); therefore, you may reject H0 if
|Z| ≥ 1.96 at the 95% confidence level (α = 0.05).
An Application: Suppose r1 = 0.47 and r2 = 0.63 are obtained from two independent random
samples of size n1 = 103 and n2 = 103, respectively. Therefore, z1 = 0.510 and z2 = 0.741,
with Z-statistic:
Z = (0.510 - 0.741) / (1/100 + 1/100)½ ≈ -1.64.
This result is not within the rejection region of the two-tailed critical values at α = 0.05,
therefore it is not significant. There is not sufficient evidence to reject the null
hypothesis that the two correlation coefficients are equal.
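A minimal Python sketch of this Fisher-z comparison, reproducing the application above (r1 = 0.47, r2 = 0.63, n1 = n2 = 103):

import math

def fisher_z(r):
    return 0.5 * math.log((1 + r) / (1 - r))

r1, n1 = 0.47, 103
r2, n2 = 0.63, 103

z1, z2 = fisher_z(r1), fisher_z(r2)
Z = (z1 - z2) / math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))

print(round(z1, 3), round(z2, 3), round(Z, 2))   # about 0.51, 0.741 and -1.64
# |Z| < 1.96, so H0 is not rejected at the 5% level, as stated above.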
Clearly, this test can be modified and applied for a test of hypothesis regarding a population
correlation based on an observed r obtained from a random sample of size n.
To compare two correlations that share a common covariate X, one may use a t-statistic
with n - 3 degrees of freedom, where n is the number of observed triples, provided none of the
correlations is equal to 1 in absolute value (a computational sketch is given after the numerical
example below).
Numerical example: Suppose n = 87, rxy = 0.631, rxz = 0.428, and ryz = 0.683; then the t-statistic
is equal to 3.014, with a p-value equal to 0.002, indicating strong evidence against the null
hypothesis.
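The source does not print the exact formula it uses, but one standard choice for comparing two correlations that share a common variable is Hotelling's t-statistic with n - 3 degrees of freedom; the Python sketch below reproduces the quoted figures to within rounding.

import math

n, rxy, rxz, ryz = 87, 0.631, 0.428, 0.683

# Hotelling's t for comparing r(X,Y) with r(X,Z), sharing the common variable X.
detR = 1 - rxy**2 - rxz**2 - ryz**2 + 2 * rxy * rxz * ryz
t = (rxy - rxz) * math.sqrt((n - 3) * (1 + ryz) / (2 * detR))
print(round(t, 3))   # about 3.01, with n - 3 = 84 degrees of freedom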
Adjusted R²: In a model-selection process based on R² values, it is often necessary and
meaningful to adjust the R²'s for their degrees of freedom. Each adjusted R² is calculated by
R²adj = 1 - (n - 1)(1 - R²)/(n - p - 1), as given earlier.
You might like to use the Testing the Population Correlation Coefficient JavaScript in
performing some numerical experimentation for validating and a deeper understanding of the
concepts.
Almost all models of reality, including regression models, have assumptions that must be
verified in order that the model has power to test hypotheses and for it to be able to predict
accurately.
The following is the list of basic assumptions (i.e., conditions) and the tools to check these
necessary conditions.
1. Any undetected outliers may have a major impact on the regression model. Outliers are
a few observations that are not well fitted by the "best" available model. In such a case,
one must first investigate the source of the data; if there is doubt about the accuracy or
veracity of an observation, then it should be removed and the model should be
refitted.
You might like to use the Determination of the Outliers JavaScript to perform some
numerical experimentation for validating and for a deeper understanding of the
concepts.
2. The dependent variable Y is a linear function of the independent variable X. This can
be checked by carefully examining all the points in the scatter diagram and seeing if it is
possible to bound them all within two parallel lines. You may also use the Detective
Testing for Trend to check this condition; see the numerical example for the details.
Figure: A Typical Scatter-diagram for a Linear Model
3. The distribution of the residual must be normal. You may check this condition by
using the Lilliefors Test for Normality.
4. The residuals should have a mean equal to zero, and a constant standard deviation
(i.e., homoskedastic condition). You may check this condition by dividing the
residuals data into two or more groups; this approach is known as the Goldfeld-
Quandt test. You may use the Stationary Testing Process to check this condition.
5. The residuals constitute a set of random variables. You may use the Test for
Randomness and Test for Randomness of Fluctuations to check this condition.
6. The Durbin-Watson (D-W) statistic quantifies the serial correlation of the least-squares errors
in their original form. The D-W statistic is defined by:
D-W = Σ(ej - ej-1)² / Σej²,
where ej is the jth residual (a small computational sketch follows this list). D-W takes
values within [0, 4]. For no serial correlation, a value close to 2 is expected. With
positive serial correlation, adjacent deviates tend to have the same sign, so D-W
becomes less than 2; whereas, with negative serial correlation, the signs of the errors
alternate, and D-W takes values larger than 2. For a least-squares fit where the value of
D-W is significantly different from 2, the estimates of the variances and covariances of
the parameters (i.e., coefficients) can be in error, being either too large or too small.
Serial correlation of the deviates also arises in time series analysis and forecasting.
You may use the Measuring for Accuracy JavaScript to check this condition.
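A minimal Python sketch of the Durbin-Watson computation; the residual values below are illustrative only.

import numpy as np

e = np.array([0.8, -0.3, 0.5, -0.9, 0.2, 0.6, -0.4, -0.1, 0.7, -1.1])  # residuals

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # sum of squared successive differences / sum of squares
print(round(dw, 3))   # values near 2 suggest no serial correlation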
The "good" regression equation candidate is further analyzed using a plot of the residuals
versus the independent variable(s). If any patterns are seen in the graph, e.g., an indication of
non-constant variance, then there is a need for a data transformation, such as a logarithmic,
square-root, or reciprocal transformation.
You might like to use the Regression Analysis with Diagnostic Tools JavaScript to check
your computations, and to perform some numerical experimentation for a deeper
understanding of the concepts.
We wish to test the following hypothesis on the two means of the dependent variables
Y1 and Y2:
H0: The difference between the two means is about a given value M.
Ha: The difference between the two means is quite different from the claimed value.
Since we are dealing with dependent variables, it's natural to investigate the linear regression
coefficients of the two samples; namely, the slopes and the intercepts.
Suppose we are interested in testing the equality of two slopes. In other words, we wish to
determine if two given lines are statistically parallel. Let m1 represent the regression
coefficient for explanatory variable X1 in sample 1 with size n1. Let m2 represent the
regression coefficient for X2 in sample 2 with size n2. The difference between the two
estimated slopes has the following variance:
V = Var[m1 - m2] = [(n1 - 2)Sres1² + (n2 - 2)Sres2²](Sxx1 + Sxx2) / [(n1 + n2 - 4) Sxx1 Sxx2].
Then, the quantity:
(m1 - m2) / V½
has a t-distribution with d.f. = n1 + n2 - 4.
This test and its generalization to comparing more than two slopes is called the Analysis of
Covariance (ANOCOV). The ANOCOV test proceeds as in the ANOVA test; however,
there is an additional variable, called the covariate. ANOCOV enables us to conduct and to
extend the before-and-after test for two different populations. The process is as follows:
1. Find a linear model for (X1, Y1) = (before1, after1), and one for (X2, Y2) =
(before2, after2) that fit best.
2. Perform the test of the hypothesis m1 = m2.
3. If the test result indicates that the slopes are almost equal, then compute the
common slope of the two parallel regression lines as the pooled slope
mc = (Sxy1 + Sxy2) / (Sxx1 + Sxx2).
4. Now, perform the test for the difference between the two intercepts, which
is the vertical distance between the two parallel lines.
Depending on the outcome of the last test, one may reject the null hypothesis.
For our numerical example, using the Analysis of Covariance JavaScripts, we obtained the
following statistics:
Slope 1 = 1.3513514; its standard error = 0.2587641
Slope 2 = 1.4883721; its standard error = 1.0793906
These indicate that there is no evidence against equality of the slopes. Now, we may test for
any differences in the intercepts. Suppose we wish to test the null hypothesis that the vertical
distance between the two parallel lines is about 4 units.
Using the second function in the Analysis of Covariance JavaScript, we obtained the
statistics: Common Slope = 1.425, Intercept = 5.655, providing moderate evidence against
the null hypothesis.
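As a rough numerical check of the equal-slopes conclusion, one can treat the two slope estimates as independent and use their reported standard errors; note that this is only an approximation to the pooled-variance t-test described above.

import math

m1, se1 = 1.3513514, 0.2587641    # slope and standard error, sample 1
m2, se2 = 1.4883721, 1.0793906    # slope and standard error, sample 2

t_approx = (m1 - m2) / math.sqrt(se1**2 + se2**2)
print(round(t_approx, 2))   # about -0.12: no evidence against equal slopes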
Further Reading:
Wall F., Statistical Data Analysis Handbook, McGraw-Hill, New York, 1986.
Residential Properties Appraisal Application
The market value assessment of a set of selected houses involves performing an assessment
by a few individual appraisers for each property and then computing an average obtained
from the few individuals.
Individual appraisal refers to the process of estimating the exchange value of a house on the
basis of a direct comparison between its profiles and the profiles of a set of other comparable
properties sold on acceptable conditions. The profile of a property consists of all the relevant
attributes of each house, such as the location, size, gross living space, age, one-story, two-
story or more, garage, swimming pool, basement, etc. Data on prices and characteristics of
individual houses are available; e.g., from the U.S. Bureau of the Census.
Often regression analysis is used to determine what characteristics influence the price of the
houses. Thus it is important to correct the subjective elements in the appraisal value before
carrying out the regression analysis. Coefficients that are not significantly different from zero
as indicated by insignificant t-statistics at a 5% level are dropped from the regression model.
There are several practical questions to be answered before the actual data collection can take
place.
The first step is to use statistical techniques, such as geographic clustering, to define
homogeneous groupings of houses within an urban area.
How many houses should we look at? Ideally, one would collect information on as many
houses as time and money allow. It is these practical considerations that make statistics so
useful. Hardly anyone could spend the time, money, and effort needed to look at every house
for sale. It is unrealistic to obtain information on every house of interest, or in statistical
terms, on every item in the population. Thus, we can look only at a sample of houses -- a
subset of the population -- and hope that this sample will give us reasonably accurate
information about the population. Let us say we can afford to look at 16 houses.
We would probably choose to select a simple random sample -- that is, a sample in which,
roughly speaking, every house in the population has an equal chance of being included. Then we
would expect to get a reasonably representative sample of houses throughout this selected
size range, reflecting prices for the whole neighborhood. This sample should give us some
information about all houses of all sizes within this range, since a simple random sample
tends to select as many larger houses as smaller houses, and as many expensive as less
expensive ones.
Suppose that the 16 houses in our random sample have the sizes and prices shown in the
following table. Since the 16 houses are randomly selected, the variables Y, X1, and X2 are
random variables. We have no control over them and cannot know in advance what specific
values will be observed; chance alone determines them.
What can we tell about the relationship between size and price from our sample? Reading the
data from the above table row-wise, and entering them in the Regression Analysis with
Diagnostic Tools JavaScript, we found the following simple regression model:
Now consider the problem of estimating the price (Y) of a house from knowing its size (X1)
and also its age (X2). The sizes and prices will be the same as in the simple regression
problem. What we have done is add ages of houses to the existing data. Note carefully that in
real life, one would not first go out and collect data on sizes and prices and then analyze the
simple regression problem. Rather, one would collect all data, which might be pertinent on all
sixteen houses at the outset. Then the analysis performed would throw out predictors which
turn out not to be needed.
The objectives in a multiple regression problem are essentially the same as for a simple
regression. While the objectives remain the same, the more predictors we have, the more
complicated the calculations and interpretations become. For large data sets one may use the
multiple regression module of any statistical package, such as SAS or SPSS. Using the
Multiple Linear Regression JavaScript, for our numerical example with X1 = Size, X2 = Age,
and Y = Price, we obtain the following statistical model:
The following is the list of basic assumptions (i.e., conditions) and the tools to check these
necessary conditions.
1. Any undetected outliers may have major impact on the regression model. Using the
Determination of the Outliers JavaScript we found that there is no outlier in the above
data set.
2. The dependent variable Price is a linear function of the independent variable Size. By
carefully examining the scatter diagram we found that the linearity condition is
satisfied.
3. The distribution of the residual must be normal. Reading the data from the above table
row-wise, and entering them in the Regression Analysis with Diagnostic Tools
JavaScript, we found that the normality condition is also satisfied.
4. The residuals should have a mean equal to zero, and a constant standard deviation
(i.e., homoskedastic condition). By the Regression Analysis with Diagnostic Tools
JavaScript, the results are satisfactory.
5. The residuals constitute a set of random variables. Persistent non-randomness in
the residuals would violate the best linear unbiased estimator condition. However, since the
numerical statistics corresponding to the residuals, obtained by using the Regression
Analysis with Diagnostic Tools JavaScript, are not significant, our ordinary
least-squares regression is adequate for our analysis.
6. Durbin-Watson (D-W) statistic quantifies the serial correlation of least-squares errors
in its original form. D-W statistic for this model is 1.995, which is good enough in
rejecting any serial correlation.
7. More useful statistics for the model: The standard errors for the slope and the
intercept are 0.881 and 1.916, respectively, which are small enough. The F-statistic is
213.599, which is large enough, indicating that the model is good enough overall for
prediction purposes.
Notice that since the above analysis is performed on a specific set of data, as always, one
must be careful in generalizing its findings. For example, one may ask, Is the aim prediction,
or is the interest in interpretation of individual regression coefficients? In the latter case,
inferences that condition on "other things being constant" will not be valid unless all other
relevant variables are included in the regression equation. Even if their effect is not
"significant" they have to be included, unless it can be shown that their exclusion makes little
difference to the values of other coefficients. Regression is not at all robust against departures
from the assumption that data have been randomly sampled from the population that is of
interest. This is an issue for all observational data.
Monte Carlo simulations of the importance of these conditions demonstrate that, for
linear regression, the Normality assumption on the residuals is not all that crucial. Lack of
Normality is moderated by sample size (depending on the number of independent variables), so
the bigger the sample size, the more non-Normality you can tolerate. However, the independence
assumption on the error terms and the constancy of their variance are very important. Any large
error in the independent variables also has a big effect.
Further Reading:
Lovell R., and N. French, Estimated realization price: What do the banks want and what can be realistically provided?,
Journal of Property Finance, 6, 7-16, 1995.
Newsome B.A., and J. Zeitz, Adjusting comparable sales using multiple regression analysis -- the need for
segmentation, The Appraisal Journal, 8, 129-135, 1992.
As you will see, although one would hope that all tests give the same results, this is not
always the case. It all depends on how informative the data are and to what extent they have
been condensed before being presented to you for analysis. The following sections
illustrate how much useful information differently condensed data provide and how they may
lead to opposite conclusions, if one is not careful enough.
One of the main advantages of constructing a confidence interval (CI) is to provide a degree
of confidence for the point estimate for the population parameter. Moreover, one may utilize
a CI for hypothesis-testing purposes. Suppose you wish to test the following general test of
hypothesis:
H0: The population parameter is about the claimed value.
Ha: The population parameter is quite different from the claimed value.
The process of executing the above test of hypothesis at the α level of significance using a CI is
as follows:
1. Ignore the claimed value in the null hypothesis, for the time being.
2. Construct a 100(1 - α)% confidence interval based on the available data.
3. If the constructed CI does not contain the claimed value, then there is enough
evidence to reject the null hypothesis; otherwise, there is no reason to reject
the null hypothesis.
You might like to use the Hypothesis Testing with Confidence JavaScript to perform some
numerical experimentation for validating the above assertions and for a deeper understanding.
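Here is a minimal Python sketch of the CI-based testing procedure for a population mean; the claimed value and the sample data are hypothetical, and a t-based interval is used since the population variance is unknown.

import numpy as np
from scipy import stats

claimed = 50.0
sample = np.array([48.2, 52.5, 49.1, 51.7, 47.9, 50.8, 53.2, 49.5])

n = len(sample)
mean, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)            # two-sided 95% interval
low, high = mean - t_crit * se, mean + t_crit * se

print(round(low, 2), round(high, 2))
# If the claimed value 50.0 lies inside the interval, there is no reason to
# reject the null hypothesis at the 5% level; otherwise, reject it.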
This suggests a negative relationship: as people get older, they have lower income, on
average. Although the slope is small, it cannot be considered zero, since the t-statistic for it is
-0.70, which is significant.
Now suppose you have only the following secondary data, where the original data have been
condensed:
One may use ANOVA to test that there is no relationship between age and income.
Performing the analysis gives an F-statistic equal to 3.87, which is significant; i.e., we reject
the hypothesis of no difference in population average income for the three age
groups.
Now, suppose more condensed secondary data are provided as in the following table:
One may use the Chi-square test for the null hypothesis that age and income are unrelated.
The Chi-square statistic is 1.70, which is not significant; therefore there is no reason to
believe income and age are related! But of course, these data are over-condensed, because
when all data in the sample were used, there was an observable relationship.
There are very direct relationships among linear regression, analysis of variance, t-test and
the coefficient of determination. The following small data set is for illustrating the
connections among the above statistical procedures, and therefore relationships among
statistical tables:
X1 4 5 4 6 7 7 8 9 9 11
X2 8 6 8 10 10 11 13 14 14 16
Suppose we apply the t-test. The statistic is t = 3.207, with d.f. = 18. The p-value is 0.003,
indicating very strong evidence against the null hypothesis.
Now, by introducing a dummy variable x with two values, say 0 and 1, representing the two
data sets, respectively, we are able to apply regression analysis:
x 0 0 0 0 0 0 0 0 0 0
y 4 5 4 6 7 7 8 9 9 11
x 1 1 1 1 1 1 1 1 1 1
y 8 6 8 10 10 11 13 14 14 16
Among other statistics, we obtain a large slope m = 4 ≠ 0, indicating rejection of the
null hypothesis. Notice that the t-statistic for the slope is: t-statistic = slope/(its standard
error) = 4/1.2472191 = 3.207, which is the t-statistic we obtained from the t-test. In general,
the square of the t-statistic of the slope is the F-statistic in the ANOVA table; i.e.,
tm² = F-statistic
Moreover, the coefficient of determination is r² = 0.36, which is always obtainable from the
t-test as follows:
r² = t² / (t² + d.f.).
For our numerical example, r² = (3.207)² / [(3.207)² + 18] = 0.36, as expected.
Now, applying ANOVA to the two sets of data, we obtain the F-statistic = 10.285, with d.f.1
= 1 and d.f.2 = 18. The F-statistic is large enough; therefore, one must reject the null
hypothesis. Note that, in general,
Fα(1, n) = (tα/2, n)².
For our numerical example, F = t² = (3.207)² = 10.285, as expected.
As expected, by just looking at the data, all three tests indicate strongly that the means of the
two sets are quite different.
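The links among the three procedures can be verified directly; the following Python sketch uses the two samples given above, together with scipy's pooled-variance t-test and one-way ANOVA.

import numpy as np
from scipy import stats

y1 = np.array([4, 5, 4, 6, 7, 7, 8, 9, 9, 11], dtype=float)
y2 = np.array([8, 6, 8, 10, 10, 11, 13, 14, 14, 16], dtype=float)

t, p = stats.ttest_ind(y1, y2)      # pooled-variance two-sample t-test
F, pF = stats.f_oneway(y1, y2)      # one-way ANOVA on the same data
df = len(y1) + len(y2) - 2

print(round(t, 3), round(F, 3))     # close to the 3.207 and 10.285 quoted above
print(round(t**2, 3))               # t-squared equals F
print(round(t**2 / (t**2 + df), 2)) # r-square, about 0.36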
Particular attention must be paid to a first course in statistics. When I first began studying
statistics, it bothered me that there were different tables for different tests. It took me a while
to learn that this is not as haphazard as it appeared. Binomial, Normal, Chi-square, t, and F
distributions that you will learn are actually closely connected.
A problem with elementary statistical textbooks is that they usually do not provide these
conceptual links, which would permit a useful understanding of the principles involved. If you
want to understand connections between
statistical concepts, then you should practice making these connections. Learning by doing
statistics lends itself to active rather than passive learning. Statistics is a highly interrelated
set of concepts, and to be successful at it, you must learn to make these links conscious in
your mind.
Students often ask: Why are t-table values with d.f. = 1 much larger than those for other
d.f. values? Some tables are limited; what should I do when the sample size is too large?
How can I become familiar with the tables and their differences? Is there any kind of
integration among the tables? Is there any connection between tests of hypotheses and
confidence intervals under different scenarios, for example, testing with respect to one, two,
or more than two populations, and so on?
The following Figure demonstrates useful relationships among distributions and a unification
of statistical tables:
Useful Relationships Among Common Density Functions
For example, the following are some nice connections between major tables:
Standard normal z and F-statistics: F = z2, where F has (d.f.1 = 1, and d.f.2 is the
largest available in the F-table)
T- statistic and F-statistic: F = t2, where F has (d.f.1 = 1, and d.f.2 = d.f. of the t-table)
Chi-square and F-statistics: F = Chi-square/d.f.1, where F has (d.f.1 = d.f. of the Chi-
square-table, and d.f.2 is the largest available in the F-table)
T-statistic and Chi-square: (Chi-square)½ = t, where the Chi-square has d.f. = 1, and t has
d.f. = ∞.
Standard normal z and T-statistic: z = t, where t has d.f. = ∞.
Standard normal z and Chi-square: (2 Chi-square)½ - (2 d.f. - 1)½ = z, where d.f. is the
largest available in the Chi-square table.
Standard normal z, Chi-square, and T-statistic: z/[Chi-square/n]½ = t, with d.f. = n.
F-statistic and its inverse: F1-α(n1, n2) = 1/Fα(n2, n1); therefore it is only necessary to
tabulate, say, the upper-tail critical values.
Correlation coefficient r and T-statistic: t = [r(n-2)½]/[1 - r2]½.
You may like using the statistical tables at the back of your book and/or P-values JavaScript
in performing some numerical experimentation for validating the above relationships for a
deeper understanding of the concepts. You might need to use a scientific calculator, too.
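Instead of printed tables, the quantile functions in scipy.stats can be used to check a few of the listed relationships numerically; the significance level and degrees of freedom below are arbitrary choices.

from scipy import stats

alpha, n = 0.05, 15

t_crit = stats.t.ppf(1 - alpha / 2, df=n)        # two-tailed t critical value
F_crit = stats.f.ppf(1 - alpha, dfn=1, dfd=n)    # upper-tail F critical value
print(round(t_crit**2, 4), round(F_crit, 4))     # F(1, n) equals t-squared

z_crit = stats.norm.ppf(1 - alpha / 2)
t_inf = stats.t.ppf(1 - alpha / 2, df=10**7)     # t with huge d.f. approximates z
print(round(z_crit, 4), round(t_inf, 4))

# F(n1, n2) upper quantile is the reciprocal of the corresponding lower quantile of F(n2, n1)
print(stats.f.ppf(0.95, 5, 8), 1 / stats.f.ppf(0.05, 8, 5))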
Further Reading:
Kagan A., What students can learn from tables of basic distributions, Int. Journal of Mathematical Education in Science
and Technology, 30(6), 1999.
The primary purposes of an index number are to provide a value useful for comparing
magnitudes of aggregates of related variables to each other, and to measure the changes in
these magnitudes over time. Consequently, many different index numbers have been
developed for special use. There are a number of particularly well-known ones, some of
which are announced on public media every day. Government agencies often report time
series data in the form of index numbers. For example, the consumer price index is an
important economic indicator. Therefore, it is useful to understand how index numbers are
constructed and how to interpret them. These index numbers are developed usually starting
with base 100 that indicates a change in magnitude relative to its value at a specified point in
time.
For example, in determining the cost of living, the Bureau of Labor Statistics (BLS) first
identifies a "market basket" of goods and services the typical consumer buys. Annually, the
BLS surveys consumers to determine what they buy and the overall cost of the goods and
services they buy: What, where, and how much. The Consumer Price Index (CPI) is used to
monitor changes in the cost of living (i.e. the selected market basket) over time. When the
CPI rises, the typical family has to spend more dollars to maintain the same standard of
living. The goal of the CPI is to measure changes in the cost of living. It reports the
movement of prices, not in dollar amounts, but with an index number.
The simplest and widely used measure of inflation is the Consumer Price Index (CPI). To
compute the price index, the cost of the market basket in any period is divided by the cost of
the market basket in the base period, and the result is multiplied by 100.
If you want to forecast the economic future, you can do so without knowing anything about
how the economy works. Further, your forecasts may turn out to be as good as those of
professional economists. The key to your success will be the Leading Indicators, an index of
items that generally swing up or down before the economy as a whole does.
Items       Period 1                    Period 2
            Quantity (q1)  Price (p1)   Quantity (q2)  Price (p2)
Apples      10             $0.20        8              $0.25
Oranges     9              $0.25        11             $0.21
We found that, using the period-1 quantities, the price index in period 2 is
100 × (10 × 0.25 + 9 × 0.21) / (10 × 0.20 + 9 × 0.25) = 100 × 4.39/4.25 ≈ 103.3.
A better price index can be found by taking the geometric mean of the two indexes (the one
based on period-1 quantities and the one based on period-2 quantities). To find the geometric
mean, multiply the two together and then take the square root. The result is called the
Fisher Index.
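A minimal Python sketch of these computations for the two-item table above; "Laspeyres" and "Paasche" are the usual names for the indexes based on period-1 and period-2 quantities, respectively, and the Fisher index is their geometric mean.

q1 = {"apples": 10, "oranges": 9}
p1 = {"apples": 0.20, "oranges": 0.25}
q2 = {"apples": 8, "oranges": 11}
p2 = {"apples": 0.25, "oranges": 0.21}

laspeyres = sum(p2[i] * q1[i] for i in q1) / sum(p1[i] * q1[i] for i in q1) * 100
paasche = sum(p2[i] * q2[i] for i in q2) / sum(p1[i] * q2[i] for i in q2) * 100
fisher = (laspeyres * paasche) ** 0.5

print(round(laspeyres, 1), round(paasche, 1), round(fisher, 1))
# roughly 103.3, 99.1 and 101.2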
In the USA, since January 1999, the geometric mean formula has been used to calculate most
basic indexes within the Consumer Price Index (CPI); in other words, the prices within
most item categories (e.g., apples) are averaged using a geometric mean formula. This
improvement moves the CPI somewhat closer to a cost-of-living measure, as the geometric
mean formula allows for a modest amount of consumer substitution as relative prices within
item categories change.
Notice that, since the geometric mean formula is used only to average prices within item
categories, it does not account for consumer substitution taking place between item
categories. For example, if the price of pork increases compared to those of other meats,
shoppers might shift their purchases away from pork to beef, poultry, or fish. The CPI
formula does not reflect this type of consumer response to changing relative prices.
The following provides the computational procedures with applications for some Index
numbers, including the Ratio Index, and Composite Index numbers.
Suppose we are interested in the labor utilization of two manufacturing plants A and B with
the unit outputs and man/hours, as shown in the following table, together with the national
standard over the last three months:
The labor utilization for the Plant A in the first month is:
Upon computing the labor utilization for both plants for each month, one can present the
results by graphing the labor utilization over time for comparative studies.
You might like to use the Index Numbers JavaScript to check your hand computation.
Consider the total labor, and material cost for two consecutive years for an industrial plant, as
shown in the following table:
Item          Units Needed    Year 2000: Unit Cost   Total    Year 2001: Unit Cost   Total
Labor         20              10                     200      11                     220
Aluminum      2               100                    200      110                    220
Electricity   2               50                     100      60                     120
Total                                                500                             560
From the information given in the above table, the indexes for the two consecutive years are
500/500 = 1 and 560/500 = 1.12, respectively.
Further Readings:
Watson C., P. Billingsley, D. Croft, and D. Huntsberger, Statistics for Management and Economics, Allyn & Bacon, Inc.,
1993.
A commonly used index of variation for measuring and comparing nominal and ordinal data is
called the index of dispersion:
D = k(N² - Σfi²) / [N²(k - 1)],
where k is the number of categories, fi is the number of ratings in each category, and N is the
total number of ratings. D is a number between zero and 1: it equals zero if all ratings fall into
one category, and 1 if the ratings are equally divided among the k categories.
Category Frequency
A 25
B 42
C 8
D 13
E 12
Therefore the dispersion index is D = 5(100² - 2766)/[100²(4)] = 0.904, indicating a good
spread of scores across the categories.
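The hand computation above can be checked with a few lines of Python:

freqs = [25, 42, 8, 13, 12]          # frequencies for categories A through E

k = len(freqs)
N = sum(freqs)
D = k * (N**2 - sum(f**2 for f in freqs)) / (N**2 * (k - 1))
print(round(D, 3))                   # 0.904, a fairly even spread across the categories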
You might like to use the Index Numbers JavaScript to check your hand computation.
Is a given city an economically depressed area? The degree of unemployment among the labor
force (L) is considered to be a proper indicator of economic depression. To construct the
unemployment index, each person is classified both with respect to membership in the labor
force and the degree of unemployment in fractional value, ranging from 0 to 1. The fraction
that indicates the portion of labor that is idle is:
L = Σ(UiPi) / ΣPi, where the sums are over all i = 1, 2, …, n,
Pi is the proportion of a full workweek for which resident i of the area held or sought
employment, n is the total number of residents in the area, and Ui is the proportion of Pi for
which resident i was unemployed. For example, a person seeking two days of work
per week (out of 5 days) and employed for only one-half day would be identified with Pi = 2/5 =
0.4 and Ui = 1.5/2 = 0.75. The resulting product UiPi = 0.3 would be the portion of a
full workweek for which the person was unemployed.
Now the question is: What value of L constitutes an economically depressed area? The answer
belongs to the decision-maker.
Seasonal index represents the extent of seasonal influence for a particular segment of the
year. The calculation involves a comparison of the expected values of that period to the grand
mean.
We need to get an estimate of the seasonal index for each month, or other periods such as
quarter, week, etc., depending on the data availability. Seasonality is a pattern that repeats for
each cycle. For example, an annual seasonal pattern has a cycle that is 12 periods long if the
periods are months, or 4 periods long if the periods are quarters.
A seasonal index is how much the average for that particular period tends to be above (or
below) the grand average. Therefore, to get an accurate estimate for it, we compute the
average of the first period of the cycle, and the second period, etc, and divide each by the
overall average. The formula for computing seasonal factors is:
Si = Di/D,
where:
Si = the seasonal index for the ith period,
Di = the average value of the ith period,
D = the grand average,
i = the ith seasonal period of the cycle.
A seasonal index of 1.00 for a particular month indicates that the expected value of that
month is equal to the grand monthly average (i.e., 1/12 of the overall annual total). A seasonal
index of 1.25 indicates that the expected value for that month is 25% greater than the grand
monthly average. A seasonal index of 0.80 indicates that the expected value for that month is
20% less than the grand monthly average.
Deseasonalizing Process: Deseasonalizing the data, also called Seasonal Adjustment, is the
process of removing recurrent and periodic variations over a short time frame (e.g., weeks,
quarters, months). Seasonal variations are regularly repeating movements in series
values that can be tied to recurring events. The deseasonalized data are obtained by simply
dividing each time-series observation by the corresponding seasonal index.
Almost all time series published by the government are already deseasonalized using the
seasonal index, to unmask the underlying trends in the data that could otherwise be obscured
by the seasonality factor.
A Numerical Application: The following table provides monthly sales ($000) at a college
bookstore.
Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec   Total
1      196   188   192   164   140   120   112   140   160   168   192   200   1972
2      200   188   192   164   140   122   132   144   176   168   196   194   2016
3      196   212   202   180   150   140   156   144   164   186   200   230   2160
4      242   240   196   220   200   192   176   184   204   228   250   260   2592
Mean:  208.6 207.0 192.6 182.0 157.6 143.6 144.0 153.0 177.6 187.6 209.6 221.0 2185
Index: 1.14  1.14  1.06  1.00  0.87  0.79  0.79  0.84  0.97  1.03  1.15  1.22  12
The sales show a seasonal pattern, with the greatest volume when the college is in session
and a decrease during the summer months. For example, for January the index is the January
average, 208.6, divided by the grand monthly average, 2185/12 ≈ 182.1.
You might like to use the Seasonal Index JavaScript to check your hand computation. As
always you must first use Plot of the Time Series as a tool for the initial characterization
process.
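A minimal Python sketch of the seasonal-index computation for the table above; because the printed Mean and Index rows appear to be rounded from slightly different figures, the results below agree with the table only approximately.

import numpy as np

sales = np.array([
    [196, 188, 192, 164, 140, 120, 112, 140, 160, 168, 192, 200],
    [200, 188, 192, 164, 140, 122, 132, 144, 176, 168, 196, 194],
    [196, 212, 202, 180, 150, 140, 156, 144, 164, 186, 200, 230],
    [242, 240, 196, 220, 200, 192, 176, 184, 204, 228, 250, 260],
], dtype=float)

monthly_means = sales.mean(axis=0)   # average for each month across the four years
grand_mean = sales.mean()            # grand monthly average, about 182.1
indexes = monthly_means / grand_mean
print(np.round(indexes, 2))          # compare with the Index row of the table above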
For testing seasonality based on seasonal index, you may like to use Test for Seasonality
JavaScript.
For modeling the time series having both the seasonality and trend components, visit the
Business Forecasting site.
The foundation of Ideal Weight rests on historical, social, behavioral, cultural, physiological,
metabolic, and genetic perspectives.
The normal digestive process: Normally, as food moves along the digestive tract, digestive
juices and enzymes digest and absorb calories and nutrients (see figure 1). After we chew and
swallow our food, it moves down the esophagus to the stomach, where a strong acid
continues the digestive process. The stomach can hold about 3 pints of food at one time.
When the stomach contents move to the duodenum, the first segment of the small intestine,
bile and pancreatic juice speed up digestion. Most of the iron and calcium in the foods we eat
is absorbed in the duodenum. The jejunum and ileum, the remaining two segments of the
nearly 20 feet of small intestine, complete the absorption of almost all calories and nutrients.
The food particles that cannot be digested in the small intestine are stored in the large
intestine until eliminated.
The history of the formulas for calculating ideal body weight began in 1871 when a French
medical doctor developed a model. These formulas pre-dated and probably influenced
development of the Metropolitan Life tables of height and weight. However, these formulas
have no method to compensate for Age and Current Weight. They are only based on Height.
For people who are very overweight or obese the formulas would suggest an ideal weight that
is virtually impossible to achieve or maintain through dieting.
Body Mass Index or BMI is the standardized method for determining whether your body
weight and the amount of body fat you have are in a healthy range. A BMI Metric Calculator
uses a weight-to-height ratio (BMI = kg/m²) and assigns a number to the result. To get your
approximate BMI using the English system, multiply your weight in pounds by 703, then divide
the result by your height in inches, and divide that result by your height in inches a second
time; i.e., BMI = 703W/h².
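A minimal sketch of the English-system BMI formula in Python; the weight and height used in the example are arbitrary.

def bmi_english(weight_lb, height_in):
    """BMI = 703 W / h^2, with W in pounds and h in inches."""
    return 703.0 * weight_lb / height_in ** 2

print(round(bmi_english(160, 68), 1))   # about 24.3, which falls below the overweight cutoff of 25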
BMI values of practical interest range from about 18.5 to 30 or greater. Generally speaking, a
Body Mass Index over 25 is considered overweight and 30 or above is obese. People with a
higher percentage of body fat tend to have a higher BMI, except for body builders.
The BMI ranges for adults are shown in the following chart.
They are not exact ranges of healthy and unhealthy weights. However, they show that health
risk increases at higher levels of overweight and obesity. Even within the healthy BMI range,
weight gains can carry health risks for adults.
This Body Mass Index chart lets you see if your weight falls within a healthy range. Use this
as a guide only. Work closely with your doctor to develop a weight control plan that is right
for you.
Overweight refers to an excess of body weight, but not necessarily body fat. Obesity means
an excessively high proportion of body fat. Health professionals use a measurement called
body mass index (BMI) to classify an adult's weight as healthy, overweight, or obese. BMI
describes body weight relative to height and is correlated with total body fat content in most
adults. For example, having excess abdominal body fat is a health risk. Men with a waist of
more than 40 inches around and women with a waist of 35 inches or more are at risk for
health problems.
For Men:
For Women:
Women tend to imagine their ideal weight is unrealistically low, so they diet unnecessarily.
Men tend to allow their ideal weight to be higher than medically recommended. Men and
Women should learn from each other.
You might like to use the Body Mass Index JavaScript to check your hand computation.
Notice that, for example, a large waist and wide hips signal accumulation of so-called "intra-
abdominal fat" -- the particularly harmful deep "hidden" fat that surrounds the abdominal
organs and is linked to diabetes, high blood pressure and heart disease. Therefore, one must
also consider the distribution of body fat, which is not captured by weight and height alone.
Further Readings:
Pai M., and Paloucek F. The origin of the "Ideal" body weight equations, Ann Pharmacol, 34, 1066-1069, 2000.
Statistical Technique and Index Numbers
One must be careful in applying or generalizing any statistical technique to the index
numbers. For example, the correlation of ratios raises a potential problem. Specifically, let
X, Y, and Z be three independent variables, so that the pair-wise correlations are zero; however,
the ratios X/Y and Z/Y will be correlated due to the common denominator.
Let I = X1/X2, where X1 and X2 are dependent variables with correlation r, having means and
coefficients of variation m1, c1 and m2, c2, respectively; then the mean and coefficient of
variation of I can be approximated in terms of m1, m2, c1, c2, and r.
For more index numbers and ratios, visit Economics and Financial Ratios and Indices site.
This section is a part of the JavaScript E-labs learning technologies for decision making.
Technical Details and Applications: At the end of each JavaScript you will find a link
under "For Technical Details and Applications Back to:".
MENU
1. Summarizing Data
   Bivariate Sampling Statistics
   Descriptive Statistics
   Determination of the Outliers
   Empirical Distribution Function
   Histogram
   The Three Means
2. Computational probability
   Combinatorial Maths
   Comparing Two Random Variables
   Multinomial Distributions
   P-values for the Popular Distributions
3. Requirements for most tests & estimations
   Removal of the Outliers
   Sample Size Determination
   Test for Homogeneity of Population
   Test for Normality
   Test for Randomness
4. One population & one variable
   Binomial Exact Confidence Intervals
   Estimations with Confidence
   Goodness-of-Fit for Discrete Variables
   Testing the Mean
   Testing the Medians
   Testing the Variance
5. One population & two or more variables
   The Before-and-After Test for Means and Variances
   The Before-and-After Test for Proportions
   Chi-square Test for Crosstable Relationship
   Multiple Regressions
   Polynomial Regressions
   Quadratic Regression
   Simple Regression with Diagnostic Tools
   Testing the Population Correlation Coefficient
6. Two populations & one variable
   Confidence Intervals for Two Populations
   K-S Test for Equality of Two Populations
   Two Populations Testing Means & Variances
7. Several populations & one or more variables
   Analysis of Covariance
   ANOVA: Testing Equality of the Means
   ANOVA for Condensed Data Sets
   Compatibility of Multi-Counts
   Equality of Multi-variances: The Bartlett's Test
   Identical Populations Test for Crosstable Data
   Testing the Proportions
   Testing Several Correlation Coefficients