Introduction To Statistics


Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.
The Purpose of Statistics:
Statistics teaches people to use a limited sample to draw intelligent and accurate conclusions about a larger population. Tables, graphs, and charts play a vital role in presenting the data used to draw these conclusions.
Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation).
The field of statistics is the science of learning from data.
Statistical knowledge helps you use the proper methods to collect
the data, employ the correct analyses, and effectively present the
results. Statistics is a crucial process behind how we make
discoveries in science, make decisions based on data, and make
predictions. Statistics allows you to understand a subject much
more deeply.
What is Data?

Data is a collection of raw, unorganized facts that must be processed to become meaningful. Until it is organized, data remains simple and unstructured. Generally, data comprises facts, observations, perceptions, numbers, characters, symbols, images, etc.
Data must be interpreted, by a human or machine, to derive meaning; on its own, data is meaningless. Data contains numbers, statements, and characters in a raw form.

What is Information?

Information is a set of data that has been processed in a meaningful way according to a given requirement. Information is processed, structured, or presented in a given context to make it meaningful and useful.

It is processed data that possesses context, relevance, and purpose, and it involves the manipulation of raw data. Information assigns meaning to the data and improves its reliability; it helps reduce uncertainty.

Functions or Uses of Statistics

(1) Statistics helps in providing a better understanding and an exact description of a phenomenon of nature.
(2) Statistics helps in the proper and efficient planning of a statistical inquiry in any field of study.
(3) Statistics helps in collecting appropriate quantitative data.
(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic, and graphic form for easy and clear comprehension of the data.
(5) Statistics helps in understanding the nature and pattern of variability of a phenomenon through quantitative observations.
(6) Statistics helps in drawing valid inferences, along with a measure of their reliability, about the population parameters from the sample data.

Statistical series: a statistical series is a systematic flow of data in a logical or specific order. There are three types of statistical series:
1. Individual series
2. Discrete series
3. Continuous series
In an individual series, each value of the variable occurs only once. In a discrete series, the different values of a variable are given in a discontinuous manner with their respective frequencies; here at least one of the values has a frequency of more than one. In a continuous series, the different values of the variable are shown in a continuous manner (as class intervals) with their respective frequencies. These series can be arranged in ascending or descending order.
Frequency distribution
In statistics, a frequency distribution is a list, table, or graph that displays the frequency of various outcomes in a sample. Each entry in the table contains the frequency or count of the occurrences of values within a particular group or interval.
A frequency distribution is an overview of all distinct values in some variable and the number of times they occur. That is, a frequency distribution tells how observations are distributed over values.
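A frequency distribution can be built with Python's standard `collections.Counter`; the scores below are made-up sample data, used only for illustration:

```python
from collections import Counter

scores = [2, 3, 3, 5, 3, 2, 5, 5, 5, 4]  # hypothetical sample
freq = Counter(scores)  # maps each distinct value to its count

# print the distribution, one distinct value per row
for value in sorted(freq):
    print(value, freq[value])
```

Each row pairs a distinct value with the number of times it occurs, which is exactly the table form of a frequency distribution.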
Measures of central tendency
A measure of central tendency is a single value that attempts to
describe a set of data by identifying the central position within that
set of data. As such, measures of central tendency are
sometimes called measures of central location. They are also
classed as summary statistics. The mean (often called the
average) is most likely the measure of central tendency that you
are most familiar with, but there are others, such as the median
and the mode.
The mean, median and mode are all valid measures of central
tendency, but under different conditions, some measures of
central tendency become more appropriate to use than others.

Arithmetic Mean
In mathematics and statistics, the arithmetic mean, or simply the mean or average when the context is clear, is the sum of a collection of numbers divided by the count of numbers in the collection. The arithmetic mean is the simplest and most widely used measure of average: take the sum of a group of numbers, then divide that sum by the count of numbers in the series.
Median
The median is the middle number in a sorted (ascending or descending) list of numbers and can be more descriptive of a data set than the average. The median is sometimes used instead of the mean when there are outliers in the sequence that might skew the average of the values.
In statistics and probability theory, a median is a value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value.
Mode
The mode is the number that appears most frequently in a data
set. A set of numbers may have one mode, more than one mode,
or no mode at all. The mode of a set of data values is the value
that appears most often. If X is a discrete random variable, the
mode is the value x at which the probability mass function takes
its maximum value. In other words, it is the value that is most
likely to be sampled.
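All three measures can be computed with Python's built-in statistics module; the data set below is invented for illustration:

```python
import statistics

data = [4, 2, 7, 4, 9, 4, 6]  # hypothetical observations

mean = statistics.mean(data)      # sum divided by count
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```

Sorted, the data is [2, 4, 4, 4, 6, 7, 9], so the median is 4; the value 4 also occurs most often, so it is the mode as well.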
Combined mean
A combined mean is simply a weighted mean, where the weights are the sizes of the groups. For two or more groups: add the means of the groups, each weighted by its number of individuals or data points, then divide that sum by the total number of individuals (or data points).
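The combined-mean recipe above can be sketched in a few lines of Python; the group figures are hypothetical:

```python
# Combined (weighted) mean of several groups: each group mean is
# weighted by the group's size. Group figures are made up.
group_means = [60.0, 70.0, 80.0]
group_sizes = [10, 20, 30]

# Step 1: sum of (mean * size) over all groups
total = sum(m * n for m, n in zip(group_means, group_sizes))

# Step 2: divide by the total number of individuals
combined_mean = total / sum(group_sizes)
print(combined_mean)  # → 73.33 (approximately)
```

Note that the larger group (size 30) pulls the combined mean toward 80, which a plain average of the three means would not do.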
Measures of dispersion
The measures of central tendency are not adequate to describe
data. Two data sets can have the same mean but they can be
entirely different. Thus to describe data, one needs to know the
extent of variability. This is given by the measures of dispersion.
Range, interquartile range, mean deviation, and standard deviation are the commonly used measures of dispersion.
Range - The range is the difference between the largest and the smallest observation in the data.
Inter Quartile Range - The interquartile range is defined as the difference between the 25th and 75th percentiles (also called the first and third quartiles). Hence the interquartile range describes the middle 50% of observations.
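Both measures can be computed with the standard library; the data set is invented, and note that different textbooks and libraries use slightly different quartile conventions, so IQR values can differ a little between tools:

```python
import statistics

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # hypothetical observations

# Range: largest observation minus smallest observation
data_range = max(data) - min(data)

# statistics.quantiles with n=4 returns the three quartile cut points;
# the exact values depend on the quartile convention used
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(data_range, iqr)
```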
Standard Deviation
In statistics, the standard deviation is a measure of the amount of
variation or dispersion of a set of values. A low standard deviation
indicates that the values tend to be close to the mean of the set,
while a high standard deviation indicates that the values are
spread out over a wider range.
The main purpose of the standard deviation is to show how spread out a data set is. A high standard deviation implies that, on average, data points are far from the mean (the data looks spread out); a low standard deviation means most points are very close to the mean.
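A short sketch of the population standard deviation, computed from first principles on a small example data set:

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)  # mean is 5.0 here

# population variance: the mean of the squared deviations from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)

# standard deviation: square root of the variance
std_dev = math.sqrt(variance)
print(std_dev)  # → 2.0 for this data set
```

The same result is available as `statistics.pstdev(data)`; the sample standard deviation (dividing by n - 1 instead of n) is `statistics.stdev(data)`.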
Mean Deviation
Also referred to as the average deviation, it is defined as the sum of the deviations (ignoring signs) from an average, divided by the number of items in the distribution. The average can be the mean, median, or mode. Theoretically, the median is the best average to choose, because the sum of deviations from the median is a minimum, provided signs are ignored. Practically speaking, however, the arithmetic mean is the most commonly used average for calculating mean deviation.
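Computing the mean deviation about the arithmetic mean, as the text recommends in practice (the data values are invented):

```python
data = [10, 15, 15, 17, 18, 21]  # hypothetical observations
mean = sum(data) / len(data)     # mean is 16.0 here

# mean (absolute) deviation: average distance from the mean,
# ignoring the sign of each deviation
mean_deviation = sum(abs(x - mean) for x in data) / len(data)
print(mean_deviation)
```

Replacing `mean` with the median in the same formula gives the mean deviation about the median.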
What Is Skewness?

Skewness refers to distortion or asymmetry relative to the symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified as a representation of the extent to which a given distribution varies from a normal distribution. A normal distribution has a skew of zero.
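Skewness can be quantified with the moment coefficient (the third central moment divided by the cube of the standard deviation); this is one of several skewness formulas in use, and the data sets below are invented:

```python
def skewness(data):
    """Moment coefficient of skewness: third central moment
    divided by the standard deviation cubed."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # variance
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))     # → 0.0 for a symmetric data set
print(skewness(right_skewed))  # positive: long tail to the right
```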

Correlation
 Correlation is a statistical technique that can show
whether and how strongly pairs of
variables are related. Correlation is a statistic that measures
the degree to which two variables move in relation to each
other.
 In finance, the correlation can measure the movement of a
stock with that of a benchmark index, such as the S&P 500.
 Correlation measures association, but doesn't show if x
causes y or vice versa, or if the association is caused by a
third–perhaps unseen–factor.
 Positive correlation is a relationship between two variables in which both variables move in the same direction: one variable increases as the other increases, and vice versa. For example, a positive correlation may be that the more you exercise, the more calories you burn. Negative correlation, in contrast, is a relationship in which one variable increases as the other decreases, and vice versa.
 1 indicates a perfect positive correlation.
 -1 indicates a perfect negative correlation.
 0 indicates that there is no relationship between the variables.
 Values between -1 and 1 denote the strength of the correlation.
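Pearson's correlation coefficient r puts these ideas into a single number; here is a minimal pure-Python sketch, with invented exercise figures echoing the example above:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # covariance-like sum of products of deviations
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

hours_exercised = [1, 2, 3, 4, 5]            # hypothetical figures
calories_burned = [100, 210, 290, 400, 510]  # roughly linear in hours
print(pearson_r(hours_exercised, calories_burned))  # close to +1
```

Feeding in a perfectly decreasing pairing, such as x = [1, 2, 3] against y = [3, 2, 1], returns a value at (or within floating-point error of) -1.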

Regression analysis

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables.

Regression analysis is primarily used for two conceptually distinct purposes. First, regression analysis is widely used for prediction and forecasting. Second, in some situations it can be used to infer causal relationships between the independent and dependent variables.
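A minimal ordinary least-squares fit with one independent variable illustrates the prediction use; the spend/sales figures are invented and deliberately lie on an exact line:

```python
def linear_regression(x, y):
    """Ordinary least-squares fit of the line y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope: sum of products of deviations over sum of squared x-deviations
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

# hypothetical advertising spend (independent) vs. sales (dependent)
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = linear_regression(spend, sales)
print(slope, intercept)       # → 2.0 1.0 for this exact line
print(slope * 6 + intercept)  # forecast for spend = 6 → 13.0
```

Python 3.10+ also ships this as `statistics.linear_regression`.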

Time Series

A time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive, equally spaced points in time.
Values taken by a variable over time (such as daily sales revenue, weekly orders, monthly overheads, yearly income) are tabulated or plotted as chronologically ordered numbers or data points. To yield valid statistical inferences, these values must be measured repeatedly, often over a four to five year period.
Time series consist of four components:
(1) Seasonal variations that repeat over a specific period such as
a day, week, month, season, etc.,
(2) Trend variations that move up or down in a reasonably
predictable pattern,
(3) Cyclical variations that correspond with business or economic
'boom-bust' cycles or follow their own peculiar cycles, and
(4) Random variations that do not fall under any of the above
three classifications.

Index numbers

An index number is an economic data figure reflecting price or quantity compared with a standard or base value. The base usually equals 100, and the index number is usually expressed as 100 times the ratio to the base value.

In economics and finance, an index is a statistical measure of change in a representative group of individual data points. These data may be derived from any number of sources, including company performance, prices, productivity, and employment. Economic indices track economic health from different perspectives.
The value of money does not remain constant over time. It rises
or falls and is inversely related to the changes in the price level. A
rise in the price level means a fall in the value of money and a fall
in the price level means a rise in the value of money. Thus,
changes in the value of money are reflected by the changes in the
general level of prices over a period of time. Changes in the
general level of prices can be measured by a statistical device
known as ‘index number.’

An index number is a technique for measuring changes in a variable or group of variables with respect to time, geographical location, or other characteristics. There can be various types of index numbers but, in the present context, we are concerned with price index numbers, which measure changes in the general price level (or in the value of money) over a period of time.

A price index number indicates the average of changes in the prices of representative commodities at one time in comparison with those at some other time taken as the base period.
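A simple aggregate price index makes this concrete: total current-year prices over total base-year prices, times 100. The commodity prices below are made up:

```python
# Prices of the same basket of representative commodities
# in the base year and the current year (hypothetical figures).
base_prices = [20, 50, 30]     # base year: index = 100 by definition
current_prices = [25, 60, 35]  # current year

# Simple aggregate price index: 100 * (sum of current / sum of base).
# Multiplying by 100 first keeps the arithmetic exact for these integers.
index_number = sum(current_prices) * 100 / sum(base_prices)
print(index_number)  # → 120.0: prices rose 20% on average
```

Weighted index formulas (e.g. Laspeyres or Paasche) refine this by weighting each commodity by the quantity bought, but the base = 100 convention is the same.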

(i) Index numbers are a special type of average. Whereas the mean, median, and mode measure absolute changes and are used to compare only those series which are expressed in the same units, the technique of index numbers is used to measure the relative changes in the level of a phenomenon where the measurement of absolute change is not possible and the series are expressed in different types of items.

(ii) Index numbers are meant to study the changes in the effects
of such factors which cannot be measured directly. For example,
the general price level is an imaginary concept and is not capable
of direct measurement. But, through the technique of index
numbers, it is possible to have an idea of relative changes in the
general level of prices by measuring relative changes in the price
level of different commodities.

Probability
Probability is the branch of mathematics concerning numerical
descriptions of how likely an event is to occur or how likely it is
that a proposition is true. The probability of an event is a number
between 0 and 1, where, roughly speaking, 0 indicates
impossibility of the event and 1 indicates certainty.

A classic example of a probabilistic experiment is a fair coin toss, in which the two possible outcomes are heads or tails. In this case, the probability of flipping a head or a tail is 1/2. In an actual series of coin tosses, we may get more or fewer than exactly 50% heads. But as the number of flips increases, the long-run frequency of heads is bound to get closer and closer to 50%.
The probability formula states that the probability of an event is the ratio of the number of favourable outcomes to the total number of outcomes:

Probability of an event: P(E) = Number of favourable outcomes / Total number of outcomes
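The formula in code, for the event "roll an even number" on a fair six-sided die:

```python
# P(E) = favourable outcomes / total outcomes
# Event: rolling an even number with a fair six-sided die.
favourable = len([2, 4, 6])  # the even faces
total = 6                    # all faces of the die

p_even = favourable / total
print(p_even)  # → 0.5
```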

Probability Tree
A tree diagram helps to organize and visualize the different possible outcomes. Branches and ends are the two main positions in the tree: the probability of each branch is written on the branch, while each end contains a final outcome.
Mutually exclusive events

In logic and probability theory, two events are mutually exclusive or disjoint if they cannot both occur at the same time. A clear example is the set of outcomes of a single coin toss, which can land on heads or tails, but not both.

Examples:
 Turning left and turning right are Mutually Exclusive (you
can't do both at the same time)
 Tossing a coin: Heads and Tails are Mutually Exclusive
 Cards: Kings and Aces are Mutually Exclusive

What is not Mutually Exclusive:
 Turning left and scratching your head can happen at the same time
 Kings and Hearts, because we can have a King of Hearts!

Independent events

This is a fundamental notion in probability theory, as in statistics and the theory of stochastic processes. Two events are statistically independent if the occurrence of one does not affect the probability of occurrence of the other.

Dependent events

Two events are dependent if the outcome or occurrence of the first affects the outcome or occurrence of the second, so that the probability is changed.
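The multiplication rules for the two cases can be checked numerically; both examples are the standard textbook ones (coin tosses and card draws):

```python
# Independent events: two fair coin tosses.
# P(A and B) = P(A) * P(B)
p_two_heads = 0.5 * 0.5
print(p_two_heads)  # → 0.25

# Dependent events: drawing two aces from a 52-card deck without
# replacement. After one ace is drawn, only 3 aces remain among 51
# cards, so the second probability changes.
# P(A and B) = P(A) * P(B given A)
p_two_aces = (4 / 52) * (3 / 51)
print(round(p_two_aces, 4))  # → 0.0045
```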

Data classification

In the field of data management, data classification, as part of the Information Lifecycle Management process, can be defined as a tool for categorizing data so that organizations can effectively answer questions such as: What data types are available? Where are certain data located?
Data classification is the process of analyzing structured or unstructured data and organizing it into categories based on file type and contents.
Bar charts

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart.
A bar graph shows comparisons among discrete categories. One
axis of the chart shows the specific categories being compared,
and the other axis represents a measured value. Some bar
graphs present bars clustered in groups of more than one,
showing the values of more than one measured variable.
Bar charts are a type of graph that are used to display and
compare the number, frequency or other measure (e.g. mean) for
different discrete categories of data. Bar charts are one of the
most commonly used types of graph because they are simple to
create and very easy to interpret. They are also a flexible chart
type and there are several variations of the standard bar chart
including horizontal bar charts, grouped or component charts, and
stacked bar charts.

Variations include the simple bar chart, the multiple bar chart, and the sub-divided bar chart.

Meaning of Graphic Representation of Data:

Graphic representation is another way of analysing numerical data. A graph is a sort of chart through which statistical data are represented in the form of lines or curves drawn across coordinate points plotted on its surface.

Graphs enable us to study the cause and effect relationship between two variables. Graphs help to measure the extent of change in one variable when another variable changes by a certain amount.

General Principles of Graphic Representation:

There are some algebraic principles which apply to all types of graphic representation of data. In a graph there are two lines called coordinate axes: one vertical, known as the Y axis, and the other horizontal, called the X axis. These two lines are perpendicular to each other, and the point where they intersect is called ‘0’, or the origin. On the X axis, distances to the right of the origin have positive values (see fig. 7.1) and distances to the left of the origin have negative values. On the Y axis, distances above the origin have positive values and distances below the origin have negative values.

Pie Diagram
A pie chart is a circular statistical graphic which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice is proportional to the quantity it represents.
Uses of Pie diagram:
1. A pie diagram is useful when one wants to picture proportions of the total in a striking way.
2. When a population is stratified and each stratum is to be presented as a percentage, a pie diagram is used.

Merits of a Graph

 The graph presents data in a manner which is easier to understand.
 It allows us to present statistical data in an attractive manner
as compared to tables. Users can understand the main
features, trends, and fluctuations of the data at a glance.
 A graph saves time.
 It allows the viewer to compare data relating to two different
time-periods or regions.
 The viewer does not require prior knowledge of mathematics
or statistics to understand a graph.
 We can use a graph to locate the mode, median, and mean
values of the data.
 It is useful in forecasting, interpolation, and extrapolation of
data.

Sampling

In statistics, quality assurance, and survey methodology, sampling is the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attempt to choose samples that represent the population in question.

Sampling is a process used in statistical analysis in which a predetermined number of observations are taken from a larger population. The methodology used to sample from a larger population depends on the type of analysis being performed, but it may include simple random sampling or systematic sampling.
Types of Sampling

There are five types of sampling: random, systematic, convenience, cluster, and stratified.

 Random sampling is analogous to putting everyone's name into a hat and drawing out several names. Each element in the population has an equal chance of being selected. While this is the preferred way of sampling, it is often difficult to do, because it requires that a complete list of every element in the population be obtained. Computer-generated lists are often used with random sampling. You can generate random numbers using the TI82 calculator.
 Systematic sampling is easier to do than random sampling.
In systematic sampling, the list of elements is "counted off".
That is, every kth element is taken. This is similar to lining
everyone up and numbering off "1,2,3,4; 1,2,3,4; etc". When
done numbering, all people numbered 4 would be used.
 Convenience sampling is very easy to do, but it's probably
the worst technique to use. In convenience sampling, readily
available data is used. That is, the first people the surveyor
runs into.
 Cluster sampling is accomplished by dividing the population into groups, usually geographically. These groups are called clusters or blocks. The clusters are randomly selected, and each element in the selected clusters is used.
 Stratified sampling also divides the population into groups
called strata. However, this time it is by some characteristic,
not geographically. For instance, the population might be
separated into males and females. A sample is taken from
each of these strata using either random, systematic, or
convenience sampling.
 Choice of sampling method depends upon the purpose, the time available, and the budget.
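Random and systematic sampling can be sketched with Python's random module; the population here is a hypothetical list of 100 numbered individuals, and the seed is fixed only to make the sketch reproducible:

```python
import random

# A hypothetical population of 100 numbered individuals.
population = list(range(1, 101))
random.seed(42)  # fixed seed so the sketch is reproducible

# Random sampling: every element has an equal chance of selection.
random_sample = random.sample(population, 10)

# Systematic sampling: take every k-th element after a random start.
k = 10
start = random.randrange(k)
systematic_sample = population[start::k]

print(len(random_sample), len(systematic_sample))  # → 10 10
```

Note that random sampling needs the complete population list, exactly as the text says; systematic sampling only needs the elements lined up in order.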

Limitations of sampling

A serious limitation of the sampling method is that it can involve biased selection and thereby lead us to draw erroneous conclusions. Bias arises when the method used to select the sample is faulty. Relatively small samples properly selected may be much more reliable than large samples poorly selected.
Test of significance

A test of significance is a formal procedure for comparing observed data with a claim (also called a hypothesis), the truth of which is being assessed.
What Are Tests for Significance?
Two questions arise about any hypothesized relationship between two variables:
1) What is the probability that the relationship exists?
2) If it does, how strong is the relationship?
There are two types of tools that are used to address these questions: the first question is addressed by tests for statistical significance, and the second by measures of association.

Tests for statistical significance are used to address the question: what is the probability that what we think is a relationship between two variables is really just a chance occurrence?
If we selected many samples from the same population, would we still find the same relationship between these two variables in every sample? If we could do a census of the population, would we also find that this relationship exists in the population from which the sample was drawn? Or is our finding due only to random chance?

Tests for statistical significance tell us what the probability is that the relationship we think we have found is due only to random chance. They tell us the probability that we would be making an error if we assumed that a relationship exists.
We can never be 100% certain that a relationship exists between two variables. There are too many sources of error to be controlled, for example, sampling error, researcher bias, problems with reliability and validity, simple mistakes, etc.

Steps in Testing for Statistical Significance


1) State the Research Hypothesis
2) State the Null Hypothesis
3) Select a probability of error level (alpha level)
4) Select and compute the test for statistical significance
5) Interpret the results
 
1) State the Research Hypothesis
    A research hypothesis states the expected relationship
between two variables. It may be stated in general terms, or it
may include dimensions of direction and magnitude. For example,
General: The length of the job training program is related to
the rate of job placement of trainees.
Direction: The longer the training program, the higher the
rate of job placement of trainees.
Magnitude: Longer training programs will place twice as
many trainees into jobs as shorter programs.
General: Graduate Assistant pay is influenced by gender.
Direction: Male graduate assistants are paid more than
female graduate assistants.
Magnitude: Female graduate assistants are paid less than
75% of what male graduate assistants are paid.
2) State the Null Hypothesis
    A null hypothesis usually states that there is no relationship
between the two variables. For example,
There is no relationship between the length of the job
training program and the rate of job placement of trainees.
Graduate assistant pay is not influenced by gender.
A null hypothesis may also state that the relationship proposed in
the research hypothesis is not true. For example,
Longer training programs will place the same number or
fewer trainees into jobs as shorter programs.
Female graduate assistants are paid at least 75% or more of
what male graduate assistants are paid.
Researchers use a null hypothesis in research because it is
easier to disprove a null hypothesis than it is to prove a research
hypothesis. The null hypothesis is the researcher's "straw man."
That is, it is easier to show that something is false once than to
show that something is always true. It is easier to find
disconfirming evidence against the null hypothesis than to find
confirming evidence for the research hypothesis.
 
TYPE I AND TYPE II ERRORS

Even in the best research project, there is always a possibility (hopefully a small one) that the researcher will make a mistake regarding the relationship between the two variables. There are two possible mistakes or errors.

 The first is called a Type I error. This occurs when the researcher
assumes that a relationship exists when in fact the evidence is
that it does not. In a Type I error, the researcher should accept
the null hypothesis and reject the research hypothesis, but the
opposite occurs. The probability of committing a Type I error is
called alpha.

The second is called a Type II error. This occurs when the researcher assumes that a relationship does not exist when in fact the evidence is that it does. In a Type II error, the researcher should reject the null hypothesis and accept the research hypothesis, but the opposite occurs. The probability of committing a Type II error is called beta.

   

SELECT A PROBABILITY OF ERROR LEVEL (ALPHA LEVEL)

Researchers generally specify the probability of committing a Type I error that they are willing to accept, i.e., the value of alpha. In the social sciences, most researchers select an alpha = .05. This means that they are willing to accept a 5% probability of making a Type I error, of assuming a relationship between two variables exists when it really does not. In research involving public health, however, an alpha of .001 is not unusual: researchers do not want a probability of being wrong more than 0.1% of the time, or one time in a thousand.

If the relationship between the two variables is strong (as assessed by a measure of association), and the level chosen for alpha is .05, then moderate or small sample sizes will detect it. As relationships get weaker, however, and/or as the level of alpha gets smaller, larger sample sizes will be needed for the research to reach statistical significance.
Using T-Tests
 T-Tests are tests for statistical significance that are used with
interval and ratio level data. T-tests can be used in several
different types of statistical tests:
1) to test whether there are differences between two groups
on the same variable, based on the mean (average) value of
that variable for each group; for example, do students at
private schools score higher on the SAT test than students
at public schools?
2) to test whether a group's mean (average) value is greater
or less than some standard; for example, is the average
speed of cars on freeways in California higher than 65 mph?
3) to test whether the same group has different mean
(average) scores on different variables; for example, are the
same clerks more productive on IBM or Macintosh
computers?
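The first use above (comparing two groups on the same variable) can be sketched by computing the two-sample t statistic by hand, assuming equal variances; the SAT figures are invented, and in practice the resulting t would be compared against a t-table (or a library p-value) at the chosen alpha level:

```python
import math

def two_sample_t(a, b):
    """Two-sample t statistic (equal variances assumed): the difference
    of group means divided by the pooled standard error."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # sample variances (dividing by n - 1)
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    # pooled variance combines both groups, weighted by degrees of freedom
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

# hypothetical SAT scores for private- vs public-school students
private = [1200, 1250, 1300, 1280, 1220]
public = [1100, 1150, 1180, 1120, 1160]
print(two_sample_t(private, public))  # positive: private mean is higher
```

A large absolute t (relative to the critical value for na + nb - 2 degrees of freedom) would lead us to reject the null hypothesis of equal means.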
The Chi Square Test
Chi square is used as a test for statistical significance. For example, we might hypothesize that there is a relationship between the type of training program attended and the job placement success of trainees.
Normal Distribution

The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement errors, and IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve.
The normal distribution is a probability function that describes how
the values of a variable are distributed. It is a symmetric
distribution where most of the observations cluster around the
central peak and the probabilities for values further away from
the mean taper off equally in both directions. Extreme values in
both tails of the distribution are similarly unlikely.

Properties of a normal distribution

 The mean, mode and median are all equal.
 The curve is symmetric at the center, around the mean.
 Exactly half of the values are to the left of center and exactly half are to the right.
 The total area under the curve is 1.
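These properties can be checked with `statistics.NormalDist` from Python's standard library:

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1.
dist = NormalDist(mu=0, sigma=1)

# Symmetry: exactly half of the area lies to the left of the mean.
print(dist.cdf(0))  # → 0.5

# About 68.27% of values fall within one standard deviation of the mean.
print(dist.cdf(1) - dist.cdf(-1))  # ≈ 0.6827
```

Extending the interval to two and three standard deviations gives roughly 95.45% and 99.73% of the area, the familiar 68-95-99.7 rule.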
