Basic Concepts in Statistics


Definition of Statistics

• Statistics came from the Latin word ‘status’, meaning ‘state’
• In the plural sense, statistics is any set of numerical data (e.g. vital statistics, monthly sales)
• In the singular sense, statistics is a branch of science that deals with the collection, presentation, analysis and interpretation of data

General Uses of Statistics

• Aids in decision-making
o provides comparison
o explains actions that have taken place
o justifies a claim or assertion
o predicts future outcomes
o estimates unknown quantities
• Summarizes data for public use

Areas of Statistics

1. Descriptive Statistics
• Statistics used to describe the characteristics of a distribution of scores. They apply only to the members of a sample or population from which data have been collected.
• Generalizability to the population is not the objective of descriptive statistics
2. Inferential Statistics
• Statistics, derived from sample data, that are used to make inferences about the population from which the sample was drawn.
• Generalizability is important in this type of statistics because it is the ability to use the results of data collected from a sample to reach conclusions about the characteristics of the population.

Descriptive statistics vs Inferential statistics

DESCRIPTIVE STATISTICS
• A student wants to find his average score on his 3 exams
• A housewife wants to determine the average weekly expenditure of the household
• A politician wants to know the exact number of votes he received in the last election

INFERENTIAL STATISTICS
• A student wants to estimate his chance of passing the subject based on his average score.
• A housewife would like to predict, based on last year’s weekly expenditures, the average weekly expenditure she will spend on groceries for this year.
• A politician would like to estimate, based on an opinion poll, his chance of winning an election

1.2 Variables and Levels of Measurement

• Data - facts or figures from which conclusions may be drawn
• Data Set - collection of facts and figures
• Variable - a characteristic or attribute of persons or objects which can assume different values or labels under statistical study
Examples: income, sex, age, height, attitudes about school, score on a measure of depression, etc.
• Measurement - process of determining the value or label of a particular variable for a particular experimental unit

Types of Variables

1. Quantitative Variable - outcomes of the variable are expressed numerically, and the numbers are meaningful or indicate some sort of amount
Examples: age, allowance, weight, height, etc.
2. Qualitative Variable - outcomes of the variable are expressed nonnumerically or categorically
Examples: sex, religion, eye color, marital status, etc.

Kinds of Quantitative Variable

1. Discrete Variable
- a variable which can assume a finite, or at most countably infinite, number of values
- usually measured by counting or enumeration
- answers the question “how many”
Examples: number of pets, number of children, number of subjects taken, etc.
2. Continuous Variable
- a variable which can assume infinitely many values corresponding to a line interval
- gives rise to measurement
- answers the question “how much”
Examples: height, weight, allowance, etc.

Scales/Levels of Measurement for Variables


1. Nominal (classificatory scale)
• weakest level of measurement, where numbers or symbols are used simply for categorizing subjects into different groups
Examples:
marital status (single, married, divorced, widowed)
religious affiliation (Catholic, INC, Born Again, etc.)

2. Ordinal (ranking scale)
• numbers assigned to categories of any variable may be ranked or ordered in some low-to-high manner
• limitation: differences between rankings may appear equal when in reality it is known that they are not
Examples:
athletes' rankings
teaching ratings

3. Interval
• has the properties of the nominal and ordinal levels, and in addition, the distances between any two numbers on the scale are of known sizes
• In an interval scale, there is no ‘true zero’ point
Examples:
temperature (in degrees Celsius or Fahrenheit)
IQ scores

4. Ratio
• highest level of measurement
• contains the properties of nominal, ordinal and interval, and in addition, it has a ‘true zero’ point
• in economics or business, an income variable could be measured on a ratio scale because it makes sense to talk of ‘zero’ income, while it makes no sense to talk about ‘zero’ intelligence
Examples:
weight
number of books

1.3 Important Terms and Notations in Statistics

Basic Terms in Statistics

Population - The totality of elements under a statistical study
Any value generated from or applied to the population is a parameter - a numerical characteristic of the population.
Census – collecting information from all units of the population
Sample - A subset of the population
Any value derived from the sample, such as the mean, is a statistic - a numerical characteristic of the sample.
Survey Sampling – studying a subset of the population
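
The distinction between a parameter (computed from the whole population) and a statistic (computed from a sample) can be sketched in a few lines of Python. The population of ages below is made up purely for illustration, and the seed and sample size of 50 are arbitrary assumptions.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 people (made-up data).
rng = random.Random(42)
population = [rng.randint(18, 80) for _ in range(1000)]

# Parameter: a numerical characteristic of the whole population (a census).
population_mean = statistics.mean(population)

# Statistic: a numerical characteristic of a sample (a subset of the population).
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two values will usually be close but not identical; the sample mean changes from sample to sample, while the population mean is fixed.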
1.4 Sampling, Collection and Presentation of Data

Two Types of Data:


1. Primary data - information gathered directly from an original source, or based on direct or first-hand experience.
2. Secondary data – information taken from published or unpublished data previously gathered by other individuals or groups.

General Classifications of Collecting Data


§ Census
• complete enumeration
• process of gathering information from every unit in the population
* not always possible to get timely, accurate and economical data
* costly, especially if the number of units in the population is too large
§ Survey Sampling
• process of obtaining information from the units in the selected sample

Advantages of Survey Sampling:


• Reduced cost
• Greater speed
• Greater scope
• Greater accuracy

2 Procedures of Survey Sampling


1. Probability Sampling
• a sampling procedure that gives every element of the population a (known) nonzero chance of being selected in the sample
2. Non-probability Sampling
• not all the elements of the population are given a chance to be included in the sample
Remarks:
• Whenever possible, probability sampling is used because there is no objective way of assessing the reliability of inferences under non-probability sampling.
• What we want is a sample that is representative of the population.
• The sampling frame is a listing of all individual units in the population.

Methods of Probability Sampling


1. Simple Random Sampling
2. Systematic Sampling
3. Stratified Sampling
4. Cluster Sampling

1. Simple Random Sampling


• Method of sampling where all members of the population have an equal chance of being included in the sample.
• This procedure is suitable when the population being studied is homogeneous with respect to the characteristic under investigation.
• This is usually done by drawing lots, by the use of a table of random numbers, or by the use of a calculator.
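
As a rough software counterpart to drawing lots, a simple random sample can be drawn with Python's standard `random` module. This is only a sketch: the sampling frame of 500 ID numbers, the sample size of 20 and the seed are all hypothetical choices.

```python
import random

# Hypothetical sampling frame: ID numbers of the 500 units in the population.
sampling_frame = list(range(1, 501))

rng = random.Random(2024)  # fixed seed so the draw can be reproduced

# Draw 20 units without replacement; each unit has the same chance of selection.
sample = rng.sample(sampling_frame, k=20)

print(sorted(sample))
```

Because `sample` draws without replacement, no unit can appear twice in the result.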

Methods of Non-probability Sampling


1. Purposive Sampling
• sets out to make a sample agree with the profile of the population based on some pre-selected characteristic
2. Quota Sampling
• selects a specified number (quota) of sampling units possessing certain characteristics
3. Convenience Sampling
• also called accidental/haphazard sampling
• selecting a sample based on ease of access or availability
4. Judgment Sampling
• selects a sample in accordance with an expert’s judgment

3 Methods of Data Collection


1. Objective Method
• data are gathered with the use of a measuring or counting device
examples: meter stick, tape measure, weighing scale
2. Subjective Method
• data are collected by conducting an interview, which may be done personally, or by gathering self-administered questionnaires
• written or verbal reports are elicited from identified respondents

Self-Administered Questionnaire:
§ Obtained information is limited to the subject’s written answers to prearranged questions
§ It can be administered to a large number of people simultaneously
§ Respondents may feel freer to express views and are less pressured to answer immediately
§ Lower response rate

Personal Interview:
§ Missing information and vague responses are minimized with proper probing by the interviewer
§ It is administered to a person or group one at a time
§ Respondents may feel more cautious, particularly in answering sensitive questions, for fear of disapproval
§ Higher response rate through call-backs

3. Use of Existing Studies/Records
• use of secondary data
examples: census, health statistics, weather bureau reports, etc.
2 Types:
Documentary sources – published or written reports, periodicals, unpublished documents, etc.
Field sources - researchers who have done studies on the area of interest are asked personally or directly for information needed

3 Methods of Presenting Data
1. Textual Presentation
2. Graphical Presentation
3. Tabular Presentation

Parts of a Formal Statistical Table
1. Heading
• consists of a table number, title and headnote
2. Caption
• the portion of the table that contains the column heads which describe the data in each column
3. Stub
• the portion of the table usually comprising the first column on the left
4. Field
• main part of the table because this contains the substance or the figures of one’s data
*other optional parts: source note, footnote, etc.

3. Graphical Presentation
• presentation of data in the form of a graph or diagram
Advantages
• Main features and implications of a body of data can be grasped at a glance
• Can attract attention and hold the reader’s interest
• Simplifies concepts that would otherwise have been expressed in so many words
• Can readily clarify data and frequently bring out hidden facts and relationships
Exploratory Data Analysis
2.1 Numerical Summary Measures

2.1.2 Measures of Central Tendency


• This chapter presents several numerical measures that provide additional alternatives for summarizing data
• A number that is meant to convey the idea of ‘centralness’ for the data set
• A value about which the set of observations tends to cluster
• Typical/average value of the data set
3 measures of central tendency:
1. Arithmetic mean
2. Median
3. Mode
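
The three measures can be computed with Python's standard `statistics` module. The scores below are made up for illustration, and the added extreme value of 400 is an assumption chosen only to show how differently the mean and the median react to it.

```python
import statistics

# Made-up exam scores, used only for illustration.
scores = [70, 75, 75, 80, 85, 90, 95]

print(statistics.mean(scores))       # arithmetic mean
print(statistics.median(scores))     # middle value of the sorted data
print(statistics.multimode(scores))  # list of all most frequent values

# One extreme value shifts the mean far more than the median.
with_outlier = scores + [400]
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

`multimode` is used rather than `mode` because it returns every most frequent value, which matters when a data set has more than one mode.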

Properties of the Mean:

• It always exists.
• It is unique.
• It reflects the magnitude of every observation.
• It is easily affected by extreme values.
• The means of subgroups can be combined into the overall mean of all the data, called the weighted mean.
• The sample mean is a point estimator of the population mean.
• The mean discussed here is also called the arithmetic mean. There are other kinds of mean, such as the harmonic mean (for averaging rates), the geometric mean, etc.

Properties of the Median:

• It is a positional value.
• Extreme values do not affect the median as strongly as they do the mean.
Remark:
• The median is the measure of location most often reported for annual income and property value data because a few extremely large incomes or property values can inflate the mean. In such cases, the median is the preferred measure of central location.

Characteristics of the Mode

• It does not always exist; and if it does, it may not be unique. A data set is said to be unimodal if there is only one mode, bimodal if there are two modes, trimodal if there are three modes, and so on.
• It is not affected by extreme values.
• The mode can be used for qualitative as well as quantitative data.

Graphical Comparison of the 3 Measures of Central Tendency

• Measures of skewness help us to know to what degree and in which direction (positive or negative) the frequency distribution departs from symmetry.
• In mathematics, a figure is called symmetric if there exists a point in it such that a perpendicular drawn through that point to the X-axis divides the figure into two congruent parts, i.e. parts that are identical in all respects, so that one part can be superimposed on the other (mirror images of each other).
• In statistics, a distribution is called symmetric if the mean, median and mode coincide. Otherwise, the distribution is asymmetric.
1. Range
• difference between the highest value (HV) and the lowest value (LV) in the population
• it uses only the extreme values
• it fails to communicate any information about the clustering or the lack of clustering of the values between the extremes
• a weakness is that an outlier can greatly alter its value
R = HV - LV = max - min

3. Standard Deviation
• The average deviation between the individual scores in the distribution and the mean for the distribution; square root of the variance
• Values close together have a small standard deviation, but values with much more variation have a larger standard deviation.
• The standard deviation has the same units of measurement (such as minutes or grams or dollars) as the original data values.
• It is affected by the value of every observation. It may be distorted by a few extreme values.

2. Chebyshev’s Theorem
• enables us to make statements about the proportions of data values that must be within a specified number of standard deviations of the mean
• Chebyshev’s Theorem can be applied to any data set regardless of the shape of the distribution
• it states that at least (1 - 1/z^2) of the data values must be within z standard deviations of the mean, where z is any value greater than 1
Some implications of this theorem, with z = 2, 3, and 4 standard deviations of the mean:
§ At least 75% of the data values must be within z = 2 standard deviations of the mean.
§ At least 89% of the data values must be within z = 3 standard deviations of the mean.
§ At least 94% of the data values must be within z = 4 standard deviations of the mean.
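
The bound 1 - 1/z^2 and the percentages above can be checked numerically. The following is a minimal sketch: the helper names `chebyshev_bound` and `proportion_within` are hypothetical, the data set is made up and deliberately skewed, and the population standard deviation (`statistics.pstdev`) is assumed for the comparison.

```python
import statistics

def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations (z > 1)."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z**2

def proportion_within(data, z):
    """Observed proportion of data within z standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation of the data
    return sum(abs(x - mean) <= z * sd for x in data) / len(data)

# Made-up, deliberately skewed data set (one large outlier).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]

for z in (2, 3, 4):
    print(z, round(chebyshev_bound(z), 4), proportion_within(data, z))
```

For each z, the observed proportion should be at least as large as the bound, even though the data are far from bell-shaped.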

2.2 Graphical Summary Techniques


Principles of Probability and Probability Distributions

3.1 Basic Concepts of Probability

3.2 Probability Distributions


A discrete random variable is a variable that can assume only a countable number of values
Many possible outcomes:
› number of complaints per day
› number of TV’s in a household
› number of rings before the phone is answered
Only two possible outcomes:
› gender: male or female
› defective: yes or no
› spreads peanut butter first vs. spreads jelly first

A continuous random variable is a variable that can assume any value on a continuum (can assume an uncountable number of values)
› thickness of an item
› time required to complete a task
› temperature
› height, in inches
• These can potentially take on any value, depending only on the ability to measure accurately.
3.3 Definition and Properties of Normal Distribution

Properties of the Normal Curve:

1. A normal distribution curve is bell-shaped.
2. The mean, median, and mode are equal and are located at the center of the distribution.
3. A normal distribution curve is unimodal (i.e., it has only one mode).
4. The curve is symmetric about the mean, which is equivalent to saying that its shape is the same on both sides of a vertical line passing through the center.
5. The curve is continuous; that is, there are no gaps or holes. For each value of X, there is a corresponding value of Y.
6. The curve never touches the x axis. Theoretically, no matter how far in either direction the curve extends, it never meets the x axis, but it gets increasingly close.
7. The total area under a normal distribution curve is equal to 1.00, or 100%.
8. The area under the part of a normal curve that lies within 1 standard deviation of the mean is approximately 0.68, or 68%; within 2 standard deviations, about 0.95, or 95%; and within 3 standard deviations, about 0.997, or 99.7%.

3.4 Standard Normal Distribution
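
The areas quoted for 1, 2 and 3 standard deviations (about 68%, 95% and 99.7%) can be reproduced from the standard normal distribution. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available; the variable names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standard normal distribution

for k in (1, 2, 3):
    # Area under the curve between -k and +k standard deviations from the mean.
    area = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {area:.4f}")
```

Because any normal variable can be converted to a z value, the same three areas apply to every normal distribution, whatever its mean and standard deviation.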
3.5 Finding the Area under the Standard Normal Distribution Curve

• For the solution of problems using the standard normal distribution, a three-step process is recommended:
Step 1: Draw the normal distribution curve and shade the area.
Step 2: Look up the z value in the table.
Step 3: Perform the necessary mathematical operation in order to compute the probability of the shaded area.
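
Steps 2 and 3 can also be carried out in software, with the cumulative distribution function taking the place of the printed table. This is a sketch under that assumption; the helper names `area_between` and `area_right_of` and the z values 1.25 and 1.96 are chosen only for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def area_between(a, b):
    """Area under the standard normal curve between z = a and z = b."""
    return z.cdf(b) - z.cdf(a)

def area_right_of(a):
    """Area under the standard normal curve to the right of z = a."""
    return 1 - z.cdf(a)

print(round(area_between(0, 1.25), 4))  # shaded area from z = 0 to z = 1.25
print(round(area_right_of(1.96), 4))    # shaded area in the right tail
```

Drawing the curve first (Step 1) is still worthwhile, since it shows whether the required area is a difference of two table values or a complement.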
3.6 Applications of Normal Distribution

3.7 The Central Limit Theorem


Remarks:
1. When the original variable is normally distributed, the distribution of the sample means will be normally distributed, for any sample size n.
2. When the distribution of the original variable is not normal, a sample size of 30 or more is generally needed to use a normal distribution to approximate the distribution of the sample means. The larger the sample, the better the approximation will be.
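
The second remark can be explored with a small simulation. The sketch below makes several assumptions not stated in the notes: the population is exponential with mean 1 and standard deviation 1 (strongly skewed, so clearly not normal), 2,000 samples are drawn, and the seed is fixed. It also uses the standard result that the standard deviation of the sample means is the population standard deviation divided by the square root of n.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility
n = 30                  # sample size
num_samples = 2000      # number of repeated samples (arbitrary choice)

# Draw repeated samples from the skewed population and record each sample mean.
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1 / math.sqrt(n)  # population SD (1) divided by sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"SD of sample means:   {sd_of_means:.3f} (expected about {expected_sd:.3f})")
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual observations are heavily right-skewed.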
