Describing Data Pt.2

DATA ANALYSIS: METHODOLOGICAL BIG PICTURE
STATISTICAL ANALYSIS
Presentation Outline
• Review
o Structure of Research
o Dimensions of Research
o Research Process
o Operationalization and Levels of Measurement
o Study Designs
• Statistical Analysis
o Sampling Distributions and z-scores
o Hypothesis Test
o Estimation
• Study Design Considerations
• Points of Confusion
The Structure of Research:
Deduction
The “Hourglass” Notion of Research
Begin with broad questions, then narrow down and focus in.

Operationalize
OBSERVE
Analyze Data
Reach Conclusions
Generalize back to Questions
The Scientific Method
Problem/Question
Observation/Research
Formulate a Hypothesis
Experiment
Collect and Analyze Results
Conclusion
Communicate the Results
The Empirical Research Process:
Step 1 Identification of Area of Study: Problem Formulation
Step 2 Literature Review: Context
Step 3 Research Objectives to Hypotheses: Content to Methodology
• Concepts to Variables
Step 4 Study Design I: Data Collection Methods
• Research Design: experimental, quasi-experimental, or non-experimental
• Time & Unit of Analysis
Step 5 Procedures: Sampling, Assignment, Recruitment, & Ethics
Step 6 Collection: Instruments, Materials, & Management
Step 7 Study Design II: Analysis
• Statistical Approaches & Analytical Techniques
• Sample Size & Power
Step 8 Results: Dissemination
• Publication, Presentation, & New Application
[Vertical side labels on the slide: DEDUCTION, THEORY]
The Dimensions of Empirical Research:
A movement from the theoretical to analytical
[Diagram: EMPIRICAL RESEARCH / SCIENTIFIC METHOD at the center, linked by Deductive Reasoning. Theoretical side: Theories, Constructs, Propositions, Concepts, Postulates. Analytical side: Analysis, Data Collection, Hypotheses, Variables, Measurement.]
Data Analysis:
In the Big Picture of Methodology
Note: Results of empirical scientific studies always begin with Descriptive Statistics; whether the results conclude with Inferential Statistics depends on the Research Objectives/Aims.
[Diagram: boxes labeled Theory; Question to Answer; Hypothesis to Test; Study Design: Data Collection Method & Analysis; Collect Data: Measurements, Observations; Data Storage; Data Extraction; Descriptive Statistics: Describe Characteristics (Organize, Summarize, & Condense the Numbers); Decision: Statistics?; Inferential Statistics: Causal Inference (Test Hypothesis, Conclusions, Interpretation, & Identification of Relationships)]
Data Analysis:
Types of Statistics
• Descriptive Statistics
o Summarization & Organization of variable values/scores
for the sample
• Inferential Statistics
o Inferences made from the Sample Statistic to the
Population Parameter.
o Able to Estimate Causation or make Causal Inference
• Isolate the effect of the Experimental (Independent) Variable
on the Outcome (Dependent) Variable
Data Analysis:
Descriptive Statistics
• Descriptive Statistics are procedures used for organizing and summarizing
scores in a sample so that the researchers can describe or communicate the
variables of interest.
• Note: Descriptive Statistics apply only to the sample: they say nothing about how
accurately the data may reflect the reality in the population.
• Sample Statistics are used to “infer” something about relationships in the entire
population: this assumes the sample is representative of the population.
• Descriptive Statistics that summarize 1 variable are also known as Univariate Statistics.
• Mean, Median, Mode, Range, Frequency Distribution, Variance, and
Standard Deviation are the Descriptive Statistics: Univariates.
Data Analysis:
Inferential Statistics
• Inferential Statistics are procedures designed to test the likelihood of finding the same
results from one sample with another sample drawn from the same population: in
fact, they mathematically test whether the sample results would be obtained if all possible
samples from the population were tested.
• Attempts to rule out chance as an explanation for the results: that results reflect real
relationships that exist in the population and are not just random or only by chance.
• Before you can describe or evaluate a relationship using statistics, you must design
your study so that your research question can be addressed.
• This is Methodology: where theory meets Data Collection Methods & Data Analysis.
Data Analysis:
Statistics Notation
Capitalization
In general, capital letters refer to population attributes (i.e., parameters);
and lower-case letters refer to sample attributes (i.e., statistics).
For example,
• P refers to a population proportion;
o and p, to a sample proportion.
• X refers to a set of population elements;
o and x, to a set of sample elements.
• N refers to population size;
o and n, to sample size.
Greek vs. Roman Letters
• Like capital letters, Greek letters refer to population attributes.
• Their sample counterparts, however, are usually Roman letters.
For example,
• μ refers to a population mean;
o and x̄, to a sample mean.
• σ refers to the standard deviation of a population;
o and s, to the standard deviation of a sample.
Data Analysis:
Statistics Notation
Population Parameters
By convention, specific symbols represent certain population parameters.
Notation
• μ refers to a population mean.
• σ refers to the standard deviation of a population.
• σ² refers to the variance of a population.
• P refers to the proportion of population elements that have a particular attribute.
• Q refers to the proportion of population elements that do not have a particular
attribute, so Q = 1 - P.
• ρ is the population correlation coefficient, based on all of the elements from a
population.
• N is the number of elements in a population.
Sample Statistics
By convention, specific symbols represent certain sample statistics.
Notation
• x̄ refers to a sample mean.
• s refers to the standard deviation of a sample.
• s² refers to the variance of a sample.
• p refers to the proportion of sample elements that have a particular attribute.
• q refers to the proportion of sample elements that do not have a particular
attribute, so q = 1 - p.
• r is the sample correlation coefficient, based on all of the elements from a sample.
• n is the number of elements in a sample.
Data Analysis:
Summation/ Sigma Notation
Summation Notation is shorthand that relies on the Greek alphabet and mathematical
symbols to indicate how to process values: aka formulae.
• Σ = summation
• X = Variable
What does each of these mean?
• ΣX
o Add up the values of X
• ΣX + 2 versus Σ(X + 2)
o Add up the values of X and add 2 to the Sum,
o Add 2 to each value of X and then Sum the values
• ΣX² versus (ΣX)²
o Square each value of X and then Sum
o Sum the values of X and then Square the Sum
• Σ(X + 2)² versus Σ(X² + 2)
o Add 2 to each value of X, square the value, then Sum the squared values
o Square each value of X, add 2 to the value, then Sum the values
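The order-of-operations distinctions above can be checked on a tiny data set. The sketch below is only an illustration: the values in X and all variable names are made up for the example.

```python
# Hypothetical scores, chosen only to make the arithmetic easy to follow.
X = [1, 2, 3]

sum_x = sum(X)                                # ΣX: add up the values of X
sum_x_then_add_2 = sum(X) + 2                 # ΣX + 2: sum first, then add 2
sum_of_x_plus_2 = sum(x + 2 for x in X)       # Σ(X + 2): add 2 to each value, then sum
sum_of_squares = sum(x ** 2 for x in X)       # ΣX²: square each value, then sum
square_of_sum = sum(X) ** 2                   # (ΣX)²: sum first, then square
sum_of_shifted_squares = sum((x + 2) ** 2 for x in X)   # Σ(X + 2)²
sum_of_squares_plus_2 = sum(x ** 2 + 2 for x in X)      # Σ(X² + 2)

print(sum_x, sum_x_then_add_2, sum_of_x_plus_2)
print(sum_of_squares, square_of_sum)
print(sum_of_shifted_squares, sum_of_squares_plus_2)
```

Each pair gives a different result for the same data, which is why the placement of parentheses matters.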
Data Analysis:
Summation/ Sigma Notation
Σ : summation
X : Independent Variable, typically
Y: Dependent Variable, typically
N= Size of the Population
n= Size of the Sample
≤ ≥ ≠ = : Equalities or Inequalities
± × ÷ + - : Mathematical Operators
α: alpha, refers to constant/ intercept
μ: mu, population mean
β: beta coefficient/ standardized
σ: sigma, population standard deviation
σ²: sigma squared, population variance
Data Analysis:
Inferential Statistics & Types of Tests
Data Analysis:
What does “Statistical Significance” mean?

Data Analysis:
Frequency Distributions
Data Analysis:
• After collecting data, the first task for a researcher is to
organize, summarize, condense and simplify the data for a
general overview of the results.
• Frequency Distributions are the conventional method to
organize, summarize, condense and simplify the data.
Data Analysis:
• A Frequency Distribution consists of at least two columns:
1. one listing categories on the scale of measurement (X), and
2. another for frequency (f).
• In the X column, list the values from the highest to lowest: do
not omit any of the values.
• The frequency column contains the tallies for each value X:
how often each X value occurs in the data set.
o These tallies are the frequencies for each X value.
• The sum of the frequencies should equal N.

Data Analysis:
• A third column can be used for the proportion (p) for each
category: p = f/N.
o The sum of the p column should equal 1.00.
• A fourth column is often included to display the percentage of
the distribution corresponding to each X value.
o The percentage is found by multiplying p by 100.
o The sum of the percentage column is 100%.
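The four columns described above (X, f, p, and percentage) can be built in a few lines. This is a minimal sketch that assumes integer scores; the function name and the quiz scores are hypothetical.

```python
from collections import Counter

def frequency_table(scores):
    """Return rows (X, f, p, percent), listed from the highest X to the lowest."""
    n = len(scores)
    counts = Counter(scores)
    rows = []
    # Walk every integer value from max down to min so no X value is omitted,
    # even when its frequency is 0.
    for x in range(max(scores), min(scores) - 1, -1):
        f = counts.get(x, 0)
        p = f / n                      # proportion: p = f/N
        rows.append((x, f, p, 100 * p))
    return rows

quiz = [8, 9, 8, 7, 10, 9, 6, 4, 9, 8]   # made-up quiz scores
table = frequency_table(quiz)
for x, f, p, pct in table:
    print(f"{x:>2}  {f:>2}  {p:.2f}  {pct:5.1f}%")
```

Summing the f column should give N, the p column 1.00, and the percentage column 100%, which is a quick check that no score was dropped.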
Data Analysis:
• Regular or Normal Frequency Distribution
o All of the individual categories (X values) are listed
• When a set of scores covers a wide range of values, a list of all
the X values would be quite long: too long to be a “simple”
presentation of the data.
o In a situation with many and diverse X values, a Grouped Frequency
Distribution is used.
Data Analysis:
• Grouped Frequency Distribution: the X column lists groups of scores, called Class
Intervals, rather than individual values.
• Class Intervals all have the same width: typically, a simple number such as 2, 5, 10,
and so on.
• Each Class Interval begins with a value that is a multiple of the Interval Width.
o The Interval Width is selected so that the distribution will have approximately 10 intervals.
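Grouping scores into Class Intervals can be sketched as follows. The sketch assumes numeric scores and a positive width, and it starts the lowest interval at a multiple of the width, as described above; the function name and the scores are invented for the example.

```python
def grouped_frequencies(scores, width):
    """Count scores per class interval [lower, lower + width)."""
    # The lowest interval begins at a multiple of the interval width.
    start = (min(scores) // width) * width
    counts = {}
    lower = start
    while lower <= max(scores):
        counts[lower] = 0              # keep empty intervals so none are omitted
        lower += width
    for s in scores:
        counts[(s // width) * width] += 1
    return counts

exam = [53, 67, 71, 58, 88, 94, 72, 75, 61, 79]   # made-up exam scores
grouped = grouped_frequencies(exam, width=10)
for lower, f in sorted(grouped.items(), reverse=True):
    print(f"{lower} to <{lower + 10}: {f}")
```

In practice the width would be chosen so that the table ends up with roughly 10 intervals; the width of 10 here is only for illustration.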
Data Analysis: Grouped Frequency Distribution
• Choosing a width of 15 for the Class Intervals produces the following Frequency
Distribution.

Class Interval    Frequency    Relative Frequency
100 to <115            2            0.025
115 to <130           10            0.127
130 to <145           21            0.266
145 to <160           15            0.190
160 to <175           15            0.190
175 to <190            8            0.101
190 to <205            3            0.038
205 to <220            1            0.013
220 to <235            2            0.025
235 to <250            2            0.025
Total                 79            1.000

• Age is typically displayed as a Grouped Frequency Distribution:
o For Example:
• 45 to 54 Years
• 55 to 64 Years
Copyright © 2005 Brooks/Cole, a
division of Thomson Learning, Inc.
• Today’s Computer Technology has automated descriptive reporting of data.

• The advent of the Data Warehouse has transformed data from national surveys
or surveillance systems into products with automated processing, routinized
reporting functionality, and visual graphic outputs.
Data Analysis: Graphing Frequency Distribution
• Graph of Frequency Distribution:
o The score categories (X values) are listed on the X axis and the
frequencies (how often each X value occurs) are listed on the Y axis.
o When the score categories are numerical scores measured at interval
or ratio level, the graph should be either a Histogram or a Polygon.
Data Analysis: Histograms
• In a Histogram, a bar/column is centered above each score (or
Class Interval) so that the height of the bar/column
corresponds to the frequency of the X values and the width of
the bar/column extends so that adjacent bars/columns touch
one another.
[Figure: Histogram of Scores]
You will probably never have to draw a Histogram by hand beyond a class
exercise.
Data Management and Analytical Software have automated reporting
routines.
Data Analysis: Histograms
Regular or Normal Frequency Distribution Table
[Figure: A frequency distribution histogram: the same set of quiz scores as a table and in a histogram.]
Also see the Age Distribution of Martians examples from the Sampling PowerPoint.
Grouped Frequency Distribution Table
[Figure: A frequency distribution histogram for grouped data: the same set of children's data as a table and in a histogram.]
Data Analysis: Polygons & Plots
• Polygon/ Plots: a dot or point is centered above each score so
that the height of the dot corresponds to the frequency.
o Then straight lines connect those dots/ points
o The graph is centered to a zero frequency by drawing additional lines
at each end
• These descriptions are a bit hard to visualize, but you see
histograms and plots all the time: visualizations of data
[Figure: Frequency Distribution Polygon: the same set of data as a table and in a polygon.]
[Figure: Frequency Distribution Polygon for Grouped Data: the same set of data as a grouped table and in a polygon.]
Data Analysis: Bar Graphs
• Bar Graphs are appropriate when the score categories (X
values) are measurements at nominal or ordinal level
• A Bar Graph is just like a Histogram except that there are gaps
or spaces between adjacent bars/ columns.
Personality Type Bar Graph

• A Bar Graph showing the distribution of
personality types in a sample of college
students.
• Because personality type is a discrete
variable measured on a nominal scale,
the graph is drawn with space between
the bars.
Data Analysis: Smooth Curve
• The conventional display of a distribution of interval or ratio level scores is
a Smooth Curve rather than a jagged Histogram or Polygon
• The Smooth Curve emphasizes the shape of the distribution: not the exact
frequency for each category
[Figure: The population distribution of IQ scores: an example of a Normal Distribution.]
Data Analysis:
Frequency Distributions, Graphs, Plots & Histograms
• Graphs, Plots & Histograms of Frequency Distributions are
useful because they show the entire set of scores.
o These infographics quickly allow you to see the highest score, the
lowest score, and where the scores are centered.
• These data visualizations also show how the scores are
clustered together or scattered apart.
• A graph shows the shape of the distribution.
o A distribution is Symmetrical if the left side of the graph is (roughly) a
mirror image of the right side.
• A familiar example of a Symmetrical Distribution is the bell-shaped
normal distribution: the bell curve.
• Distributions are skewed when scores pile up on one side of
the distribution, leaving a "tail" of a few extreme values on
the other side.
Data Analysis:
Positively & Negatively Skewed Distributions
• Positively Skewed: the scores tend to pile up on the left side of the
distribution with the tail tapering off to the right.
• Negatively Skewed: the scores tend to pile up on the right side and the
tail points to the left.
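One way to put a number on the direction of skew is the skewness coefficient, the mean of the cubed z-scores. This coefficient is not part of the slides; it is added here only as an illustration, and the function name and data are hypothetical. A positive value indicates a tail to the right, a negative value a tail to the left.

```python
def skewness(scores):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((x - mean) ** 2 for x in scores) / n) ** 0.5
    return sum(((x - mean) / sd) ** 3 for x in scores) / n

right_tail = [1, 1, 2, 2, 3, 10]         # scores pile up on the left, tail to the right
left_tail = [-x for x in right_tail]     # mirror image: tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tail))   # positive: Positively Skewed
print(skewness(left_tail))    # negative: Negatively Skewed
print(skewness(symmetric))    # approximately 0: Symmetrical
```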
Data Analysis: Percentiles, Percentile Ranks, &
Interpolation
• Percentiles and Percentile Ranks describe the relative location
of individual scores within a distribution: for example, the 90th
percentile of infant weight.
• The Percentile Rank for a particular X value is the percentage
of individuals with scores equal to or less than that X value.
• An X value that is identified by its Percentile Rank is called a Percentile.
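The definition of Percentile Rank translates directly into code. The sketch below uses made-up infant weights and a hypothetical function name; it does not cover interpolation between scores.

```python
def percentile_rank(scores, x):
    """Percentage of scores that are equal to or less than x."""
    at_or_below = sum(1 for s in scores if s <= x)
    return 100 * at_or_below / len(scores)

# Made-up infant weights in kilograms, for illustration only.
weights = [2.9, 3.1, 3.3, 3.4, 3.5, 3.6, 3.8, 4.0, 4.2, 4.6]

rank = percentile_rank(weights, 4.2)
print(rank)   # 9 of the 10 weights are at or below 4.2
```

In this sample, a weight of 4.2 has a Percentile Rank of 90%, so 4.2 is the 90th Percentile.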
Data Analysis:
X to z and z to X
• The basic z-score definition is usually sufficient to complete most z-score
transformations.
• However, the definition can be written in mathematical notation to create a
formula for computing the z-score for any value of X.
z = (X − μ) / σ
• In addition, the terms in the formula can be regrouped to create an
equation for computing the value of X corresponding to any specific z-score.
X = μ + zσ
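Both directions of the conversion can be written as one-line functions. This is a minimal sketch; the function names and the example values (μ = 100, σ = 10) are assumptions for illustration.

```python
def z_from_x(x, mu, sigma):
    """z = (X - mu) / sigma: distance from the mean in standard deviation units."""
    return (x - mu) / sigma

def x_from_z(z, mu, sigma):
    """X = mu + z * sigma: the raw score that corresponds to a given z-score."""
    return mu + z * sigma

mu, sigma = 100, 10                  # assumed population mean and standard deviation
print(z_from_x(120, mu, sigma))      # a score 20 points above the mean
print(x_from_z(-1.5, mu, sigma))     # the score 1.5 standard deviations below the mean
```

The two functions are inverses of each other, so converting X to z and back returns the original score.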
[Figure: The relationship between z-score values and locations in a population
distribution.]
An entire population of scores is transformed into z-scores. The transformation does
not change the shape of the population, but the mean is transformed into a value of 0
and the standard deviation is transformed to a value of 1.
Following a z-score transformation, the X-axis is relabeled in z-score units.
The distance that is equivalent to 1 standard deviation on the X-axis (σ = 10 points in
this example) corresponds to 1 point on the z-score scale.
Why are z-scores important? Because if you know the distribution of your
scores, you can test hypotheses and make predictions.
Data Analysis: Characteristics of z Scores
• Z scores tell you the number of standard deviation units a score is above or
below the mean
• The mean of the z score distribution = 0
• The SD of the z score distribution = 1
• The shape of the z score distribution will be exactly the same as the shape of
the original distribution
• Σz = 0
• Σz² = SS = N
• σ² = 1 = (Σz²/N)
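These characteristics can be verified numerically on a small population. The scores below are made up, and the population formulas (dividing by N) are assumed throughout.

```python
def z_scores(scores):
    """Transform every score in a population into a z-score."""
    n = len(scores)
    mu = sum(scores) / n
    sigma = (sum((x - mu) ** 2 for x in scores) / n) ** 0.5
    return [(x - mu) / sigma for x in scores]

population = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up scores with mean 5 and SD 2
z = z_scores(population)
N = len(z)

sum_z = sum(z)                        # Σz, expected to be 0
ss_z = sum(v ** 2 for v in z)         # Σz² = SS, expected to equal N
variance_z = ss_z / N                 # σ² of the z-scores, expected to be 1

print(z)
print(sum_z, ss_z, variance_z)
```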
Data Analysis:
Sources of Error in Probabilistic Reasoning
• The Power of the Particular
• Inability to Combine Probabilities
• Inverting Conditional Probabilities
• Failure to Utilize Sample Size Information
• The Gambler’s Fallacy
• Illusory Correlations & Confirmation Bias
• A Tendency to Try to Explain Random Events
• Misunderstanding Statistical Regression
• The Conjunction Fallacy
Data Analysis:
Characteristics of the Normal Distribution
• It is ALWAYS unimodal & symmetric
• The height of the curve is maximum at μ
• For every point on one side of mean, there is an exactly
corresponding point on the other side
• The curve drops as you move away from the mean
• Tails are asymptotic to zero
• The points of inflection always occur at one SD above and
below the mean.
Data Analysis:
The Distribution of Sample Means
• A distribution of the means from all possible samples of size n
• The larger the n, the less variability there will be
• The sample means will cluster around the population mean
• The distribution will be normal if the distribution of the population is normal
• Even if the population is not normally distributed, the distribution of sample means
will be approximately normal when n is large (n > 30 is a common rule of thumb)
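A small simulation can illustrate these points. It is only a sketch: drawing 1,000 random samples stands in for "all possible samples", and the skewed population, the seed, and the function name are all assumptions made for the example.

```python
import random
import statistics

random.seed(0)
# A clearly non-normal (right-skewed) population with a mean near 1.
population = [random.expovariate(1.0) for _ in range(20_000)]

def sample_means(n, draws=1_000):
    """Means of `draws` random samples of size n (sampled with replacement)."""
    return [statistics.fmean(random.choices(population, k=n)) for _ in range(draws)]

means_small = sample_means(5)
means_large = sample_means(50)

spread_small = statistics.pstdev(means_small)   # variability of means when n = 5
spread_large = statistics.pstdev(means_large)   # variability of means when n = 50

print(statistics.fmean(population), statistics.fmean(means_large))
print(spread_small, spread_large)
```

The sample means cluster around the population mean, and their spread shrinks as n grows. A histogram of `means_large` would look far more bell-shaped than the population itself.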
Data Analysis:
Properties of the Distribution of Sample Means
• The mean of the distribution = μ
• The standard deviation of the distribution = σ/√n
• The mean of the distribution of sample means is called the Expected Value of the
Mean
• The standard deviation of the distribution of sample means is called the Standard
Error of the Mean (σM)
• Z scores for sample means can be calculated just as we did for individual scores:
z = (M − μ) / σM
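The Standard Error of the Mean and the z-score for a sample mean can be computed as follows. This is a sketch with assumed example values (μ = 100, σ = 15, n = 25, M = 106) and hypothetical function names.

```python
import math

def standard_error(sigma, n):
    """Standard Error of the Mean: sigma_M = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def z_for_sample_mean(m, mu, sigma, n):
    """z = (M - mu) / sigma_M for a sample mean M."""
    return (m - mu) / standard_error(sigma, n)

sigma_m = standard_error(sigma=15, n=25)
z = z_for_sample_mean(m=106, mu=100, sigma=15, n=25)
print(sigma_m, z)
```

With these values the standard error is 3, so a sample mean of 106 lies 2 standard errors above the population mean.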
Data Analysis:
What is a Sampling Distribution?
• It is the distribution of a statistic from all possible samples of
size n
• If a statistic is unbiased, the mean of the sampling distribution
for that statistic will be equal to the population parameter that
the statistic estimates.
Data Analysis:
Re-Introduction to Hypothesis Testing
• We use a sample to estimate the likelihood that our hunch
about a population is correct.
• In an experiment, we see if the difference between the means
of our groups is so great that they would be unlikely to have
been drawn from the same population by chance.
Methodology:
Formulating Hypotheses
• The Null Hypothesis (H0)
o Differences between means are due only to chance fluctuation
• Alternative Hypotheses (Ha)
• Criteria for rejecting a null hypothesis
o Level of Significance (Alpha Level)
• Traditional levels are .05 or .01
o Region of distribution of sample means defined by alpha level is
known as the “critical region”
o No hypothesis is ever “proven”; we just fail to reject null
o When the null is retained, alternatives are also retained.
z ratio = Obtained Difference Between Means / Difference due to chance/error:
the basis for most of the hypothesis tests
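The z ratio and the critical region can be combined into a simple decision rule. The sketch below assumes a one-sample, two-tailed test with a known population σ; the slides do not specify either choice, and the function name and numbers are made up for illustration.

```python
from statistics import NormalDist

def z_test(m, mu, sigma, n, alpha=0.05):
    """Two-tailed one-sample z test. Returns (z, critical z, reject H0?)."""
    z = (m - mu) / (sigma / n ** 0.5)               # obtained difference / standard error
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # boundary of the critical region
    return z, z_crit, abs(z) > z_crit

z, z_crit, reject = z_test(m=106, mu=100, sigma=15, n=25, alpha=0.05)
print(z, round(z_crit, 2), reject)

_, z_crit_01, reject_01 = z_test(m=106, mu=100, sigma=15, n=25, alpha=0.01)
print(round(z_crit_01, 2), reject_01)
```

The same sample mean falls in the critical region at the .05 level but not at the .01 level, which shows how the alpha level sets the criterion for rejecting the null hypothesis.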
Data Analysis: Errors in Hypothesis Testing
• Type I Errors
o You reject a null hypothesis when you shouldn’t
o You conclude that you have an effect when you really do
not
o The alpha level determines the probability of a Type I
Error (hence, called an “alpha error”)
• Type II Errors
o Failure to reject a false null hypothesis
o Sometimes called a “Beta” Error.
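The link between the alpha level and Type I Errors can be illustrated by simulation: when the null hypothesis is true, the proportion of samples that lead to rejection should be close to alpha. The sketch assumes a two-tailed z test on normally distributed scores; the population values, seed, and function name are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def type_i_error_rate(alpha=0.05, n=25, trials=4_000, seed=1):
    """Proportion of true-null samples that are (wrongly) rejected."""
    rng = random.Random(seed)
    mu, sigma = 100, 15                      # H0 is true: every sample comes from here
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        z = (fmean(sample) - mu) / (sigma / n ** 0.5)
        if abs(z) > z_crit:                  # sample mean landed in the critical region
            rejections += 1
    return rejections / trials

rate = type_i_error_rate(alpha=0.05)
print(rate)   # expected to be close to alpha
```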
