Describing Data Pt.2


DATA ANALYSIS: METHODOLOGICAL BIG PICTURE
STATISTICAL ANALYSIS
Presentation Outline
• Review
o Structure of Research
o Dimensions of Research
o Research Process
o Operationalization and Levels of Measurement
o Study Designs
• Statistical Analysis
o Sampling Distributions and z-scores
o Hypothesis Test
o Estimation
• Study Design Considerations
• Points of Confusion
The Structure of Research: Deduction
The “Hourglass” Notion of Research
Begin with broad questions, then narrow down and focus in.
Operationalize
Observe
Analyze Data
Reach Conclusions
Generalize back to Questions
The Scientific Method
Problem/Question
Observation/Research
Formulate a Hypothesis
Experiment
Collect and Analyze Results
Conclusion
Communicate the Results
The Empirical Research Process:
Step 1 Identification of Area of Study: Problem Formulation
Step 2 Literature Review: Context
Step 3 Research Objectives to Hypotheses: Content to Methodology
• Concepts to Variables
Step 4 Study Design I: Data Collection Methods
• Research Design: experimental, quasi-experimental, or non-experimental
• Time & Unit of Analysis
Step 5 Procedures: Sampling, Assignment, Recruitment, & Ethics
Step 6 Collection: Instruments, Materials, & Management
Step 7 Study Design II: Analysis
• Statistical Approaches & Analytical Techniques
• Sample Size & Power
Step 8 Results: Dissemination
• Publication, Presentation, & New Application
[Slide margin labels: THEORY; DEDUCTION]
The Dimensions of Empirical Research:
A movement from the theoretical to the analytical
[Diagram: EMPIRICAL RESEARCH / SCIENTIFIC METHOD at the center. Labels: Theories; Deductive Reasoning; Analysis. Paired rows: Data Collection | Constructs; Hypotheses | Propositions; Variables | Concepts; Measurement | Postulates.]
Data Analysis:
In the Big Picture of Methodology
[Flowchart labels: Question to Answer; Hypothesis to Test; Theory; Study Design: Data Collection Method & Analysis; Collect Data: Measurements, Observations, Identification; Data Storage; Data Extraction; Descriptive Statistics: Describe Characteristics (Organize, Summarize, & Condense the Numbers); Decision: Statistics?; Inferential Statistics: Test Hypothesis, Conclusions, Interpretation, & Relationships; Causal Inference]
Note: Results of empirical scientific studies always begin with the Descriptive Statistics;
whether results conclude with Inferential Statistics depends on the Research
Objectives/Aims.
Data Analysis:
Types of Statistics

• Descriptive Statistics
o Summarization & Organization of variable values/scores
for the sample
• Inferential Statistics
o Inferences made from the Sample Statistic to the
Population Parameter.
o Able to Estimate Causation or make Causal Inference
• Isolate the effect of the Experimental (Independent) Variable
on the Outcome (Dependent) Variable
Data Analysis:
Descriptive Statistics
• Descriptive Statistics are procedures used for organizing and summarizing
scores in a sample so that the researchers can describe or communicate the
variables of interest.

• Note: Descriptive Statistics apply only to the sample: they say nothing about how
accurately the data may reflect the reality in the population.

• Using Sample Statistics to “infer” something about relationships in the entire
population assumes that the sample is representative of the population.

• Descriptive Statistics summarize 1 variable: aka Univariate Statistics

• Mean, Median, Mode, Range, Frequency Distribution, Variance and
Standard Deviation are the Descriptive Statistics: Univariates
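The univariate statistics listed above can be sketched with Python's standard library. This is only an illustration: the scores are hypothetical, and the sample formulas (dividing by n - 1) are assumed for the variance and standard deviation.

```python
import statistics

# Hypothetical sample of quiz scores (not from the slides)
scores = [2, 4, 4, 5, 7, 8]

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle value of the sorted scores
mode = statistics.mode(scores)            # most frequent value
value_range = max(scores) - min(scores)   # highest minus lowest score
variance = statistics.variance(scores)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)        # sample standard deviation

print(mean, median, mode, value_range, variance, round(std_dev, 3))
```

Use statistics.pvariance and statistics.pstdev instead when the scores are the entire population rather than a sample.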
Data Analysis:
Inferential Statistics
• Inferential Statistics are procedures designed to test the likelihood of finding the same
results from one sample with another sample drawn from the same population: in
fact, mathematically tests whether the sample results would be obtained if all possible
samples from the population were tested.

• Attempts to rule out chance as an explanation for the results: that results reflect real
relationships that exist in the population and are not just random or only by chance.

• Before you can describe or evaluate a relationship using statistics, you must design
your study so that your research question can be addressed.

• This is Methodology: where theory meets Data Collection
Methods & Data Analysis.
Data Analysis:
Statistics Notation
Capitalization
In general, capital letters refer to population attributes (i.e., parameters);
and lower-case letters refer to sample attributes (i.e., statistics).
For example,
• P refers to a population proportion;
o and p, to a sample proportion.
• X refers to a set of population elements;
o and x, to a set of sample elements.
• N refers to population size;
o and n, to sample size.
Greek vs. Roman Letters
• Like capital letters, Greek letters refer to population attributes.
• Their sample counterparts, however, are usually Roman letters.
For example,
• μ refers to a population mean;
o and x̄, to a sample mean.
• σ refers to the standard deviation of a population;
o and s, to the standard deviation of a sample.
Data Analysis:
Statistics Notation
Population Parameters
By convention, specific symbols represent certain population parameters.
Notation
• μ refers to a population mean.
• σ refers to the standard deviation of a population.
• σ² refers to the variance of a population.
• P refers to the proportion of population elements that have a particular attribute.
• Q refers to the proportion of population elements that do not have a particular
attribute, so Q = 1 - P.
• ρ is the population correlation coefficient, based on all of the elements from a
population.
• N is the number of elements in a population.
Sample Statistics
By convention, specific symbols represent certain sample statistics.
Notation
• x̄ refers to a sample mean.
• s refers to the standard deviation of a sample.
• s² refers to the variance of a sample.
• p refers to the proportion of sample elements that have a particular attribute.
• q refers to the proportion of sample elements that do not have a particular
attribute, so q = 1 - p.
• r is the sample correlation coefficient, based on all of the elements from a sample.
• n is the number of elements in a sample.
Data Analysis:
Summation/ Sigma Notation
Summation Notation is shorthand that relies on the Greek alphabet and mathematical
symbols to indicate how to process values: aka formulae.
• Σ = summation
• X = Variable
What does each of these mean?
• ΣX
o Add up the values of X
• ΣX + 2 versus Σ(X + 2)
o Add up the values of X and add 2 to the Sum
o Add 2 to each value of X and then Sum the values
• ΣX² versus (ΣX)²
o Square each value of X and then Sum
o Sum the values of X and then Square the Sum
• Σ(X + 2)² versus Σ(X² + 2)
o Add 2 to each value of X, square the value, then Sum the squared values
o Square each value of X, add 2 to the value, then Sum the values
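The order of operations in each summation expression is easier to see with numbers. A minimal sketch, assuming a hypothetical set of three scores; the variable names are illustrative only.

```python
# Hypothetical scores (not from the slides)
X = [1, 2, 3]

sum_x = sum(X)                                   # ΣX: add up the values of X
sum_x_then_add_2 = sum(X) + 2                    # ΣX + 2: sum first, then add 2 once
sum_of_x_plus_2 = sum(x + 2 for x in X)          # Σ(X + 2): add 2 to each value, then sum
sum_of_squares = sum(x ** 2 for x in X)          # ΣX²: square each value, then sum
square_of_sum = sum(X) ** 2                      # (ΣX)²: sum first, then square the sum
sum_of_sq_x_plus_2 = sum((x + 2) ** 2 for x in X)  # Σ(X + 2)²
sum_of_x_sq_plus_2 = sum(x ** 2 + 2 for x in X)    # Σ(X² + 2)

print(sum_x, sum_x_then_add_2, sum_of_x_plus_2)
print(sum_of_squares, square_of_sum)
print(sum_of_sq_x_plus_2, sum_of_x_sq_plus_2)
```

Each pair gives different totals for the same scores, which is why the placement of parentheses matters.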
Data Analysis:
Summation/ Sigma Notation

Σ: summation
X: Independent Variable, typically
Y: Dependent Variable, typically
N = Size of the Population
n = Size of the Sample
≤ ≥ ≠ = : Equalities or Inequalities
± × ÷ + - : Mathematical Operators
α: alpha, refers to constant/ intercept
μ: mu, population mean
β: beta coefficient/ standardized
σ: sigma, population standard deviation
σ²: sigma squared, population variance
Data Analysis:
Inferential Statistics & Types of Tests
Data Analysis:

What does “Statistical Significance” mean?


Data Analysis:
Frequency Distributions
• After collecting data, the first task for a researcher is to
organize, summarize, condense and simplify the data for a
general overview of the results.

• Frequency Distributions are the conventional method to
organize, summarize, condense and simplify the data
Data Analysis:
Frequency Distributions
• A Frequency Distribution consists of at least two columns:
1. one listing categories on the scale of measurement (X), and
2. another for frequency (f).

• In the X column, list the values from the highest to lowest: do
not omit any of the values.

• The frequency column contains the tallies for each value X:
how often each X value occurs in the data set.
o These tallies are the frequencies for each X value.

• The sum of the frequencies should equal N.


Data Analysis:
Frequency Distributions
• A third column can be used for the proportion (p) for each
category: p = f/N.
o The sum of the p column should equal 1.00.

• A fourth column is often included to display the percentage of
the distribution corresponding to each X value.
o The percentage is found by multiplying p by 100.
o The sum of the percentage column is 100%.
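The four columns described above (X, f, p, and percentage) can be built in a few lines. A minimal sketch with hypothetical scores; note that the value 3 never occurs but is still listed with f = 0, since no X value should be omitted.

```python
from collections import Counter

# Hypothetical scores on a 1-to-5 scale (not from the slides)
scores = [5, 5, 4, 4, 4, 2, 2, 1, 1, 1]

counts = Counter(scores)   # tallies: how often each X value occurs
N = len(scores)

rows = []
# List X values from highest to lowest, including values with zero frequency
for x in range(max(scores), min(scores) - 1, -1):
    f = counts[x]          # Counter returns 0 for values that never occur
    p = f / N              # proportion: p = f/N
    rows.append((x, f, p, p * 100))

print(" X   f     p      %")
for x, f, p, pct in rows:
    print(f"{x:>2} {f:>3} {p:>6.2f} {pct:>6.1f}")
```

The frequency column sums to N, the proportion column to 1.00, and the percentage column to 100%.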
Data Analysis:
Frequency Distributions
• Regular or Normal Frequency Distribution
o All of the individual categories (X values) are listed

• When a set of scores covers a wide range of values, a list of all
the X values would be quite long: too long to be a “simple”
presentation of the data.
o In a situation with many and diverse X values, a Grouped Frequency
Distribution is used.
Data Analysis:
Frequency Distributions
• Grouped Frequency Distribution: the X column lists groups of scores, called Class
Intervals, rather than individual values.

• Class Intervals all have the same width: typically, a simple number such as 2, 5, 10,
and so on.

• Each Class Interval begins with a value that is a multiple of the Interval Width.
o The Interval Width is selected so that the distribution will have approximately 10 intervals.
Data Analysis: Grouped Frequency Distribution
• Choosing a Class Interval width of 15 produces the following Frequency
Distribution.

Class Interval    Frequency    Relative Frequency
100 to <115            2             0.025
115 to <130           10             0.127
130 to <145           21             0.266
145 to <160           15             0.190
160 to <175           15             0.190
175 to <190            8             0.101
190 to <205            3             0.038
205 to <220            1             0.013
220 to <235            2             0.025
235 to <250            2             0.025
Total                 79             1.000
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.

• Age is typically displayed as a Grouped Frequency Distribution:
o For Example:
• 45 to 54 Years
• 55 to 64 Years
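Grouping scores into Class Intervals can be sketched as follows. This is an illustration under stated assumptions: the ages are hypothetical, the scores are integers, and each interval starts at a multiple of the Interval Width; grouped_frequency is an illustrative helper name, not a library function.

```python
from collections import Counter

def grouped_frequency(scores, width):
    """Return (lower, upper, f, relative f) rows for intervals [lower, upper)."""
    n = len(scores)
    # Map each integer score to the lower bound of its Class Interval
    counts = Counter((x // width) * width for x in scores)
    rows = []
    for lower in range(min(counts), max(counts) + width, width):
        f = counts[lower]  # 0 if no score falls in this interval
        rows.append((lower, lower + width, f, f / n))
    return rows

# Hypothetical ages in years (not from the slides)
ages = [45, 47, 52, 58, 61, 63, 64, 70, 71, 79]
rows = grouped_frequency(ages, width=10)

for lower, upper, f, rel in rows:
    print(f"{lower} to <{upper}: f = {f}, relative f = {rel:.3f}")
```

Empty intervals between the lowest and highest score are kept with f = 0, so the table has no gaps.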

• Today’s Computer Technology has automated descriptive reporting of data.


• The advent of the Data Warehouse has transformed data from national surveys
or surveillance systems into products with automated processing, routinized
reporting functionality, and visual graphic outputs.
Data Analysis: Graphing Frequency Distribution
• Graph of Frequency Distribution:

o The score categories (X values) are listed on the X axis and the
frequencies (how often each X value occurs) are listed on the Y axis.

o When the score categories are numerical scores measured at interval
or ratio level, the graph should be either a Histogram or a Polygon.
Data Analysis: Histograms

• In a Histogram, a bar/column is centered above each score (or
Class Interval) so that the height of the bar/column
corresponds to the frequency of the X values and the width of
the bar/column extends so that adjacent bars/columns touch
one another.
[Figure: Histogram of Scores]
You will probably never have to draw a Histogram by hand beyond a class
exercise.
Data Management and Analytical Software have automated reporting
routines.
Data Analysis: Regular or Normal Frequency Distribution
Histograms
[Figure: A frequency distribution histogram: the same set of quiz scores shown as a
table and in a histogram.]
Also see the Age Distribution of Martians examples from the Sampling PowerPoint.
Grouped Frequency Distribution
[Figure: A frequency distribution histogram for grouped data: the same set of
children's data shown as a table and in a histogram.]
Data Analysis: Polygons & Plots
• Polygon/ Plots: a dot or point is centered above each score so
that the height of the dot corresponds to the frequency.
o Then straight lines connect those dots/ points
o The graph is centered to a zero frequency by drawing additional lines
at each end
• These descriptions are a bit hard to visualize, but you see
histograms and plots all the time: visualizations of data
[Figure: Frequency Distribution Polygon: the same set of data shown as a table and
in a polygon.]
[Figure: Frequency Distribution Polygon for Grouped Data: the same set of data shown
as a grouped table and in a polygon.]
Data Analysis: Bar Graphs
• Bar Graphs are appropriate when the score categories (X
values) are measurements at nominal or ordinal level
• A Bar Graph is just like a Histogram except that there are gaps
or spaces between adjacent bars/ columns.

Personality Type Bar Graph


• A Bar Graph showing the distribution of
personality types in a sample of college
students.
• Because personality type is a discrete
variable measured on a nominal scale,
the graph is drawn with space between
the bars.
Data Analysis: Smooth Curve
• The conventional display of a distribution of interval or ratio level scores is
a Smooth Curve: not a jagged Histogram or Polygon

• The Smooth Curve emphasizes the shape of the distribution: not the exact
frequency for each category

[Figure: The population distribution of IQ scores: an example of a Normal Distribution.]
Data Analysis:
Frequency Distributions, Graphs, Plots & Histograms

• Graphs, Plots & Histograms of Frequency Distributions are
useful because they show the entire set of scores.
o These info-graphics quickly allow you to see the highest score, the
lowest score, and where the scores are centered.

• These data visualizations also show how the scores are
clustered together or scattered apart.
• A graph shows the shape of the distribution.
o A distribution is Symmetrical if the left side of the graph is (roughly) a
mirror image of the right side.
• A familiar example of a Symmetrical Distribution is the bell-
shaped normal distribution: the bell curve.

• Distributions are skewed when scores pile up on one side of
the distribution: leaving a "tail" of a few extreme values on
the other side
Data Analysis:
Positively & Negatively Skewed Distributions
• Positively Skewed: the scores tend to pile up on the left side of the
distribution with the tail tapering off to the right.

• Negatively Skewed: the scores tend to pile up on the right side and the
tail points to the left.
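The direction of skew can also be checked numerically. The sketch below uses the moment-based skewness coefficient, which is an assumption of this example and not something the slides define: its sign is positive when the tail points right and negative when the tail points left. The scores are hypothetical.

```python
def skewness(scores):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in scores) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical scores: piled up on the left with a tail to the right
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 9]
# Mirror image: piled up on the right with a tail to the left
left_tailed = [10 - x for x in right_tailed]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)  # positively skewed
print(skewness(left_tailed) < 0)   # negatively skewed
print(skewness(symmetric))         # zero for a symmetrical distribution
```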
Data Analysis: Percentiles, Percentile Ranks, &
Interpolation
• Percentiles and Percentile Ranks describe the relative location
of individual scores within a distribution: for example, the 90th
percentile of infant weight
• The Percentile Rank for a particular X value is the percentage
of individuals with scores equal to or less than that X value.
• An X value described by its rank is the Percentile.
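The Percentile Rank definition above translates directly into code. A minimal sketch with hypothetical scores; percentile_rank is an illustrative helper, and no interpolation between values is attempted.

```python
def percentile_rank(scores, x):
    """Percentage of individuals with scores equal to or less than x."""
    at_or_below = sum(1 for s in scores if s <= x)
    return 100 * at_or_below / len(scores)

# Hypothetical scores (not from the slides)
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

print(percentile_rank(scores, 3))  # 6 of the 10 scores are at or below 3
print(percentile_rank(scores, 5))  # every score is at or below 5
```

Here X = 3 has a Percentile Rank of 60%, so 3 is the 60th Percentile of this set of scores.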
Data Analysis:
X to z and z to X
• The basic z-score definition is usually sufficient to complete most z-score
transformations.

• However, the definition can be written in mathematical notation to create a
formula for computing the z-score for any value of X.

z = (X - μ) / σ

• In addition, the terms in the formula can be regrouped to create an
equation for computing the value of X corresponding to any specific z-score.

X = μ + zσ
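Both directions of the transformation can be sketched in code. The values μ = 100 and σ = 10 are hypothetical, and the function names are illustrative only.

```python
def x_to_z(x, mu, sigma):
    """z = (X - mu) / sigma: distance from the mean in standard deviation units."""
    return (x - mu) / sigma

def z_to_x(z, mu, sigma):
    """X = mu + z * sigma: the raw score located at a given z-score."""
    return mu + z * sigma

# Hypothetical population parameters
mu, sigma = 100, 10

z = x_to_z(120, mu, sigma)   # a score 20 points above the mean
x = z_to_x(-1.5, mu, sigma)  # the score 1.5 standard deviations below the mean
print(z, x)
```

Converting a score to z and back returns the original score, since the second formula is just the first one rearranged.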
[Figure: The relationship between z-score values and locations in a population
distribution.]

An entire population of scores is transformed into z-scores. The transformation does
not change the shape of the population, but the mean is transformed into a value of 0
and the standard deviation is transformed to a value of 1.
Following a z-score transformation, the X-axis is relabeled in z-score units.
The distance that is equivalent to 1 standard deviation on the X-axis (σ = 10 points
in this example) corresponds to 1 point on the z-score scale.

Why are z-scores important? Because if you know the distribution of your
scores, you can test hypotheses and make predictions.
Data Analysis: Characteristics of z Scores
• Z scores tell you the number of standard deviation units a score is above or
below the mean
• The mean of the z score distribution = 0
• The SD of the z score distribution = 1
• The shape of the z score distribution will be exactly the same as the shape of
the original distribution
• Σz = 0
• Σz² = SS = N
• σ² = 1 = (Σz²/N)
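These properties of a z-score distribution can be verified on a small data set. A sketch assuming a hypothetical population of eight scores, using the population formulas (dividing by N).

```python
import math

# Hypothetical population of scores (mean 5, standard deviation 2)
population = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(population)

mu = sum(population) / N
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / N)

# Transform every score into a z-score
z_scores = [(x - mu) / sigma for x in population]

sum_z = sum(z_scores)                    # sum of the z-scores
sum_z_sq = sum(z ** 2 for z in z_scores) # sum of squared z-scores (SS)
z_mean = sum_z / N
z_variance = sum_z_sq / N                # variance of the z-scores, since their mean is 0

print(sum_z, sum_z_sq, z_mean, z_variance)
```

The z-scores sum to 0, their squares sum to N, and their mean and variance come out as 0 and 1.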
Data Analysis:
Sources of Error in Probabilistic Reasoning

• The Power of the Particular


• Inability to Combine Probabilities
• Inverting Conditional Probabilities
• Failure to Utilize sample Size information
• The Gambler’s Fallacy
• Illusory Correlations & Confirmation Bias
• A Tendency to Try to Explain Random Events
• Misunderstanding Statistical Regression
• The Conjunction Fallacy
Data Analysis:
Characteristics of the Normal Distribution

• It is ALWAYS unimodal & symmetric


• The height of the curve is maximum at μ
• For every point on one side of mean, there is an exactly
corresponding point on the other side
• The curve drops as you move away from the mean
• Tails are asymptotic to zero
• The points of inflection always occur at one SD above and
below the mean.
Data Analysis:
The Distribution of Sample Means
• A distribution of the means from all possible samples of size n
• The larger the n, the less variability there will be
• The sample means will cluster around the population mean
• The distribution will be normal if the distribution of the population is normal
• Even if the population is not normally distributed, the distribution of sample means
will be approximately normal when n is large (a common rule of thumb is n > 30)
Data Analysis:
Properties of the Distribution of Sample Means
• The mean of the distribution = μ
• The standard deviation of the distribution = σ/√n
• The mean of the distribution of sample means is called the Expected Value of the
Mean
• The standard deviation of the distribution of sample means is called the Standard
Error of the Mean (σM)
• Z scores for sample means can be calculated just as we did for individual scores:
z = (M - μ) / σM
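For a very small population, the distribution of sample means can be enumerated in full. The sketch below assumes a hypothetical population of four scores and samples of size n = 2 drawn with replacement (the case in which σ/√n holds exactly); the sample mean M = 7 is also hypothetical.

```python
import itertools
import math
import statistics

# Hypothetical population (not from the slides)
population = [2, 4, 6, 8]
n = 2

mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation

# Every possible sample of size n, drawn with replacement, and its mean
sample_means = [statistics.fmean(sample)
                for sample in itertools.product(population, repeat=n)]

expected_value = statistics.fmean(sample_means)   # mean of the sample means
standard_error = statistics.pstdev(sample_means)  # SD of the sample means

# z-score for one hypothetical sample mean M
M = 7
z = (M - mu) / standard_error

print(expected_value, mu)
print(standard_error, sigma / math.sqrt(n))
print(z)
```

The Expected Value of the Mean matches μ, and the Standard Error of the Mean matches σ/√n.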
Data Analysis:
What is a Sampling Distribution?
• It is the distribution of a statistic from all possible samples of
size n
• If a statistic is unbiased, the mean of the sampling distribution
for that statistic will be equal to the corresponding population
parameter.
Data Analysis:
Re-Introduction to Hypothesis Testing
• We use a sample to estimate the likelihood that our hunch
about a population is correct.
• In an experiment, we see if the difference between the means
of our groups is so great that they would be unlikely to have
been drawn from the same population by chance.
Methodology:
Formulating Hypotheses
• The Null Hypothesis (H0)
o Differences between means are due only to chance fluctuation
• Alternative Hypotheses (Ha)
• Criteria for rejecting a null hypothesis
o Level of Significance (Alpha Level)
• Traditional levels are .05 or .01
o Region of distribution of sample means defined by alpha level is
known as the “critical region”
o No hypothesis is ever “proven”; we just fail to reject the null
o When the null is retained, alternatives are also retained.

z ratio = Obtained Difference Between Means / Difference due to chance/error:
the basis for most of the hypothesis tests
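The z ratio and the critical region can be sketched as a one-sample, two-tailed z-test. All numbers are hypothetical, a known population σ is assumed, and z_test is an illustrative helper rather than a library function.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test with a known population sigma."""
    standard_error = sigma / math.sqrt(n)              # difference expected by chance
    z = (sample_mean - mu) / standard_error            # obtained difference / chance
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)   # boundary of the critical region
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-tailed probability
    reject_null = abs(z) > z_critical
    return z, z_critical, p_value, reject_null

# Hypothetical study: null hypothesis mu = 100, sigma = 10, sample of n = 25 with M = 104
z, z_critical, p_value, reject_null = z_test(104, mu=100, sigma=10, n=25)
print(round(z, 2), round(z_critical, 2), round(p_value, 4), reject_null)
```

With these numbers the z ratio is 2.0, which falls just beyond the critical value of about 1.96 for alpha = .05, so the null hypothesis is rejected.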
Data Analysis: Errors in Hypothesis Testing
• Type I Errors
o You reject a null hypothesis when you shouldn’t
o You conclude that you have an effect when you really do
not
o The alpha level determines the probability of a Type I
Error (hence, called an “alpha error”)

• Type II Errors
o Failure to reject a false null hypothesis
o Sometimes called a “Beta” Error.
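The two error probabilities can be computed for a simple case. This sketch assumes a two-tailed one-sample z-test with a known σ and a hypothetical true population mean; error_rates is an illustrative helper, not a library function.

```python
import math
from statistics import NormalDist

def error_rates(true_mean, null_mean, sigma, n, alpha=0.05):
    """Return (Type I probability, Type II probability, power) for a two-tailed z-test."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    # True effect expressed in standard error units
    shift = (true_mean - null_mean) / (sigma / math.sqrt(n))
    # Beta: probability that the z ratio lands outside the critical region
    # even though the true mean differs from the null mean
    beta = std_normal.cdf(z_critical - shift) - std_normal.cdf(-z_critical - shift)
    return alpha, beta, 1 - beta

# Hypothetical case: the null says mu = 100, but the true mean is 104
alpha, beta, power = error_rates(true_mean=104, null_mean=100, sigma=10, n=25)
print(alpha, round(beta, 3), round(power, 3))
```

The Type I probability is set directly by the alpha level, while the Type II probability depends on the size of the true effect, σ, and n.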
