Lectures - Test 2


Test 2 – Research Methodology and Descriptive Statistics

Units 13, 24, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22

Unit 13 - Visualizing and analyzing bivariate relationships


Contingency tables and scatterplots: BIVARIATE RELATIONSHIPS
Display the relationship between two variables using tables and graphs (to see whether they
are correlated).
A contingency table (cross table) enables you to display the relationship between two
ordinal or nominal variables. It is similar to a frequency table, but a frequency table concerns
only one variable.
A scatterplot enables you to display the relationship between two quantitative variables
(interval or ratio)

[Six example scatterplots, numbered 1-6:]
1 = strong positive linear relation
2 = moderately strong positive linear relation
3 = no relationship
4 = moderate negative linear relation
5 = no linear relation (curvilinear)
6 = no relationship

E = 63% - 47% = 16%

Is this difference significant?
E_crit = 20% when N > 50 and N < 100 → 16% < 20%, so not significant
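A minimal sketch of this epsilon check in Python (the percentages and the E_crit rule of thumb are the ones from the notes; the variable names are ours):

```python
# A quick check of the % difference (epsilon) rule from the notes.
pct_group_1 = 63.0   # column % in the category of interest, group 1
pct_group_2 = 47.0   # column % in the category of interest, group 2

epsilon = pct_group_1 - pct_group_2   # E = 63% - 47% = 16%
e_crit = 20.0                         # rule of thumb for 50 < N < 100

print(f"E = {epsilon:.0f}%")
print("significant" if abs(epsilon) >= e_crit else "not significant")
```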

Unit 24 - Describing the association between two variables

How to describe bivariate associations


1. Sign of the relationship
- Positive (+) ↑
- Negative (-) ↓
2. Strength: all measures of association (MOA) describe the relationship with a number:
- Between -1 and 1, if the sign is meaningful
- Between 0 and 1, if a sign is not meaningful (nominal variables)
3. Significance: looking at our sample results, how certain are we about the existence of
the association between X and Y in the population (or might it just be a consequence of
chance)?
MOA

Why so many different MOA?


- Measurement levels are different
- Range of the MOA: is a sign/direction of the association useful?
o No: Cramér's V: cross table with at least one dichotomous/nominal variable:
range 0 to 1
o Yes: Kendall's Tau-B and Tau-C; Pearson's r; Spearman's r: range -1 to +1
- Symmetric or asymmetric relationship?
o X → Y: a causal relationship (= asymmetric) → other MOA
o X - Y without a direction: the association is symmetric → no clear cause and
effect variables
▪ Cramér's V
▪ Kendall's Tau-B vs Tau-C
▪ Pearson's r
▪ Spearman's r (rank-order correlation, for non-normal distributions)
Cramér's V
Kendall's Tau
What is a perfect association?

Two ordinal but unequal variables (perfect association)

Spearman’s r
- Original scores of X and Y are sorted and ranked from low to high: rank xᵢ and rank yᵢ
- Calculate the difference in rank orders between X and Y: dᵢ
- Spearman's rho is 1 minus a term based on the squared rank-order differences:
ρ = 1 - 6·Σdᵢ² / (n·(n² - 1)) (in the slide example: -0.175)
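A small Python sketch of this ranking procedure, with made-up scores (the -0.175 on the slide comes from data not reproduced here):

```python
# Spearman's rho via ranks: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
from scipy.stats import rankdata, spearmanr

x = [12, 7, 19, 3, 15]               # made-up raw scores
y = [40, 30, 55, 35, 60]

rx, ry = rankdata(x), rankdata(y)    # rank the raw scores from low to high
d = rx - ry                          # rank-order differences d_i
n = len(x)
rho = 1 - 6 * sum(di**2 for di in d) / (n * (n**2 - 1))

print(round(rho, 2))                 # 0.8 for these scores
print(round(spearmanr(x, y)[0], 2))  # library check: same value (no ties)
```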

‘Measures of association’ refers to a wide variety of coefficients that measure (the direction
and) the strength of an association between two variables (bivariate) in a dataset. Most of
the coefficients can take values between -1 (perfect negative association) and +1 (perfect
positive association), with 0 meaning no relationship at all (values close to zero can be seen
as weak associations).
The number of coefficients that can be used to describe relationships between variables is
very large. The choice between these measures depends to a large extent on the level of
measurement of the variables that are being used. In this course we will only discuss %
difference E, Pearson’s r, Spearman’s rho, Kendall’s tau-b, Kendall’s tau-c and
Cramér’s V. In Table 1 you can see when to use these coefficients.
Table 1. When to use which coefficient (by measurement level of the two variables):
  dichotomous × dichotomous → % difference E (epsilon)
  nominal × nominal or ordinal → Cramér's V
  ordinal × ordinal (few categories) → Kendall's tau-b (square table) / tau-c (rectangular table)
  ordinal × ordinal or scale → Spearman's rho
  scale × scale → Pearson's r

Starting from the bottom right of Table 1, we encounter Pearson's r. Pearson's r can be used
to look at the association between two scale variables. It is a standardized measure of
strength for the linear relationship between two scale variables only. Note also that
Pearson's r is not robust (that means: Pearson's r is sensitive to extreme values).
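A minimal sketch with made-up data, illustrating both the coefficient and its sensitivity to a single extreme value:

```python
# Pearson's r for two scale variables; pearsonr returns (r, p-value).
from scipy.stats import pearsonr

hours = [2, 4, 6, 8, 10]                 # made-up scale data
grade = [5.0, 6.0, 6.5, 7.5, 8.0]

r, _ = pearsonr(hours, grade)
print(round(r, 2))                       # strong positive linear association

# one extreme case can change r noticeably (non-robustness):
r_out, _ = pearsonr(hours + [12], grade + [2.0])
print(round(r_out, 2))
```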
Spearman’s rho can be used as a more robust coefficient to look at the relationship
between two quantitative variables (ordinal or scale). Also, it can be used for consistently
increasing or decreasing non-linear associations. Raw scores are sorted from high to low
and replaced by the ranks of values. The highest value of a variable is given rank 1, the
second highest value is given rank 2, etcetera. Because of that, it can also be used to look at
the bivariate association between two ordinal variables.
Kendall’s tau can also be used as a measure of association for a consistently increasing or
decreasing relationship between two ordinal variables, but only when the number of
categories is relatively small so the relationship can be displayed in a contingency table.
Kendall's tau-b can be used for square tables (3x3, 4x4, for example), whereas Kendall's
tau-c can be used for rectangular tables (2x3, 3x4, etc.).
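A short sketch on made-up ordinal data; it assumes a SciPy version (1.7 or later) in which kendalltau accepts a variant argument:

```python
# Kendall's tau-b (square tables) vs tau-c (rectangular tables) on ordinal data.
from scipy.stats import kendalltau

satisfaction = [1, 1, 2, 2, 3, 3, 3, 2, 1, 3]   # made-up ordinal variable, 3 categories
loyalty      = [1, 2, 2, 3, 3, 3, 2, 2, 1, 3]   # made-up ordinal variable, 3 categories

tau_b, _ = kendalltau(satisfaction, loyalty, variant="b")
tau_c, _ = kendalltau(satisfaction, loyalty, variant="c")
print(round(tau_b, 2), round(tau_c, 2))
```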
To measure the association between two nominal variables or between a nominal and an
ordinal variable, Cramér's V can be used. Unlike the previous coefficients, Cramér's V
cannot take a value lower than 0, since the categories are not ordered and it therefore does
not make sense to talk about a positive or negative association.
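A minimal sketch computing Cramér's V from a made-up contingency table, using the standard chi-square-based formula V = √(χ² / (n·(min(r, c) - 1))):

```python
# Cramer's V from a contingency table: V = sqrt(chi2 / (n * (min(r, c) - 1))).
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10],    # made-up counts for two nominal variables
                  [15, 45]])

chi2 = chi2_contingency(table, correction=False)[0]   # plain chi-square statistic
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(round(v, 2))   # ~0.49: between 0 and 1, no sign (categories are unordered)
```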
In the case of two dichotomous variables, we use the % difference E (Epsilon). The counts of
the two dichotomous variables are shown in a square contingency table (2x2), the column
percentages for the independent variable are calculated, and the percentages are compared
horizontally and expressed as the % difference E.
Unit 12 - Causality and bivariate causal hypotheses
Video 1: Causal explanation
Three implications of general causal statements
1. Time order
2. Association
3. Non-spurious relationship
Three different, yet related questions
1. Why did you become an alcoholic? (The full story with reasons why someone became an
alcoholic; you can check the story. Examples: biographies, therapeutic sessions.)
2. Why did this person become an alcoholic? (When you can't ask the person themselves,
you can only apply existing knowledge to the specific case. Using causal
hypotheses to offer an explanation does not improve general knowledge about
causes.)
3. Why do people become alcoholics? (Thinking about possible causes, such as alcohol
acceptance. Explanation as formulating and testing a relationship between a cause
and a consequence. Researchers come up with general causes.)
Exogenous concept, cause, X-variable, independent variable, treatment:
level of alcohol acceptance in the family you grew up in
is related to
Endogenous concept, effect/consequence, Y-variable, dependent variable, observation:
amount of drinking / being an alcoholic
→ Causality in a graph
The relationship can be deterministic or probabilistic (the focus in social sciences is probabilistic)
Deterministic: if... then always
Probabilistic: if... then relatively more/less often
Three aspects of causality
1. X precedes Y in time (correct time order)
2. X and Y are correlated (association)
3. There is no third variable accounting for the association (non-spuriousness)
If X affects/influences Y → asymmetric relationship
Video 2: time order in causal relationships
The independent variable precedes the dependent variable
Examples:

- The effect of alcohol acceptance during childhood on current drinking behavior


- Effect of gender on behavior

Problems may occur when behavior and attitudes are measured at the same time:
behavior may change (reported) attitudes.
Example: Do you like him because you are dancing together, or are you dancing together
because you like him?
Measuring both variables at the same time may produce reverse causation.
Example: Does a happy childhood make you happier now?
How to check the time order: collect data at different points in time, e.g. with an interrupted
time series design.

Video 3: the effect of third variables: non-spurious relationship


Two effects

- Explanation/ confounding
- Specification/ interaction/ modification

Example of confounding: why are more babies born in some municipalities than in others?
(third variable: urbanization)

Example of interaction: why do some people spend more on holidays than others?
(third variable: willingness to go on holiday)

Relationships can be spurious or seriously biased (third-variable bias) because of
confounding or interaction.
Video 4: bivariate associations
Bivariate associations between variables with various levels of measurement.
Positive or negative causality is often called the SIGN of the relationship.
Probabilistic graph

→ Measurement error (it's difficult to have perfect measurement)
→ Parsimonious models: omitted variables (we leave out variables that we know affect the
consequence, because we want a simple picture of the world)
Dichotomous/ nominal and ordinal variables are not easily displayed in graphs.
Unit 15 – Research designs for testing causal hypotheses
Video 1: Research designs for testing bivariate causal relationships
Three groups of research design for testing causal relationships
1. Cross sectional research designs (correlational)
2. Interrupted time series
3. (Classical) experiments
Research design = the way of answering an explanatory (causal) research question in a
convincing way.
The logic, not the logistics, of answering such a question (= more about thinking than about
organizing your research).
Distinguish between

- Research design (example: experiment, cross sectional study, etc.)


- A data collection method (example: a survey, observation)
- The aim or context of the research (example: ex-post evaluation)
- A type of data (qualitative and quantitative)

Three aspects of causality

- Association
o Units for comparison (comparative research)
o A basis for comparison (variables)
If you only have one unit, you cannot compare, so you cannot test bivariate
causal relationships.
- Correct time order
- No spurious relationship

Testing a causal relationship


Example: the effect of the amount of studying on the grade
Independent variable: amount of studying
Dependent variable: grade
Cross sectional: a set of units; variables are measured at one moment in time.
Interrupted time series: studying the same units and variables over time. A type of
longitudinal research (time is included).

Interrupted time series normally have several pre-tests and post-tests; the variant with a
single pre-test and a single post-test is called a before-after study.
Experiment: an extension of the interrupted time series.

Experimental group and control group are identical, so there is no third-variable problem

                 Cross sectional   Interrupted time series   Experiment
Association      Check             Check                     Check
Time order       Can't check       Check                     Check
Non-spurious     Can't check       Can't check               Check

Experiments are not always feasible and require relatively simple hypotheses.
Cross sectional studies can sometimes exclude time order problems and can be used to
collect data on confounders.
Video 2: Validity in causal research
Four types of validity in the context of causal inference:

- Statistical conclusion validity
- Internal validity
→ these are about correct conclusions drawn in the study itself
- Measurement and sampling validity
- External validity
→ these are about generalization and inference to theory/populations/other cases
The four types only help you to find specific problems in causal research; they are not a
classification scheme of problems (warning).
Example: does comedy improve the outcomes of academic learning among students at this
university? → Relationship between fun and learning
Control group gets a normal lecture
Experimental group gets the same lecture, but funnier
Internal validity and statistical conclusion validity (STUDY)
Are the conclusions about the association and time order between the measured variables
correct? Is the relationship unaffected by a third variable (no spurious relationship)?
Statistical conclusion validity mainly refers to the correct handling of data and using the
appropriate statistical tests. It is mainly about correctly establishing associations between
variables.
Internal validity

- Is the time order between the measured variables correctly established (is the
relationship not reversed?)
- Is the relationship unaffected by a third variable (no spurious relationship)?

Measurement and sampling validity (THEORY)


Measuring the variables and units appropriately. Referring to the theoretical constructs
intended.
Examples:

- Is making some jokes in the lecture a good indication of comedy in teaching?
(measurement validity)
- Is studying only IBA students adequate? (sampling validity)

Sampling validity: about the relationship between the units we actually study and the units
we are interested in.
Measurement validity: about the extent to which making some jokes in a lecture actually
refers to comedy more generally.
External validity (most important): supposing that we found a relationship between an
independent and a dependent variable, can we say that this relationship also exists more
generally (in other circumstances)?
Video 3: The classical experiment
Classical experiment: a research design for testing bivariate causal relationships (also known
as a randomized experiment).
Classical experiments have:

- Two groups
- Randomization
- One pre-test of the dependent variable
- One post-test of the dependent variable
- One intervention/ treatment (independent variable)

Definition: a research design in which comparable groups of units are constructed using
random assignment (1), in which the treatment (the independent variable) is manipulated
differently across groups (2), to see whether the outcome (the dependent variable)
becomes different across groups (3).
Random assignment (1)
Start with a group of units and create an experimental group and control group using random
assignment. Experimental and control group are identical.
Time order: because of the pre-test, the observed similarity between the two groups, and
because we focus on the difference between the two groups after giving the treatment
→ we are sure the independent variable precedes the dependent variable: no reverse
causation.
Association: by treating one group and not the other, and by comparing the outcomes of the
control group and the experimental group
→ we establish an association.
Non-spuriousness: because of random assignment (checked by the pre-test), the only
difference between the control group and the experimental group is the treatment
→ we thus make sure that there is no effect of a third variable.

Video 4: Cross sectional research


Cross sectional research= a research design in which all variables of a set of units are
measured at the same time and none of the variables is manipulated differently for a sub-set
of units.
Does the amount of e-marketing increase sales?
Assessing causal research

- Measurement validity and reliability


- External validity
- Internal validity (time order, non-spuriousness)
- Statistical conclusion validity (correlation)

Main threats in cross sectional research can be found in internal validity of the study:

- Reverse causation cannot be ruled out: since data are collected at one moment in
time, reverse causation cannot be ruled out. Example: sales increases the budget for
e-marketing.
- Third variables may affect the relationship (non-spuriousness). Example: maybe the
presence of young and dynamic managers affects both sales and the e-marketing
budget.
Confounding is a problem in cross sectional research (we cannot easily exclude all
third variables we can think of). Taking into account the possible effect of third
variables may reduce the problem of spuriousness (controlling).
Evaluating cross sectional research
→ Weak internal validity because of reverse causation and the possible effect of third
variables
→ Potentially strong external validity (sampling)
→ The effect of many independent variables cannot be studied in other types of research
designs
Video 5: Interrupted time series
Interrupted time series = a research design in which a dependent variable of one group of
units is studied over time and in which at one point in time the group receives a treatment (a
change in the independent variable)
Example: effect of wearing safety belts on traffic safety of individuals

Assessing interrupted time series

- Measurement validity (and reliability)


o Is the actual treatment reflecting the theoretical construct?
- External validity
o It may work now, for this set of units and in this setting, but it may not work
another time/ case.
- Internal validity
o Time order (reversed causation)
o Non-spuriousness (effect of third variables)
- Statistical conclusion validity
o Correlation (when is the change big enough to argue the treatment worked)

Video 6: A notation of research designs


Introducing notation for research designs used to describe cross sectional research design,
interrupted time series, experiments and some extensions of these research designs
Interrupted time series
Observation/ Outcome/ Dependent variable = O

Treatment/ Intervention/ Independent variable= X

Notation: O O X O O
The positions of O’s and X’s indicate time order in the design

Correlational studies
Only variables
Notation: O

Classical experiment

Notation: R O X O
R O O

Notation
Groups are indicated on separate lines
R= group created by random assignment
N= comparison group not created by random assignment
Extensions of classical experiments
Are combined treatments more effective? (factorial design)
R O X1 O
R O X2 O
R O X1X2 O
R O O

What if the pre-test may have an effect on itself? (Solomon four-group design)
R O X1 O
R O O
R X1 O
R O

Does the effect persist over time?


→ More post-tests
R O X1 O O O O
R O O O O O

Does the effect persist after removing the treatment?


R O X1 O O X1 O O
R O O O O O

Second X1 = removal of the treatment

Does it matter when the treatment is given? (stepped wedge design)

Or what if you cannot withhold the treatment?
R O X1 O O O
R O O X1 O O
R O O O X1 O
Unit 14 – Causality and the effect of third variables
Video 1: The effect of third variables – thinking about trivariate hypotheses
Testing a bivariate hypothesis means checking:

- Time order of cause and effect
- Correlation/association (MOA: measures of association, e.g. % difference E)
- Effect of third variables
o Theorizing about the effect of third variables (elaboration models)
o Formulating a trivariate hypothesis
o Testing the trivariate hypothesis (using empirical data → are the three arrows
actually as expected?)
o Conclusion: accept the entire trivariate hypothesis vs. reject (= not all arrows
are as expected)
Independent variable → dependent variable
Third variable:

- Another independent variable affecting the dependent variable (addition)
- A confounder variable that affects both the independent and the dependent variable
(confounding)
- An intervening variable (interpretation, knowledge about consequences)
- A moderator (interaction)

Confounding and interpretation look somewhat similar, but there is a difference in time
order: if the independent variable precedes the third variable in time → interpretation; if the
independent variable is explained by the third variable → confounding.
Trivariate hypothesis
After the introduction of the third (test) variable (mention the third variable), the bivariate
relationship (mention the original bivariate hypothesis) disappears / changes / remains the
same, and the test variable is related to the other variables (outline the model).
Video 2: The effect of third variables – confounding

→ Urbanization is connected to both the number of storks and the number of babies.

→ Confounder variable: affects the independent and the dependent variable, but there is no
(causal) relationship between the independent and dependent variable.
Video 3: The effect of third variables – interpretation
Interpretation is the theoretical argumentation

→ There is no longer an additional relationship between study time and grade: the effect
goes fully from study time via understanding of the topic to grade
→ Intervening variable: independent and dependent variable are still related, but we now
better understand why there is a relationship (no additional direct relationship left)

Video 4: The effect of third variables - interaction/ modification

→ Modifier variable, interaction effect

Video 5: The effect of third variables – addition


Are there maybe other variables explaining the grade you got, in addition to the amount of
time spent on studying?

→ An extra variable explains the dependent variable

Summarizing the effect of third variables


1) Addition
→ TV has its own relation with Y, independent from X
→ TV is another independent variable → X2
2) Explanation (a.k.a. confounding)
→ TV is called ‘confounder’
→ TV causes both X and Y; TV is a common cause for X and Y
→ there is a spurious correlation between X and Y
→ TV precedes X in time!
3) Interpretation (a.k.a. mediation)
→ Why is there a relationship between X and Y? Because of an intervening variable TV
→ TV is called ‘mediator’ or ‘intervening variable’
→ The actual relationship is X → TV → Y, instead of simply X → Y
→ X precedes TV in time!
4) Specification (a.k.a. interaction or moderation)
→ TV is called ‘moderator’ or ‘modifying variable’
→ Depending on the value of TV, the relationship between X and Y differs
→ The original bivariate relationship can: remain the same / disappear / weaken / vary.

If TV precedes X (confounding branch):
- Remains the same: the model is called replication
- Disappears: TV is a confounder variable; the model is called confounding / full
explanation (MOA: Pearson's r = 0)
- Weakens: TV is a confounder variable; the model is called confounding / partial
explanation (MOA: Pearson's r ↓)
- Varies: TV is a moderator / interaction variable; the model is called interaction /
specification / modification

If X precedes TV (mediation branch):
- Remains the same: TV is another variable; the model is called addition
- Disappears: TV is an intervening variable; the model is called (full) interpretation
(MOA: Pearson's r = 0)
- Weakens: TV is an intervening variable; the model is called partial interpretation
(MOA: Pearson's r ↓)

Formulating the trivariate hypothesis

"After the introduction of ... TV ..., the bivariate (main) relationship between ... X and Y ...
disappears / increases / decreases / remains the same ...,
and the third variable TV is related to the other variables ... outline the elaboration
model ..."

Unit 16 - Elaboration: analyzing multi-variate relationships using tables


Video 1: Testing a trivariate hypothesis using a trivariate table

- Time order of cause and effect (X → Y, if not rejected)
- Correlation/association (MOA: measures of association, e.g. % difference E)
- Effect of third variables (non-spuriousness)
o Theorizing about the effect of third variables
o Formulating a trivariate hypothesis
o Testing the trivariate hypothesis
→ Constructing a trivariate table
In this video, confounding is used as an example to show the idea of elaboration, introducing
a trivariate table. The same approach also applies to interpretation, modification, addition,
etc.
→ No difference between a high and a low number of storks with regard to the number of
babies born: there is no longer a difference in this set of municipalities.
→ Exactly what we expected: in the municipalities that are not urbanized, there is no
relationship between the number of storks and the number of babies.

→ Also no difference.
→ Exactly what we expected: in the municipalities that are urbanized, we find no
relationship between the number of storks and the number of babies.
→ Second comparison: non-urbanized municipalities have far more storks than urbanized
municipalities.
Unit 17: Elaboration: analyzing multi-variate relationships using tables
Video 1: Interpreting a trivariate table
Example: using laughing gas causes concentration problems

62% - 32.6% = 29.4% difference (in the expected direction → the more you use, the more
problems)
Third variable? → Being a vegetarian (because of vitamin B12)
58.7% - 26.2% = 32.5% difference (non-vegetarians)

90% - 47.7% = 42.3% difference (vegetarians)

Specification effect → being a vegetarian causes more concentration problems when using
laughing gas.
→ Confirms the original trivariate hypothesis
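A tiny sketch of this comparison of conditional epsilons (the percentages are those above; which epsilon belongs to which group is inferred from the conclusion about vegetarians):

```python
# Compare the % difference E within each category of the third variable.
groups = {
    "all respondents": (62.0, 32.6),
    "non-vegetarians": (58.7, 26.2),
    "vegetarians":     (90.0, 47.7),
}
for label, (high_use_pct, low_use_pct) in groups.items():
    print(f"{label}: E = {high_use_pct - low_use_pct:.1f}%")
# E clearly differs between the TV categories (32.5% vs 42.3%), which points
# to specification/interaction rather than confounding.
```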

Video 2: Visualizing trivariate relationships


Depression affects academic performance (a negative relationship).

56.6% - 43.4% = 13.2% difference (in the expected direction: the more depressed, the more
often a lower achievement)
Third variable? → counselling
If counselling is available, it should reduce the effect of depression.

83.3% - 64.2% = 19.1% difference

32.5% - 20.4% = 12.1% difference

The difference is smaller (as we expected), but only by a small amount.

No real difference in the effect of depression on academic performance

→ Addition model instead of specification
→ Reject the original trivariate hypothesis
Order of variables in a contingency table for a trivariate hypothesis: dependent (Y),
independent (X), third variable (TV)

Understand what E_crit (the critical % difference E) is and what it does!!! (Outcome:
significant or not.) And then: is the hypothesis accepted/confirmed or rejected?
Unit 19: Sampling
Video 1: Sampling
Before sampling, you need a clear definition of the population of interest (target population)
→ Unit of analysis

When sampling
- If not all units mentioned in our research question can be studied, we need to sample.
- Studying a smaller set of units with the aim to say something about all units.
Sampling process
Especially the relationship between the sampling frame and the sample is called sampling.

Population  sampling frame  sample  interviewed sample  data


We can find distortions in the process.

Focus on distortions in sampling (the relationship between the sampling frame and the
sample)
Sampling procedures
Is the chance that a specific unit from the sampling frame is included in the study known?
→ No: non-probability sampling
→ Yes: probability sampling
Non-probability sampling
→ Used when you want to develop a new concept / the population is unknown / a sampling
frame is not available / you can study only a very small number of units.
- Convenience
- Purposive
- Systematic
- Snowball sampling (via the social network of respondents)
- Quota (purposive with a fixed final size)

Example: opt-in survey of some newspaper.

Selected units do not necessarily reflect the population; the sample is probably biased.
→ Not everyone has an equal chance to be included in the sample (e.g. if you do not read
the newspaper or are not interested)
→ Non-random: the sample is not representative of the population
→ No sampling frame available
→ Goal: inductive reasoning (explorative empirical research)
Non-probability sampling does not allow for generalizations to a larger population of units;
however, it enables the construction of variables (empirical conceptualization) to be used in
research, or the construction of theories (exploratory research).
Descriptive RQ: convenience sampling (people you know) and snowball sampling (ask
people if they know others) → outcome: a list of problems
Theory construction: a set including both normal and extreme cases (diverse cases) →
outcome would be hypotheses suggesting why cases are different
Typical cases may help to better understand the connection (to interpret the relationship)
Deviant cases may help to better understand additional factors
Beware of the pitfalls in the context of sampling. Since probability sampling has a very
strong footing in statistics, many researchers think that you should always use probability
sampling, but this is incorrect: inappropriately aiming for probability sampling in the context
of concept formation or theory construction just makes no sense in some cases. On the
other hand, using non-probability samples as if they were probability samples is also wrong.
Probability sampling

- Simple random sampling (SRS)
- Systematic random sampling
o E.g. you select every 10th person
- Stratified random sampling
o Based on the idea that we know a little bit about the sampling frame → split the
sampling frame into strata and sample within each group (stratum); see the
sketch after this list
- Cluster random sampling (for example a branch, or classes in a school)
o Interviewing all people within a cluster
- (Multi-stage) cluster sampling (combining two things: cluster random sampling, but not
everybody within the selected clusters)
o Interviewing a sample within some selected clusters
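A toy Python sketch contrasting simple random sampling with stratified sampling, assuming a made-up frame of 60 men and 40 women (names and sizes are ours):

```python
# Simple random vs stratified sampling from a small made-up frame.
import numpy as np

rng = np.random.default_rng(42)
frame = np.array(["M"] * 60 + ["F"] * 40)        # sampling frame, N = 100

# simple random sample: the gender split can drift by chance
srs = rng.choice(frame, size=10, replace=False)
print("SRS:", (srs == "M").sum(), "M /", (srs == "F").sum(), "F")

# stratified sample: fix the split to match the frame (6 M, 4 F)
men = rng.choice(np.where(frame == "M")[0], size=6, replace=False)
women = rng.choice(np.where(frame == "F")[0], size=4, replace=False)
print("Stratified:", len(men), "M /", len(women), "F")
```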

Example: simple random sample from the population registry.

Selected units reflect the population.
→ Everyone has a known chance to be included in the sample
→ Goal: your sample represents the whole (target) population well (external validity)
→ A sampling frame is necessary
Comparing simple/systematic and stratified sampling:
- Simple/systematic (especially when n is low) can still go wrong (too many men, for
example)
- Stratified can never be wrong within the strata (for example: with regard to gender), so
sampling errors are somewhat smaller
Comparing simple/systematic/stratified with cluster types of sampling:

- Cluster types of sampling increase the standard errors. Why?
o With simple/systematic/stratified sampling, units do not share context (they are
completely independent)
o With cluster types, you have to take shared context into account (units are not
really independent)
- Probability theory is based on selecting independent units

Assessing sampling
We always make mistakes when sampling
Two types of mistakes:

- Sampling bias (sampling invalidity): not being typical for the population. Studying the
wrong group of people.
- Sampling error (sampling unreliability): a consequence of sample size and
characteristics of the population.
Evaluating sampling procedures
Non-probability sampling:

- Bias
- (Sample size relatively unimportant)

Probability sampling:

- No bias
- Sample size affects sampling error

Video 2: Sample and population


A population is the entire group that you want to draw conclusions about. A sample is the
specific group that you will collect data from.

So you first calculate, for example, the mean of a sample and then use inferential statistics
to estimate that same mean, but in the population.
Probability sampling allows for statistical inference.
Video 3: Sampling distributions
Inferential statistics refers to methods used to draw conclusions about a population based on
data coming from a sample. A sample is a subset of a population.

→ Cluster sampling can be used if you don't have a good sampling frame or if a simple
random sample is really expensive

→ An advantage of stratified sampling is that you can make sure that you have enough
subjects from every stratum in your sample
Sample: bigger is better, but a bigger sample can never make up for a bad sampling
procedure. If your sample is not random, you can increase the sample size as much as you
want, but with a bad sampling procedure your sample will never be good. If your sample is
random, a bigger sample is technically better, but past a certain point an increase in sample
size only results in a very small increase in the precision of your estimate of the population
parameter.
Unit 20 - First steps towards inference: certainty about means
A sample distribution displays which values of a variable you have obtained after drawing a
sample of a given size from a population.

A sampling distribution displays the values of a statistic (mean, SD, variance) obtained by
repeatedly drawing samples of a given size from a population.

1st graph: population distribution
2nd graph: sample distribution
3rd graph: sampling distribution of the mean (when n = 2)
4th graph: sampling distribution of the mean (when n = 25)

Video 1: The sampling distribution


The sampling distribution is the link that helps researchers to draw conclusions about a
population on the basis of only one sample.
We'll pretend that we know what the population looks like (in reality you never know that,
but this step is necessary to understand inferential statistics later on).
If you draw a simple random sample from a population, it is very unlikely that the sample will
strongly differ from the population from which it is drawn.
Sampling distribution of the sample mean = an infinite number of samples → perfectly bell-
shaped, and the mean of that distribution is the mean in the population.
→ In actual research you never draw an infinite number of samples.
RANDOM SAMPLES
Sample distribution: for one single sample (the mean is the sample mean)
Sampling distribution of the mean: a lot of samples and a lot of sample means
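A simulation sketch of this idea, with a made-up (deliberately non-normal) population:

```python
# Draw many random samples from a known population and collect the means.
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=10, size=100_000)   # clearly non-normal

sample_means = [rng.choice(population, size=25).mean() for _ in range(5_000)]

print(round(population.mean(), 2))       # population mean mu
print(round(np.mean(sample_means), 2))   # close to mu: mean of the sampling distribution
```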
Video 2: The central limit theorem
Central limit theorem = the sampling distribution of the sample mean is approximately normal
(provided that n is sufficiently large), even if the variable of interest is not normally distributed
in the population. So no matter how a variable is distributed in the population, the
sampling distribution of the sample mean is always approximately normal as long as the
sample size is large enough (guideline for large enough: n > 30).
It is impossible to draw an infinite number of samples → but the distribution is normal → you
can describe its shape with just two parameters → the mean and the standard deviation.
Sampling distribution of the mean: a normal distribution (characterized by its mean and SD)
The mean of the sampling distribution: μ_x̄ = μ (the population mean)
The standard deviation of the sampling distribution (σ_x̄, also called the SEM, Standard
Error of the Mean) is σ_x̄ = σ / √n. It depends on:

- The sample size n
o The bigger n, the smaller σ_x̄
- The population parameter σ
o The bigger the population σ, the bigger σ_x̄

σ_x̄ = standard deviation of the sampling distribution
σ = standard deviation in the population
n = sample size

If the population standard deviation increases, the standard deviation of the sampling
distribution increases as well: the larger the variability in the population, the larger the
variability in the sample means. Also, if the sample size increases, the standard deviation of
the sampling distribution decreases (the larger your sample, the closer the sample means
will lie to the population mean, and the smaller the standard deviation of your sampling
distribution).
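A short simulation sketch checking σ_x̄ = σ / √n on a made-up population:

```python
# Empirical SD of sample means vs the theoretical SEM, sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(7)
population = rng.normal(loc=50, scale=12, size=200_000)   # made-up population
sigma = population.std()

n = 25
means = [rng.choice(population, size=n).mean() for _ in range(4_000)]
print(round(np.std(means), 2))        # empirical SD of the sample means
print(round(sigma / np.sqrt(n), 2))   # theoretical SEM: sigma / sqrt(n), ~2.4
```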
What do we observe about the sampling distribution?
Irrespective of the shape of the population distribution, the shape of the sampling distribution
is normal → central limit theorem (and the mean of the sampling distribution is almost equal
to the population mean).
n (sample size) is given, σ (population SD) is not given, so what do we do? → Given the
normal distribution of the sampling distribution → 95% of all sample means lie between
-1.96·σ_x̄ (about -2 SD) and +1.96·σ_x̄ (about +2 SD) away from the unknown μ
(population mean).

If you know σ but don't know μ → the means of 95% of the samples (of n = 5000) are
between -0.56 minutes and +0.56 minutes away from the (still unknown) population mean.
Caveat: the idea of working with the sampling distribution is based on knowing the
population σ when calculating σ_x̄. However, using σ is problematic because we don't
know σ. How do we solve that? We plug in the best estimate of that population standard
deviation, which is the observed standard deviation in the sample (s). Especially when the
sample size is bigger (say above 50), s and σ will be very similar. When n is smaller, we
definitely need to take into account that we are introducing some error → enter the
t-distribution.
SEM (standard error of the mean)

Video 3: first steps towards statistical inference


The distinction between descriptive RQs and explanatory RQs has nothing to do with the
distinction between descriptive statistics and inferential statistics.
The difference between descriptive statistics (using the data of a sample) and inferential
statistics (using the data of a sample to say something about the population).
Population parameters = characteristics of a population (standard deviation σ, mean μ,
written with Greek letters)
Sample statistics = standard deviation s and mean x̄, written in normal letters
If we assume there are no problems in the sampling frame (the sampling frame is a good
representation of the full population) and if we assume there are no problems in the data
collection procedure (you don't have a lot of non-response), then the only step we have to
carefully think about is the one between the population and the sample.
Random sample
To what extent can we now use the (one!) sample to say something about the population?
→ This process is called inference
→ You can't do statistical inference without carefully studying the extent to which the
assumptions are met
→ Statistical inference = by thinking precisely about what it means to draw a random sample,
and after making some assumptions, we can use simple statistics to say something
about population parameters

Video 4: confidence intervals


Normally we have only one sample from an unknown population.
We observe x̄ (144 minutes) → we can be pretty confident the population mean lies
between 144 - 0.56 and 144 + 0.56 → this is called a confidence interval.
If we have one confidence interval, we cannot simply say that there is a 95% probability that
μ lies in that interval: μ is in it or μ is not in it. What can you say about the population mean?
Suppose you construct 95% confidence intervals from a lot of random samples from the
same population: 95% of these intervals contain the parameter μ. In the picture you indeed
see that the population mean is in that 95% confidence interval.
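A minimal sketch of the interval above; σ = 20 is our assumption, chosen so that 1.96·σ/√5000 reproduces the ±0.56 margin from the notes (up to rounding):

```python
# 95% CI with known sigma: xbar +/- 1.96 * sigma / sqrt(n).
import math

xbar, sigma, n = 144.0, 20.0, 5000    # sigma = 20 is our assumption
sem = sigma / math.sqrt(n)            # standard error of the mean, ~0.28
margin = 1.96 * sem                   # ~0.55 (the notes round to 0.56)
print(f"95% CI: [{xbar - margin:.2f}, {xbar + margin:.2f}]")
```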
Unit 21 – First steps towards inference: effects and significance
Video 1 – Certainty about means/ the t-distribution
Sample standard deviations are (on average) smaller than the real population standard
deviation (especially in smaller samples) → so especially with smaller samples, we are less
certain about where the mean is when we use s.
This is where the t-distribution enters the scene: not one, but a set of distributions,
depending on the number of cases. It takes into account that we are less certain than had
we known the population SD. If n is big (> 50), it becomes very similar to the normal
distribution.
Because of this, we can also use the t-distribution in the context of large samples; we always
use the t-distribution instead of the normal distribution.

(Figure annotations: the t-distribution is wider "because we're less certain"; for larger n the
difference from the normal distribution is already small.)


Example: sample mean = 144, n = 25, SD of the sample = 20
SD or SE (standard error)

In other words, we find a sample mean. We know there is a 95% chance that the sample
mean lies between -2.06 SE and +2.06 SE away from the population mean (this number
comes from the t-distribution).
The SE in this example is 4 (20/√25). So there is a 95% chance that the sample mean lies
between -8.24 and +8.24 away from the population mean.
The sample mean is 144, so the population mean probably (95%) lies between 135.76 and
152.24 minutes. You want to estimate something about a population using information from
the sample; you want to use sample statistics to make an inference about parameters → this
is called the confidence interval for the mean.
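The same interval, sketched with SciPy's t-distribution (numbers from the example above):

```python
# t-based 95% CI: xbar +/- t_crit * s / sqrt(n), with df = n - 1.
import math
from scipy import stats

xbar, s, n = 144.0, 20.0, 25
se = s / math.sqrt(n)                   # SE = 4
t_crit = stats.t.ppf(0.975, df=n - 1)   # ~2.06 for df = 24
lo, hi = xbar - t_crit * se, xbar + t_crit * se
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")  # ~[135.74, 152.26]; the notes use t = 2.06 exactly
```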
Video 2: Certainty about slopes
The relationship in a population between the dependent and independent variable can be
described with a linear equation. The intercept is 40 and the end point is 73 when x = 100.
This is the population, the true relationship.

(Scatterplot from x = 0 to x = 100; the relationship is probabilistic, so there can be some
error around the line.)
Linear equation as a summary:
Y = 40 + 0.33 · X + error
Population:
Intercept = 40
Slope = 0.33
Because it's a population, we use Greek letters for the intercept and slope parameters: β0
and β1.
Causal research is about finding these slopes in the population. We want to know whether X
is indeed causing Y, and the association is one central part of studying causality.
One single sample (drawn from the population with intercept = 40 and slope = 0.33):

In this sample, we can estimate a linear equation.

Sample data:
Intercept = 37.40
Slope = 0.38
Different from the population: that is understandable, sample data deviate a bit from what is
going on in the population.
Focus on the slope:
β1 = population parameter
b1 = sample statistic

We draw a sample and in that sample we estimate b1. We can do that an infinite number of
times: every time we draw a sample, we calculate b1, we throw the sample back in the
population and we draw another sample. Using that idea, we can also construct a sampling
distribution of the slope. All the possible slopes in all these possible samples are then used
as an estimate of the slope parameter.
We have a population regression line (β), the relationship is probabilistic, and the regression
lines we find are in the graph below.
So these are the regression lines b found in all kinds of samples, and normally we only find
one of those, because normally we've only got one sample. Fortunately, we know a bit more
about this sampling distribution of the slope parameter.

The distribution is normal. If we know all kinds of standard deviations of the independent and
dependent variable, we can construct the normal distribution quite easily. In order to say
where the b's are in comparison with β, we need the standard deviation of the sampling
distribution of the slope.
Unfortunately these population parameters, these standard deviations in the population, are
unknown. The only thing we can do about that is use the sample standard deviations. So
the standard deviation of the sampling distribution depends on the population SDs of both x
and y, but these are unknown, so we use the standard deviations in the sample. By doing
this, we are introducing some extra error (extra uncertainty about the slopes), especially in
the context of smaller samples. Therefore we use the t-distribution instead of the normal
distribution (you don't have to calculate this by hand, but understand and interpret it).

We also know that, at least in somewhat bigger samples (not in the smallest samples), two
times the standard error away from the estimate gives us the confidence interval for the
intercept and the slope.
Let's focus on the slope: this means that the confidence interval for the slope is the point
estimate 0.38 plus or minus about 2 times the standard error. That's the range in which we
think that in most cases the population slope is.
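A sketch of this with made-up data that roughly follow the population line y = 40 + 0.33x (linregress reports the slope and its standard error):

```python
# Estimate the slope b1 and a rough 95% CI from one sample.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
x = rng.uniform(0, 100, size=50)
y = 40 + 0.33 * x + rng.normal(0, 8, size=50)   # population line plus error

res = linregress(x, y)                          # slope, intercept, stderr, ...
lo = res.slope - 2 * res.stderr                 # "about 2 standard errors" rule
hi = res.slope + 2 * res.stderr
print(f"b1 = {res.slope:.2f}, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```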
Why regression analysis?
Correlation expresses how tightly the data fit around an imaginary straight line through
the scatterplot.
The coefficient is a number between -1 and +1.

- Pearson's r: positive or negative relation? (< 0 is negative). The closer to zero, the
weaker the relation.
Correlation cannot make specific predictions; linear regression can.
Linear regression describes the relation mathematically through a regression equation:

- Test whether the equation is an accurate description of the relation in the population
- How good are our predictions?

Y = dependent variable = outcome = response variable: the variable we want to predict
X = independent variable = explanatory variable = predictor: can be used to predict the
response variable.
Unit 22 – Research ethics
The aim is to present a structure to classify ethical rules, dilemmas and conflicts when doing
empirical research.
Research ethics = systematizing, defending and recommending concepts of right and wrong
conduct in social science research → a normative area (we're distinguishing between wrong
and right conduct). However, the statements about what is good/wrong are often based on
empirical study of (un)ethical research behavior.

As a researcher you have to deal with units, principals, other researchers and society, and
you have to take into account that units, principals, other researchers and society have their
own interests.
I. Units of observation:
Principle A. No harm to research objects
Anonymity: researcher does not know who the units are (cannot reveal their names)
Confidentiality: the identity of people is known, but is hidden by the researcher.
Principle B. Informed consent
People need to be informed, using informed consent forms explaining the goal of the
research and asking the research objects to sign so they are aware of what the research is
about. However, (fully) informed consent is often impossible: suppose that you want to do a
behavioral experiment, then deception is needed, because otherwise people will act in
accordance with the hypothesis. The rule in that case is that you have to debrief people (you
have to tell them after the experiment what the true aims of the research were).
II. Principals:
- Quality research and not overstretching claims
- Sponsor non-disclosure (not revealing the sponsor's identity)
- Confidentiality of results
III. Other researchers
- Plagiarism: presenting another author's language, thoughts, ideas or expressions as one's
own original work → referencing
- Data fabrication (fabricating your own data is not allowed)
- The relationship with other researchers must be characterized by transparency &
replicability: you should be able to show which data you used: data storage, storage of all
files, giving access to all these files.
- The relationship is also characterized by peer review (researchers anonymously check
what other researchers are doing/writing). This is done for research proposals and concept
papers.
IV. Society at large
- Relevance of research
- Your research does not cause any harm to society
Ethical dilemmas occur prominently when the interests of the groups conflict:
- Units vs other researchers: no harm to units vs. transparency of research
- Principals vs society: interest of principals vs. relevance for society at large
- Units vs society: harm to units vs. relevance for society at large
- Principals vs other researchers: transparency of research vs. interest of the principal

The organization of research ethics

- Review boards (ethics committees)
- Disclosure of competing interests
- Professional standards
- Company (organization) standards
- Complaints procedures for violations
