Lectures - Test 2
Unit 13, 24, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22
Six example scatterplots (panels 1-3 on the top row, 4-6 on the bottom row):
1 = strong positive linear relation
2 = moderately strong positive linear relation
3 = no relationship
4 = moderate negative linear relation
5 = no linear relation (curvilinear)
6 = no relationship
Spearman’s rho
- The original scores of X and Y are sorted and ranked from low to high: rank xi and rank yi
- Calculate the difference in rank order between X and Y: di
- Spearman’s rho is 1 minus six times the sum of the squared rank differences, divided by n(n² - 1): rho = 1 - 6Σdi² / (n(n² - 1)) (in the example: -0.175)
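A minimal Python sketch of this procedure with invented scores (so the rho below is not the -0.175 from the lecture example):

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Invented example scores (not the lecture data)
x = np.array([2, 7, 4, 9, 1, 6])
y = np.array([3, 8, 1, 7, 2, 9])

# Rank the raw scores from low to high
rank_x = rankdata(x)
rank_y = rankdata(y)

# Differences in rank order between X and Y
d = rank_x - rank_y
n = len(x)

# Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))  (valid when there are no ties)
rho_manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# The same coefficient via scipy
rho_scipy, p_value = spearmanr(x, y)
print(rho_manual, rho_scipy)
```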
‘Measures of association’ refers to a wide variety of coefficients that measure (the direction
and) the strength of an association between two variables (bi-variate) in a dataset. Most of
the coefficients can take values between -1 (perfect negative association) and +1 (perfect
positive association), with 0 meaning no relationship at all (values close to zero can be seen
as weak associations).
The number of coefficients that can be used to describe relationships between variables is
very large. The choice between these measures depends to a large extent on the level of
measurement of the variables that are being used. In this course we will only discuss %
difference E, Pearson’s r, Spearman’s rho, Kendall’s tau-b, Kendall’s tau-c and
Cramér’s V. In Table 1 you can see when to use these coefficients.
Table 1. Overview of when to use each coefficient, by level of measurement.
To start from the right bottom of Table 1 we encounter Pearson’s r. Pearson’s r can be used
to look at the association between two scale variables. It is a standardized measure of
strength for the linear relationship between two scale variables only. It should also be
noted that Pearson’s r is not robust (that means: Pearson’s r is sensitive to extreme
values).
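A small Python sketch with invented numbers, illustrating that Pearson's r is sensitive to extreme values: a single outlier can change the coefficient considerably.

```python
import numpy as np
from scipy.stats import pearsonr

# Invented scale data with a clear positive linear relation
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9, 9.1])

r_clean, _ = pearsonr(x, y)

# Add one extreme observation (an outlier) and recompute
x_out = np.append(x, 9.0)
y_out = np.append(y, -20.0)
r_outlier, _ = pearsonr(x_out, y_out)

print(f"r without outlier: {r_clean:.2f}")    # close to +1
print(f"r with one outlier: {r_outlier:.2f}") # much weaker
```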
Spearman’s rho can be used as a more robust coefficient to look at the relationship
between two quantitative variables (ordinal or scale). Also, it can be used for consistently
increasing or decreasing non-linear associations. Raw scores are sorted from high to low
and replaced by the ranks of values. The highest value of a variable is given rank 1, the
second highest value is given rank 2, etcetera. Because of that, it can also be used to look at
the bivariate association between two ordinal variables.
Kendall’s tau can also be used as a measure of association for a consistently increasing or
decreasing relationship between two ordinal variables, but only when the number of
categories is relatively small, so the relationship can be displayed in a contingency table.
Kendall’s tau-b can be used for square tables (3x3, 4x4, for example), whereas Kendall’s
tau-c can be used for rectangular tables (2x3, 3x4, etc.).
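A short illustrative sketch with invented ordinal scores; note that scipy.stats.kendalltau computes tau-b by default, which suits square tables such as 3x3:

```python
import numpy as np
from scipy.stats import kendalltau

# Invented ordinal scores (1 = low, 2 = middle, 3 = high) on two variables
education = np.array([1, 1, 2, 2, 2, 3, 3, 3, 1, 2])
income    = np.array([1, 2, 2, 2, 3, 2, 3, 3, 1, 1])

# Kendall's tau-b (the default variant)
tau_b, p_value = kendalltau(education, income)
print(f"tau-b = {tau_b:.2f}")
```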
To measure the association between two nominal variables, or between a nominal and an
ordinal variable, Cramér’s V can be used. Unlike the previous coefficients, Cramér’s V
cannot take a value lower than 0, since the categories are not ordered and it therefore does
not make sense to talk about a positive or negative association.
In case of two dichotomous variables, we use the % difference E (Epsilon). The counts of the
two dichotomous variables are shown in a square contingency table (2x2) and the column
percentages for the independent variable are calculated. The percentages are compared
horizontally and expressed as the % difference E.
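A minimal Python sketch with an invented 2x2 table, computing the column percentages, the % difference E, and, for comparison, Cramér's V from a chi-square value:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 2x2 table: columns = categories of the independent variable (X),
# rows = categories of the dependent variable (Y)
#                  X = no   X = yes
table = np.array([[30,       15],    # Y = no
                  [20,       35]])   # Y = yes

# Column percentages within each category of the independent variable
col_pct = table / table.sum(axis=0) * 100

# % difference E: compare one row of percentages horizontally
epsilon = col_pct[1, 1] - col_pct[1, 0]   # % (Y = yes | X = yes) minus % (Y = yes | X = no)
print(f"E = {epsilon:.1f} percentage points")

# Cramér's V for the same table: sqrt(chi2 / (n * (min(rows, cols) - 1)))
chi2, _, _, _ = chi2_contingency(table, correction=False)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"Cramér's V = {cramers_v:.2f}")
```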
Unit 12 - Causality and bivariate causal hypotheses
Video 1: Causal explanation
Three implications of general causal statements
1. Time order
2. Association
3. Non-spurious relationship
Three different, yet related questions
1. Why did you become an alcoholic? (full story with reasons why someone became an
alcoholic; you can check the story. Examples: biographies, therapeutic sessions).
2. Why did this person become an alcoholic? (when you can’t ask the person themselves. You
can only do this with existing knowledge applied to the specific case. Using causal
hypotheses to offer an explanation does not improve general knowledge about
causes).
3. Why do people become alcoholics? (thinking about possible causes such as alcohol
acceptance. Explanation as formulating and testing a relationship between a cause
and a consequence. Researchers come up with general causes).
Exogenous concept, cause, X-variable, independent variable, treatment
Level of alcohol acceptance in the family you grew up in
Related to
Endogenous concept, effect/ consequence, Y-variable, dependent variable,
observation
Amount of drinking/ being an alcoholic
Causality in a graph
The relationship can be deterministic or probabilistic (the focus in the social sciences is on probabilistic relationships)
Deterministic: if… then always
Probabilistic: if… then relatively more/ less often
Three aspects of causality
1. X precedes Y in time (correct time order)
2. X and Y are correlated (association)
3. There is no third variable accounting for the association (non-spuriousness)
If X affects/ influences Y, the relationship is asymmetric
Video 2: Time order in causal relationships
The independent variable precedes the dependent variable
Examples:
Problems may occur when behavior and attitudes are measured at the same time..
Behavior may change (reported) attitudes
Example: Do you like him because you are dancing together, or are you dancing together
because you like him?
Measuring both variables at the same time may produce reverse causation
Example: Does a happy childhood make you more happy now?
How to check the time order: collect data at different points in time, for example with an interrupted
time series design
- Explanation/ confounding
- Specification/ interaction/ modification
Example of confounding: why are more babies born in some municipalities than in others?
(third variable: how urbanized the municipality is)
Example of interaction: why do some people spend more on holidays than others?
(third variable: willingness to go on holiday)
- Association
o Units for comparison (comparative research)
o A basis for comparison (variables)
If you only have one unit, you cannot compare so you cannot test bivariate
causal relationships.
- Correct time order
- No spurious relationship
Interrupted time series normally have several pre-tests and several post-tests; the simpler
variant with one pre-test and one post-test is called a before-after study.
Experiment: an extension of the interrupted time series.
The experimental group and the control group are identical, so there is no third variable
Experiments are not always feasible and require relatively simple hypotheses.
Cross sectional studies can sometimes exclude time order problems and can be used to
collect data on confounders.
Video 2: Validity in causal research
Four types of validity in the context of causal inference:
Internal validity:
- Is the time order between the measured variables correctly established (is the
relationship not reversed)?
- Is the relationship unaffected by a third variable (no spurious relationship)?
Sampling validity: about the relationship between the units we actually study and the units we
are interested in
Measurement validity: about the extent to which what we actually measure or manipulate represents
the concept of interest (for example, whether making some jokes in a lecture really stands for
comedy more generally)
External validity (most important): suppose that we found a relationship between an
independent and a dependent variable, can we say that this relationship also exists more
generally (in other circumstances)?
Video 3: The classical experiment
Classical experiment: a research design for testing bivariate causal relationships (also known
as a randomized experiment).
Classical experiments have:
- Two groups
- Randomization
- One pre-test of the dependent variable
- One post-test of the dependent variable
- One intervention/ treatment (independent variable)
Definition: a research design in which comparable groups of units are constructed using
random assignment (1), in which the treatment (the independent variable) is manipulated
differently across groups (2), to see whether the outcome (the dependent variable)
becomes different across groups (3).
Random assignment (1)
Start with a group of units and create an experimental group and control group using random
assignment. Experimental and control group are identical.
Time order: because of the pre-test and the observed similarity between the two groups, and
because we focus on the difference between the two groups after giving the treatment,
we are sure the independent variable precedes the dependent variable: no reverse
causation.
Association: by treating one group and not the other, and by comparing the outcomes of the
control group and the experimental group,
we establish an association.
Non-spuriousness: because of random assignment (and checked by the pre-test), the only
difference between the control group and the experimental group is the treatment.
We thus make sure that there is no effect of a third variable.
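A small simulation sketch of this logic (all numbers invented): random assignment, a pre-test, a treatment for the experimental group only, and a comparison of post-test outcomes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented pool of 200 units with a baseline score on the dependent variable
n = 200
baseline = rng.normal(50, 10, size=n)

# Random assignment (R): split the pool into an experimental and a control group
order = rng.permutation(n)
exp_idx, ctrl_idx = order[: n // 2], order[n // 2 :]

# Pre-test (O): because of randomization the groups should look alike
print(baseline[exp_idx].mean(), baseline[ctrl_idx].mean())

# Treatment (X): only the experimental group receives it (assumed effect of +5)
effect = 5.0
post_exp = baseline[exp_idx] + effect + rng.normal(0, 3, size=len(exp_idx))
post_ctrl = baseline[ctrl_idx] + rng.normal(0, 3, size=len(ctrl_idx))

# Post-test (O): the difference between the groups estimates the treatment effect
print(post_exp.mean() - post_ctrl.mean())
```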
Cross-sectional research
The main threats in cross-sectional research can be found in the internal validity of the study:
- Reverse causation cannot be ruled out: since data are collected at one moment in
time, reverse causation cannot be ruled out. Example: sales increases the budget for
e-marketing.
- Third variables may affect the relationship (non-spuriousness). Example: maybe the
presence of young and dynamic managers affect both sales and the e-marketing
budget.
Confounding is a problem in cross-sectional research (we cannot easily exclude all
third variables we can think of). Taking into account the possible effect of third
variables may reduce the problems of spuriousness (controlling).
Evaluating cross sectional research
Weak internal validity because of reverse causation and the possible effect of third
variables
Potentially strong in external validity (sampling)
The effect of many independent variables cannot be studied in other types of research
designs
Video 5: Interrupted time series
Interrupted time series = a research design in which a dependent variable of one group of
units is studied over time and in which at one point in time the group receives a treatment (a
change in the independent variable)
Example: effect of wearing safety belts on traffic safety of individuals
Notation: O O X O O
The positions of O’s and X’s indicate time order in the design
Correlational studies
Only variables
Notation: O
Classical experiment
Notation: R O X O
R O O
Notation
Groups are indicated on separate lines
R= group created by random assignment
N= comparison group not created by random assignment
Extensions of classical experiments
Are combined treatments more effective? (factorial design)
R O X1 O
R O X2 O
R O X1X2 O
R O O
What if the pre-test itself may have an effect? (Solomon four-group design)
R O X1 O
R O O
R X1 O
R O
Confounding and interpretation look somewhat similar, but there is a difference in time
order: if the independent variable precedes the third variable in time, it is interpretation; if the
independent variable is explained by (comes after) the third variable, it is confounding.
Trivariate hypothesis
After the introduction of the third (test) variable (mention the third variable), the bivariate
relationship (mention the original bivariate hypothesis) disappears / changes / remains the
same, and the test variable is related to the other variables (outline the model).
Video 2: The effect of third variables – confounding
There is no longer an additional relationship between study time and grade; the effect
goes fully from study time to understanding of the topic to grade.
Intervening variable: the independent and dependent variable are still related, but we now
better understand why there is a relationship (no additional relationship left).
Also no difference.
Exactly what we expected: in the municipalities that are urbanized, we found no
relationship between the number of storks and the number of babies.
Second comparison: non-urbanized municipalities have far more storks than urbanized
municipalities.
Unit 17: Elaboration: analyzing multi-variate relationships using tables
Video 1: Interpreting a trivariate table
Example: using laughing gas causes concentration problems
62% - 32.6% = 29.4% difference (in the expected direction: the more you use, the more
problems)
Third variable? Vegetarian (because of vitamin B12)
58.7% - 26.2% = 32.5% difference
56.6% - 43.4% = 13.2% difference (in the expected direction: the more depressed, the more
often a lower achievement)
Third variable? counselling
If counselling was available that reduced the effect of depression
Understand what Ecrit (E% difference critical…) is and what it does!!! (Outcome: significant
or not) and then: is the hypothesis accepted/ confirmed or rejected?
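A hedged pandas sketch (invented data and variable names, so E here will be close to zero) of this kind of elaboration: compute the % difference E in the full table and then again within each category of a third variable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Invented individual-level data: use of laughing gas, concentration problems,
# and a dichotomous third variable
n = 400
df = pd.DataFrame({
    "use": rng.choice(["low", "high"], size=n),
    "problems": rng.choice(["no", "yes"], size=n),
    "third": rng.choice(["no", "yes"], size=n),
})

def epsilon(data):
    """% 'problems = yes' among high users minus among low users."""
    pct = pd.crosstab(data["problems"], data["use"], normalize="columns") * 100
    return pct.loc["yes", "high"] - pct.loc["yes", "low"]

# Bivariate % difference E (near 0 here, because the data are random)
print("overall E:", round(epsilon(df), 1))

# E within each category of the third variable (the partial tables)
for value, subset in df.groupby("third"):
    print(f"E given third = {value}:", round(epsilon(subset), 1))
```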
Unit 19: Sampling
Video 1: Sampling
Before sampling, you need a clear definition of the population of interest (target population)
Unit of analysis
When sampling
- If not all units mentioned in our research question can be studied, we need to sample.
- Studying a smaller set of units with the aim to say something about all units.
Sampling process
In particular, the step from the sampling frame to the sample is called sampling.
Focus on distortions in sampling (the relationship between the sampling frame and the
sample)
Sampling procedures
Is the chance that a specific unit from the sampling frame is included in the study, known?
No: Non-probability sampling
Yes: Probability sampling
Non-probability sampling
You want to develop a new concept/ the population is unknown/ sampling frame is not
available/ you can study only a very small number of units.
- Convenience
- Purposive
- Systematic
- Snowball sampling (via the social network of respondents)
- Quota (purposive with fixed final size)
Assessing sampling
We always make mistakes when sampling
Two types of mistakes:
- Sampling bias (sampling invalidity): not being typical for the population. Studying the
wrong group of people.
- Sampling error (sampling unreliability): a consequence of sample size and
characteristics of the population.
Evaluating sampling procedures
Non-probability sampling:
- Bias
- (Sample size relatively unimportant)
Probability sampling:
- No bias
- Sample size affects sampling error
So you first calculate, for example, the mean of a sample and then use inferential statistics to
say something about that mean in the population.
Probability sampling allows for statistical inference.
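A tiny sketch (with a hypothetical sampling frame) of drawing a simple random sample, the basic probability sampling procedure in which every unit has a known and equal chance of being selected:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical sampling frame: identifiers of 10,000 units in the target population
frame = np.arange(10_000)

# Simple random sample of n = 200 units, drawn without replacement
sample = rng.choice(frame, size=200, replace=False)

# Every unit had the same known inclusion probability: 200 / 10,000 = 0.02
print(len(sample), 200 / len(frame))
```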
Video 3: Sampling distributions
Inferential statistics refers to methods used to draw conclusions about a population based on
data coming from a sample. A sample is a subset of a population.
Other probability sampling procedures (such as cluster or stratified sampling) can be used if you
don’t have a good sampling frame or if a simple random sample is really expensive.
An advantage of sampling within strata is that you can make sure that you have enough subjects
from every stratum in your sample.
Sample size: bigger is better, but a bigger sample can never make up for a bad sampling procedure.
If your sample is not random, you can increase the sample size as much as you want, but with a
bad sampling procedure your sample will never be good. If your sample is random, a bigger
sample is technically better, but beyond a certain point an increase in your sample size only
results in a very small increase in the precision of your estimate of the population
parameter.
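A short numerical sketch of this diminishing-returns point: the standard error of the mean is σ/√n, so quadrupling the sample size only halves the standard error (σ = 15 is an invented value).

```python
import numpy as np

sigma = 15.0  # invented population standard deviation

# Standard error of the mean for increasing sample sizes: SE = sigma / sqrt(n)
for n in [25, 100, 400, 1600, 6400]:
    se = sigma / np.sqrt(n)
    print(f"n = {n:5d}  ->  SE of the mean = {se:.3f}")
```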
Unit 20 - First steps towards inference: certainty about means
A sample distribution displays which values of a variable you have obtained after drawing a
sample of a given size from a population.
A sampling distribution displays the values of a statistic (mean, SD, variance) obtained from
repeatedly drawing samples of a given size from a population.
If the population standard deviation increases, the standard deviation of the sampling distribution
increases as well: the larger the variability in the population, the larger the variability in the
sample means. Also, if the sample size increases, the standard deviation of the sampling
distribution decreases (the larger the size of your sample, the closer the sample means
will lie to the population mean, and the smaller the standard deviation of your sampling
distribution).
What do we observe about the sampling distribution?
Irrespective of the shape of the population distribution, the shape of the sampling distribution is
normal (central limit theorem), and the mean of the sampling distribution is almost equal
to the population mean.
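A small simulation sketch of the central limit theorem with an invented, clearly skewed population: the distribution of sample means comes out approximately normal and centered on the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented, strongly skewed population (exponential, so definitely not normal)
population = rng.exponential(scale=30.0, size=100_000)

# Repeatedly draw samples of size n and store the sample means
n, repetitions = 100, 5_000
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(repetitions)
])

print("population mean:            ", round(population.mean(), 2))
print("mean of the sample means:   ", round(sample_means.mean(), 2))
print("sd of the sampling distrib.:", round(sample_means.std(), 2))
print("predicted sigma / sqrt(n):  ", round(population.std() / np.sqrt(n), 2))
```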
n (sample size) is given and σ (the population SD) is known, but μ (the population mean) is not.
Given the normal shape of the sampling distribution, 95% of all sample means lie between
-1.96 · σxbar (about -2 SD) and +1.96 · σxbar (about +2 SD) away from the unknown μ (population
mean), where σxbar = σ/√n is the standard deviation of the sampling distribution.
If you know σ but don’t know μ: the means of 95% of the samples (of n = 5000) are
between -0.56 minutes and +0.56 minutes away from the (still unknown) population mean.
Caveat: the idea of working with the sampling distribution is based on knowing the population
σ when calculating σxbar. However, using σ is problematic because in practice we don’t
know σ. How do we solve that? We plug in the best estimate of the population standard
deviation, which is the observed standard deviation in the sample (s). Especially when
the sample size is bigger (say above 50), s and σ will be very similar. When n is smaller, we
definitely need to take into account that we are introducing some error; this is where the
t-distribution enters.
SEM (standard error of the mean) = s / √n
In other words: we find a sample mean, and we know there is a 95% chance that the sample mean
lies between -2.06 SE and +2.06 SE away from the population mean (this number comes from
the t-distribution).
The standard error (SE) in this example is 4. So in this example there is a 95% chance that the
sample mean lies between -8.24 and +8.24 away from the population mean.
The sample mean is 144, so the population mean probably (with 95% confidence) lies between 135.76 and
152.24 minutes. You want to estimate something about a population using information from
the sample; you want to use sample statistics to make an inference about parameters. This
is called the confidence interval for the mean.
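A minimal Python sketch of this calculation; the sample size n = 25 is an assumption here, chosen because t ≈ 2.06 corresponds to 24 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Values from the lecture example; n = 25 is an assumption (df = 24 gives t of about 2.06)
sample_mean = 144.0
se = 4.0            # standard error of the mean (s / sqrt(n))
n = 25

# Critical t-value for a 95% confidence interval
t_crit = stats.t.ppf(0.975, df=n - 1)

lower = sample_mean - t_crit * se
upper = sample_mean + t_crit * se
print(f"t = {t_crit:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")  # about [135.76, 152.24]
```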
Video 2: Certainty about slopes
The relationship in a population between the dependent and independent variable can be
described with a linear equation. The intercept is 40 and, when X = 100, the line reaches 73.
This is the population, the true relationship.
X=100
Linear equation as a summary
Y= 40 + 0.33 * X + error
Population
Intercept= 40
Slope= 0.33
Because it’s a population, we use Greek letters for the intercept and slope parameters: Beta0
and Beta1 (β0 and β1)
Causal research is about finding these slopes in the population. We want to know whether X
is indeed causing Y, and the association is one central part in studying causality.
One single sample with intercept=40 and slope= 0.33
We draw a sample, and in that sample we estimate b1. We can do that an
infinite number of times: every time we draw a sample, we calculate b, we throw the sample
back into the population and we draw another sample. Using that idea, we can also construct a
sampling distribution of the slope. All the possible slopes in all these possible samples are
then used as an estimate of the slope parameter.
We have a population regression line (slope beta); the relationship is
probabilistic, and the regression lines found in different samples scatter around it.
So these are the regression lines b found in all kinds of samples, and normally we only find
one of those because normally we’ve only got one sample. Fortunately, we know a bit more
about this sampling distribution of the slope parameter.
The distribution is normal. If we know all kinds of standard deviations of the independent and
dependent variable, we can construct the normal distribution quite easily. In order to say
where the b’s are in comparison with the beta, we need the standard deviation of the
sampling distribution of the slope.
Unfortunately these population parameters, these standard deviations in the population, are
unknown. The only thing we can do about that is to use the sample standard deviations. So
the standard deviation of the sampling distribution depends on the population SDs of both X and Y,
but these are unknown, so we use the standard deviations in the sample. By doing this, we
are introducing some extra error (extra uncertainty about the slopes), especially in the
context of smaller samples. Therefore we use the t-distribution instead of the normal
distribution (you don’t have to calculate this by hand, but understand and interpret it).
We also know that, at least in somewhat bigger samples (not in the smallest samples), roughly two
times the standard error away from the estimate gives us the confidence interval for the
intercept and the slope.
Let’s focus on the slope: this means that the confidence interval for the slope is the point
estimate 0.38 plus or minus about 2 times the standard error. So that’s the range in which we think,
in most cases, the population slope lies.
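A small Python sketch of this computation; the data are simulated from the population line used earlier (Y = 40 + 0.33 · X + error), so the estimates are illustrative rather than the lecture's numbers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulate one sample from the population line Y = 40 + 0.33 * X + error
n = 60
x = rng.uniform(0, 100, size=n)
y = 40 + 0.33 * x + rng.normal(0, 8, size=n)

# Estimate intercept (b0) and slope (b1) in the sample
result = stats.linregress(x, y)

# 95% confidence interval for the slope: b1 +/- t * SE(b1), with df = n - 2
t_crit = stats.t.ppf(0.975, df=n - 2)
lower = result.slope - t_crit * result.stderr
upper = result.slope + t_crit * result.stderr

print(f"b1 = {result.slope:.3f}, SE = {result.stderr:.3f}")
print(f"95% CI for the slope: [{lower:.3f}, {upper:.3f}]")
```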
Why regression analysis?
Correlation expresses how tightly the data fit around an imaginary straight line through
the scatterplot.
The coefficient is a number between -1 and +1.
- Pearson’s r: positive or negative relation? (<0 is negative). The closer to zero, the
weaker the relation.
Correlation cannot be used to make specific predictions; linear regression can.
Linear regression: describes the relation mathematically through a regression equation
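A tiny sketch of that difference, using the regression equation from the example above (Y = 40 + 0.33 · X) to make a specific prediction:

```python
# Regression equation from the example: Y = 40 + 0.33 * X
b0, b1 = 40.0, 0.33

def predict(x: float) -> float:
    """Predicted value of Y for a given X, using the regression equation."""
    return b0 + b1 * x

# A correlation coefficient only tells us the strength and direction of the relation;
# the regression equation gives a concrete predicted value:
print(predict(100))  # 73.0, the value mentioned in the notes
```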
Research ethics
As a researcher you have to deal with units, principals, other researchers and society. You
have to take into account that units, principals, other researchers and society all have their own
interests.
I. Units of observation:
Principle A. No harm to research objects
Anonymity: researcher does not know who the units are (cannot reveal their names)
Confidentiality: the identity of people is known, but is hidden by the researcher.
Principle B. Informed consent
People need to be informed, using informed consent forms explaining the goal of the research
and asking the research objects to sign, so they are aware of what the research is about.
However, (fully) informed consent is often impossible. Suppose that you want to do a
behavioral experiment; then deception is needed, because otherwise people will act in accordance with
the hypothesis. The rule in that case is that you have to debrief people (you have to tell them
after the experiment what the true aims of the research were).
II. Principals:
- Quality research and not overstretching claims
- Sponsor non-disclosure (not disclosing the identity of the sponsor)
- Confidentiality of results
III. Other researchers
- Plagiarism: presenting another author's language, thoughts, ideas or expressions as one’s
own original work; avoid this by referencing
- Data fabrication (fabricating your own data is not allowed)
- The relationship with other researchers must be characterized by transparency & replicability: you should be
able to show which data you used: data storage, storage of all files, giving access to all these
files.
- The relationship is also characterized by peer review (researchers anonymously check what
other researchers are doing/writing). This is done for research proposals and draft (concept) papers.
IV. Society at large
- Relevance of research
- Your research does not cause any harm to the society
Ethical dilemmas occur most prominently when the interests of these groups conflict.
- Unit vs other researchers: no harm to units vs. transparency of research
- Principals vs society: interest of principals vs relevance for society at large
- Units vs society: harm to units vs. relevance for society at large
- Principals vs other researchers: transparency of research vs interest of principal