
RELIABILITY

 Refers to the consistency of scores obtained by the same person when re-examined with the same test on different occasions, with different sets of equivalent items, or under other variable examining conditions.
 A greater number of items generally leads to higher reliability.
TRUE SCORE AND OBSERVED SCORE

 To measure the true score, a test would have to comprise all possible items (e.g. the spelling of every word in the dictionary to measure vocabulary).
 In practice, only a representative sample of items is used, which yields an observed score.
 The true score can never be measured directly. If the observed score and the true score differ only slightly, the measurement error is small.
CLASSICAL TEST THEORY

 Also known as true score theory; it assumes that each person has a true score, T, that would be obtained if there were no errors in measurement.
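
To make the idea concrete, here is a minimal Python sketch (using hypothetical simulated numbers, not data from any real test) of the classical decomposition of an observed score X into a true score T plus random error E; reliability can then be viewed as the share of observed-score variance that comes from true scores.

import numpy as np

# Classical test theory: observed score X = true score T + random error E.
rng = np.random.default_rng(0)
true_scores = rng.normal(loc=100, scale=15, size=1000)  # hypothetical true scores T
errors = rng.normal(loc=0, scale=5, size=1000)          # hypothetical measurement errors E
observed = true_scores + errors                         # observed scores X

# Reliability as the proportion of observed-score variance due to true scores.
reliability = true_scores.var() / observed.var()
print(f"Estimated reliability: {reliability:.2f}")      # close to 15**2 / (15**2 + 5**2) = 0.90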
PRIMARY METHODS IN
OBTAINING RELIABILITY
1. TEST-RETEST RELIABILITY

 It is established by comparing the scores obtained from two successive administrations of the same test to the same individuals and calculating the correlation between the two sets of scores.
 Used to measure enduring and stable traits (e.g. IQ).
 It is also known as TIME SAMPLING RELIABILITY since it measures the error associated with administering a test at two different times.
 It is not appropriate for tests involving reasoning or ingenuity, since solutions can be recalled at retest.
INTERVAL

 The time between the two administrations of the test; IDEALLY 6 MONTHS OR MORE.
 TEST-RETEST WITH A SHORT INTERVAL may be affected by CARRYOVER EFFECTS; it may measure MEMORY rather than the trait, inflating the correlation.
 TEST-RETEST WITH A LONG INTERVAL may be affected by other extraneous factors, lowering the correlation.
 CARRYOVER EFFECT: occurs when the first testing session influences the results of the second session; this can affect the test-retest reliability of a psychological measure and can result in an overestimation of reliability.
 PRACTICE EFFECT: a type of carryover effect wherein scores on the second administration are higher than on the first; the person improves because of the experience gained in the first testing.
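
As a rough illustration (with made-up scores, not real test data), test-retest reliability is simply the Pearson correlation between the two administrations; the same computation applies to parallel forms reliability in the next section, with Form A and Form B scores in place of the two occasions.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores of the same ten examinees on two occasions.
time1 = np.array([12, 15, 9, 20, 18, 14, 11, 17, 16, 13])
time2 = np.array([13, 14, 10, 19, 17, 15, 12, 18, 15, 12])

# Test-retest reliability is the correlation between the two sets of scores.
r, _ = pearsonr(time1, time2)
print(f"Test-retest reliability: r = {r:.2f}")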
2. PARALLEL FORMS/ITEM SAMPLING RELIABILITY

 It is established when at least two different versions of the test yield almost the same scores.
 Examples: (1) The Purdue Non-Language Test (PNLT) has Forms A and B, which yield nearly identical scores for the same test taker. (2) The SRA Verbal Form has parallel Forms A and B, and both yield almost identical scores for the same taker.
3. INTERNAL CONSISTENCY

 Used when the test is administered only once.
 Suggests that there is consistency among the items within the test.
 It is also known as INTER-ITEM RELIABILITY.
 This model of reliability measures the internal consistency of the test, which is the degree to which each test item measures the same construct. It is simply the intercorrelation among the items.
 Used to measure changing traits.
 The tests should contain the same number of items, and the items should be expressed in the same form and should cover the same type of content. The range and level of difficulty of the items should also be equal. Instructions, time limits, illustrative examples, format, and all other aspects of the test must likewise be checked for equivalence.
 If there is test leakage, use the form that has not been widely administered.
CRONBACH’S COEFFICIENT ALPHA

 Used when the two halves of the test have unequal variances.
 Provides the lowest estimate of reliability.
 Used for items that are not scored right or wrong, i.e. non-dichotomous items.
 Values typically range from 0 to 1.
 Used in personality tests and with multiple-scored items.
 A Cronbach's alpha of 0.99 implies that the test has highly redundant items; such items should be reviewed.
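
A minimal sketch of the standard alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), applied to a small item-by-examinee matrix; the helper function name and the numbers are invented for illustration.

import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) matrix of item scores."""
    k = item_scores.shape[1]                         # number of items
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical Likert-type responses: 5 examinees x 4 items.
scores = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [5, 5, 4, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")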
KUDER RICHARDSON 20 (KR20)
FORMULA

 The statistic used for calculating the reliability of a test in which the items are dichotomous (scored 0 or 1) and have varying levels of difficulty.
 If the items have a uniform level of difficulty, use KR-21.
 Requires only one testing session.
 Administered to 200+ respondents.
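
A sketch of KR-20 on a small, invented right/wrong response matrix; the formula parallels Cronbach's alpha, with the sum of p*q (the proportion passing each item times the proportion failing it) replacing the sum of item variances. KR-21 would instead use only the test mean and variance, under the assumption that all items are equally difficult.

import numpy as np

def kr20(item_scores: np.ndarray) -> float:
    """KR-20 for an (examinees x items) matrix of dichotomous (0/1) item scores."""
    k = item_scores.shape[1]
    p = item_scores.mean(axis=0)                     # proportion passing each item
    q = 1 - p                                        # proportion failing each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical right/wrong responses: 6 examinees x 5 items.
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
])
print(f"KR-20 = {kr20(responses):.2f}")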
SPLIT-HALF RELIABILITY

 It is obtained by splitting the items on a questionnaire or test in half, computing a separate score for each half, and then calculating the degree of consistency between the two half-scores for the group of participants.
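
A sketch with invented item scores: the test is split into odd- and even-numbered items, the two half-scores are correlated, and the Spearman-Brown formula (a standard correction, since each half is only half as long as the full test) estimates the full-length reliability.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical dichotomous item scores: 5 examinees x 8 items.
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
])

odd_half = items[:, 0::2].sum(axis=1)   # score on odd-numbered items
even_half = items[:, 1::2].sum(axis=1)  # score on even-numbered items

r_half, _ = pearsonr(odd_half, even_half)   # consistency between the two halves
r_full = (2 * r_half) / (1 + r_half)        # Spearman-Brown correction to full length
print(f"Half-test r = {r_half:.2f}; corrected full-test reliability = {r_full:.2f}")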
INTER-RATER OR INTER-OBSERVER
RELIABILITY

 It is the degree of agreement between two observers (judges or raters) who simultaneously record measurements of the same behaviors.
 A measure of reliability for creativity tests, projective tests, and individual tests.
 Example: Two psychologists observe the aggressive behavior of elementary school children. If their individual records of the construct are almost the same, then the measure has good inter-rater reliability.
 It uses the KAPPA STATISTIC to assess the level of agreement among raters on a nominal scale.
 COHEN'S KAPPA: used to determine the agreement between two raters.
 FLEISS' KAPPA: used to determine the agreement among 3 or more raters.
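
A minimal sketch of Cohen's kappa for the two-psychologist example above, using invented nominal ratings; kappa compares the observed agreement with the agreement expected by chance. Fleiss' kappa generalizes the same chance-corrected idea to three or more raters.

import numpy as np

# Hypothetical nominal ratings of ten children by two observers
# ("A" = aggressive episode recorded, "N" = none recorded).
rater1 = np.array(["A", "N", "A", "A", "N", "N", "A", "N", "A", "N"])
rater2 = np.array(["A", "N", "A", "N", "N", "N", "A", "N", "A", "A"])

p_observed = np.mean(rater1 == rater2)  # proportion of exact agreement
categories = np.union1d(rater1, rater2)
p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)

kappa = (p_observed - p_chance) / (1 - p_chance)  # Cohen's kappa
print(f"Cohen's kappa = {kappa:.2f}")             # 0.60 for these hypothetical ratings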
VALIDITY

Validity refers to the accuracy of inferences drawn from an assessment.

It is the degree to which the assessment measures what it is intended to measure.
CHARACTERISTICS OF A VALID TEST

Predicts future performance on appropriate variables
Measures an appropriate domain
Measures appropriate characteristics of test takers
FACE VALIDITY

• It is the simplest, least scientific, and least stringent form of validity.
• A test appears to measure what it is supposed to measure. This type of validity is concerned with whether a measure seems relevant and appropriate, on the surface, for what it is assessing.
INTERNAL AND EXTERNAL
VALIDITY
• INTERNAL VALIDITY refers to the degree of
confidence that the causal relationship
being tested is trustworthy and not
influenced by other factors or variables.
• EXTERNAL VALIDITY refers to the extent to
which results from a study can be applied
(generalized) to other situations, groups,
or events.
CRITERION VALIDITY

 How well a test correlates with an established standard of comparison called a criterion.
 Evaluates how accurately a test measures the outcome it was designed to measure. An outcome can be a disease, behavior, or performance.
Types of Criterion Validity
• Concurrent validity shows
you the extent of the
agreement between two
measures or assessments
taken at the same time.
• Predictive Validity - refers to
the ability of a test or other
measurement to predict a
future outcome.
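
Both types reduce to a validity coefficient: the correlation between test scores and the criterion. Below is a hypothetical sketch for predictive validity (the variables and numbers are invented); if the criterion were collected at the same time as the test, the same coefficient would be reported as concurrent validity.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical aptitude test scores and job-performance ratings gathered a year later.
test_scores = np.array([55, 62, 47, 70, 66, 58, 51, 73, 60, 49])
performance = np.array([3.1, 3.6, 2.8, 4.2, 3.9, 3.3, 2.9, 4.5, 3.4, 2.7])

validity, _ = pearsonr(test_scores, performance)  # predictive validity coefficient
print(f"Predictive validity: r = {validity:.2f}")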
CONTENT VALIDITY
 Evaluates how well an
instrument (like a test) covers
all relevant parts of the
construct it aims to measure.
Is the test fully representative
of what it aims to measure?
Does the test contain items
from the desired “content
domain”?
Types of Content Validity

 CONSTRUCT UNDERREPRESENTATION –
failure to capture important components
of a construct. Example: An English test
which only contains vocabulary items but
no grammar items.
 CONSTRUCT-IRRELEVANT VARIANCE - is
the introduction of extraneous,
uncontrolled variables that affect
assessment outcomes.
CONSTRUCT VALIDITY

 It assesses whether the data collected truly reflect the theoretical constructs and concepts being studied. It covers all types of validity.
 A test has GOOD CONSTRUCT VALIDITY if there is an EXISTING PSYCHOLOGICAL THEORY which can support what the test is measuring.
 Does the test interrelate with other tests as a measure of this construct should?
Types of Construct Validity

 CONVERGENT VALIDITY – refers to how closely a test is related to other tests that measure the same (or similar) constructs.
 DISCRIMINANT VALIDITY - refers to the extent to which a test is not related to other tests that measure different constructs. Example: You are researching extroversion as a personality trait among marketing students. To establish discriminant validity, you must also measure an unrelated construct, such as intelligence.
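
A sketch of the extroversion example with invented scores: the new scale should correlate strongly with an established extroversion measure (convergent validity) and only weakly with an unrelated construct such as intelligence (discriminant validity).

import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for ten marketing students.
new_extroversion = np.array([24, 31, 18, 27, 35, 22, 29, 33, 20, 26])    # new scale
old_extroversion = np.array([25, 30, 20, 26, 34, 23, 28, 31, 19, 27])    # established scale
intelligence = np.array([102, 95, 110, 99, 104, 97, 108, 101, 93, 106])  # unrelated construct

r_convergent, _ = pearsonr(new_extroversion, old_extroversion)  # expected to be high
r_discriminant, _ = pearsonr(new_extroversion, intelligence)    # expected to be near zero
print(f"Convergent r = {r_convergent:.2f}; discriminant r = {r_discriminant:.2f}")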
