MODULE 2 Handout
Prepared by:
ELIZABETH S. SUBA, Ph.D., RPsy, RPm, RGC
ANGELO R. DULLAS, MA Clinical Psych
Overview
This module introduces the Principles of Psychological Assessment and Psychological Testing: their definitions and basic concepts, with emphasis on the statistical foundation of modern psychometrics. By the end of the module, you are expected to be able to define these principles and explain their basic concepts. This chapter is outlined as follows:
1. Scales of Measurement
2. Statistical Interpretation of Test Scores (Raw and Derived Scores)
3. Measures of Central Tendency
4. Measures of Variability
5. Norms
5.1 Linear and Non-Linear Transformation
5.2 Types of Norms
6. Test Reliability
6.1 General Model of Reliability
6.2 Test-Retest
6.3 Alternate Form
6.4 Split-Half Reliability
6.5 Kuder-Richardson
6.6 Standard Error of Measurement
7. Test Validity
7.1 Content Validity
7.2 Criterion-Related Validity
7.3 Construct Validity
8. Item Analysis
9. Item Response Theory
I. Objectives:
Scales of Measurement
A measurement scale differentiates people from each other on any one variable.
Image: https://www.graphpad.com/support/faq/what-is-the-difference-between-ordinal-interval-and-ratio-variables-why-should-i-care/
SCALES OF MEASUREMENT

1. Nominal
Description: Numbers are used to classify and identify people or objects according to category labels.
Examples:
A. Gender can be categorized as “male” or “female”. We can choose to give all females a “score” of 1 and all males a score of 2.
B. We can administer an IQ test to a group of people and reclassify their scores as “below average”, “average”, or “above average”.
Limitations/Application: Nominal scales do not provide very precise information about individual differences and do not really quantify a test-taker’s performance; they indicate the presence or absence of a property, but not the extent or amount of that property.
Compare: IQ = 102, IQ = 108 vs. IQ = Average.
Note: when we transform scores to a nominal scale, our information becomes more general and less precise.

2. Ordinal
Description: We classify people or objects by ranking them on some dimension or in terms of the attribute being measured. An ordinal scale provides information about where group members fall relative to each other (e.g., 1st, 2nd, 3rd, . . .).
Limitations/Application: It does not indicate the precise extent by which the group members differ. Example: ranks simply tell us that one child is taller than another, but not exactly how much taller. Ordinal scales therefore do not provide the kind of individual-differences information that we want.

3. Interval
Description: We classify people or objects by ranking them on an equal-unit scale. We need to establish that a difference of 1 or 3 or 5 units is equivalent at any place along the scale. Example: height.
Application: Assume that three people, A, B, and C, receive scores of 65, 55, and 45, respectively, on a standardized test of anxiety. If this is an interval-level test, we can draw three conclusions: (1) Person A demonstrates a higher level of anxiety than Person B, who in turn is more anxious than Person C; (2) because the units are equal, the 10-point difference between A and B represents the same amount of anxiety as the 10-point difference between B and C; and (3) because the scale has no true zero point, we cannot say that Person A is a certain number of times as anxious as Person C.

4. Ratio
Description: An equal-unit scale that also has a true zero point, which permits ratio statements (e.g., a measurement of 60 units represents twice as much of the property as 30 units). Examples: height, weight, reaction time.
NOTE:
As we move from the nominal scale to the interval and ratio scales, we increase the precision of the measurement process. Interval and ratio scales, with their equal units, are most appropriate for comparing people and for the study of individual differences.
Types of Scores
1. Measures of Central Tendency- refer to a single value that describes the center of a distribution of test scores.
Mean- the arithmetic average of a set of test scores, obtained by summing all the scores and dividing by the number of scores.
Median- the middlemost score, or the score above and below which 50% of the scores fall. It is sometimes referred to as the 50th percentile, the 5th decile, and the second quartile.
Mode- the score that occurs most frequently in a set of test scores, or the score obtained by the greatest number of people. When test scores are grouped into intervals, the mode is the midpoint of the interval containing the largest number of scores.
2. Measures of Dispersion or Variation- refer to the extent of clustering about a central value, or the dispersion of scores around a given point. If all scores are close to the central value, their variation will be less than if they tend to depart more markedly from the central value.
Range- the simplest measure; the difference between the largest and smallest scores.
Variance- a measure of the total amount of variability in a set of test scores.
Standard deviation- the square root of the variance; the larger the standard deviation, the more widely scattered the scores.
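The measures defined above can be sketched with Python's standard `statistics` module; the seven test scores are made up for illustration.

```python
# Illustrative only: computing the central-tendency and dispersion
# measures defined above for a small, hypothetical set of test scores.
import statistics

scores = [45, 50, 55, 55, 60, 65, 70]

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middlemost score (50th percentile)
mode = statistics.mode(scores)            # most frequent score
score_range = max(scores) - min(scores)   # largest minus smallest score
variance = statistics.pvariance(scores)   # total variability (population)
sd = statistics.pstdev(scores)            # square root of the variance

print(median, mode, score_range)          # 55 55 25
```

Note that `pvariance`/`pstdev` treat the scores as the whole group; `variance`/`stdev` would be used if the scores were a sample from a larger population.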
Image: https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/
Negatively skewed- the larger frequencies are concentrated toward the high end of the scale and the smaller frequencies toward the low end: many high scores and few low scores. The median is larger than the mean.
Example: if a test is easy, the scores cluster at the high end of the scale and tail off toward the low end.
Positively skewed- the larger frequencies are concentrated toward the low end of the scale: many low scores and few high scores. The mean is larger than the median.
Example: if a test is difficult, the scores cluster at the low end of the scale and tail off toward the high end.
Normal curve- a distribution that is symmetrical and bell-shaped, with the larger frequencies clustered around the average. The mean, median, and mode coincide.
Image: https://www.researchgate.net/figure/Value-of-kurtosis-for-different-Gaussian-distribution-compared-with-normal-distribution_fig16_318491600
Criterion-Referenced Test
A test whose scores are interpreted by comparing them with a predetermined standard or criterion of performance, rather than with the scores of other examinees.
Example: academic performance where you need to score 90% or better on a test for a grade of 1.00, 80% or better for 1.50, and so forth. Professional licensing examinations are examples that include a mastery component.
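The grading rule above can be sketched as a simple lookup. Only the 90% → 1.00 and 80% → 1.50 cutoffs come from the handout; the grade returned below those cutoffs is a placeholder standing in for the handout's "and so forth".

```python
# A minimal sketch of criterion-referenced scoring using the cutoffs
# stated above (90% -> 1.00, 80% -> 1.50). The 3.00 fallback is an
# illustrative placeholder, not a cutoff from the handout.
def grade(percent_correct: float) -> float:
    if percent_correct >= 90:
        return 1.00
    if percent_correct >= 80:
        return 1.50
    return 3.00  # placeholder for the remaining ("and so forth") grades

print(grade(92))  # 1.0
print(grade(85))  # 1.5
```

The key point is that each grade depends only on the fixed criterion, never on how other examinees performed.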
Norm-Referenced Test
A test whose scores are interpreted by comparing them with the scores of the other individuals who have taken the test, often called the standardization sample or normative group.
Examples: IQ tests, aptitude tests
KINDS OF NORMS
Developmental Norms- indicate how far along the normal developmental path the individual has progressed (Anastasi & Urbina, 1997).
1. Age norms- An age equivalent is the median score on a test obtained by persons in the standardization sample of a given chronological age. The Mental Age score of an examinee corresponds to the chronological age of the subgroup in the standardization group whose median score is the same as the examinee's.
Percentile- scores expressed in terms of the percentage of persons in the standardization sample who fall below a given raw score; also called percentile rank.
Percentage scores, by contrast, are raw scores expressed in terms of the percentage of correct items; they should not be confused with percentiles.
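The percentile-rank definition above can be sketched directly: count how many scores in the standardization sample fall below the raw score. The ten sample scores are hypothetical.

```python
# Sketch: percentile rank as defined above -- the percentage of persons
# in a (hypothetical) standardization sample who fall below a raw score.
def percentile_rank(raw_score, sample_scores):
    below = sum(1 for s in sample_scores if s < raw_score)
    return 100.0 * below / len(sample_scores)

sample = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
print(percentile_rank(62, sample))  # 50.0 -> 62 exceeds 5 of the 10 scores
```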
Standard Scores
A standard score is a raw score that has been converted from one scale to another scale on which it can be interpreted more easily; it expresses the distance of the raw score from the mean in standard deviation units.
Linear transformation- the transformed scores retain the exact numerical relations of the original raw scores, because they are computed by subtracting a constant (the mean) from each raw score and then dividing the result by another constant (the standard deviation). Linearly derived standard scores are often designated simply as standard scores or “z scores”.
z Scores
- The z score is considered the base of the standard scores, since it is used for conversion to other types of standard scores.
- A z score is computed by subtracting the mean of the instrument from the client's raw score and dividing by the standard deviation of the instrument.
- Aside from providing an easy context for comparing scores on the same test, standard scores also provide an easy way to compare scores on different tests.
Example:
Suppose Marites' raw score on the Psychological Assessment test was 24 and her raw score on the Abnormal Psychology test was 42. Knowing nothing other than these raw scores, we might conclude that Marites did better on the Abnormal Psychology test than on the other test. But if the two raw scores are converted to z scores, the comparison becomes more informative.
Converting Marites' raw scores to z scores based on the performance of her classmates, assume that we find her z score on Psychological Assessment was 1.32 and her z score on Abnormal Psychology was -0.75. Thus, although her raw score on the Abnormal Psychology test was the higher of the two, the z scores tell a different story: compared with her classmates (assuming a normal distribution of scores), Marites performed above average on the Psychological Assessment test and below average on the Abnormal Psychology test. (This interpretation is based on tables detailing distances under the normal curve and the percentage of cases expected to fall above or below a particular standard deviation point, i.e., z score.)
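The conversion described in the example can be sketched as follows. Only Marites' raw scores (24 and 42) and resulting z scores come from the text; the class means and standard deviations are hypothetical values chosen to reproduce them.

```python
# Sketch of the z-score conversion: (raw score - mean) / standard deviation.
# The class statistics below are hypothetical, chosen so the z scores
# match the 1.32 and -0.75 in the worked example.
def z_score(raw, mean, sd):
    return (raw - mean) / sd

z_assessment = z_score(24, mean=17.4, sd=5.0)    # hypothetical class stats
z_abnormal = z_score(42, mean=45.75, sd=5.0)     # hypothetical class stats
print(round(z_assessment, 2), round(z_abnormal, 2))  # 1.32 -0.75
```

Despite the lower raw score, the positive z score shows the better relative standing.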
T Scores
A T score is a standard score with a fixed mean of 50 and a standard deviation of 10; it is obtained by multiplying the normalized standard score by 10 and adding the result to 50 (T = 50 + 10z).
A score of 50 corresponds to the mean, a score of 60 to 1 SD above the mean, and so forth. Some test developers prefer T scores because they eliminate the decimals and the positive and negative signs of z scores.
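The z-to-T transformation above is a one-line linear conversion:

```python
# Sketch of the T-score transformation: T = 50 + 10z.
def t_score(z: float) -> float:
    return 50 + 10 * z

print(t_score(0.0))    # 50.0 (the mean)
print(t_score(1.0))    # 60.0 (1 SD above the mean)
print(t_score(-0.75))  # 42.5; reported T scores are typically rounded,
                       # removing the decimals and negative signs of z
```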
Stanines
Stanines (“standard nines”) are single-digit standard scores from 1 to 9, with a mean of 5 and a standard deviation of 1.96 (approximately 2). Each stanine covers a fixed band of the distribution, except for the open-ended stanines of 1 and 9 at the tails.
Scores are assigned by ranking the individuals: the lowest 4 percent receive a stanine of 1, the next 7 percent receive a stanine of 2, the next 12 percent receive a stanine of 3, and so on through the group (the successive percentages are 4, 7, 12, 17, 20, 17, 12, 7, and 4).
A limitation is that each stanine represents a range of raw scores, and sometimes people do not understand that one number stands for various raw scores.
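The assignment rule above (lowest 4 percent → stanine 1, next 7 percent → stanine 2, next 12 percent → stanine 3, …) can be sketched by mapping a percentile rank onto the cumulative band ceilings; the bands beyond the first three extend the stated pattern with the standard stanine percentages (17, 20, 17, 12, 7, 4).

```python
# Sketch: assigning a stanine from a percentile rank using the cumulative
# ceilings of the 4-7-12-17-20-17-12-7-4 percentage bands.
STANINE_CEILINGS = [4, 11, 23, 40, 60, 77, 89, 96, 100]

def stanine(percentile_rank: float) -> int:
    for s, ceiling in enumerate(STANINE_CEILINGS, start=1):
        if percentile_rank <= ceiling:
            return s
    return 9  # top of the distribution

print(stanine(3))   # 1 (within the lowest 4 percent)
print(stanine(50))  # 5 (middle of the distribution)
print(stanine(98))  # 9
```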
Example: The Otis-Lennon School Ability Test is one example that shows how raw scores are converted to different scales to obtain a logical interpretation of the scores.
Source: https://www.psi-services.net/services/educational-assessment/
Deviation IQs
Deviation IQs are standard scores with an SD that approximates the SD of the Stanford-Binet IQ distribution. The scale resembles an IQ scale because of the use of 100 as the mean: deviations from the mean are converted into standard scores, which typically have a mean of 100 and a standard deviation of 15.
Deviation IQs replaced the ratio IQ (the intelligence quotient: mental age divided by chronological age, multiplied by 100) used in early intelligence tests, and are now preferred over the ratio IQ.
Other instruments, such as the Graduate Record Examination (GRE), likewise report performance on standard-score scales.
CORRELATIONAL STATISTICS
Correlation is concerned with determining the extent to which some things (such as traits, abilities, or interests) are related to other things (such as behavior or intelligence).
For example, if the correlation between tests X and Y is close to +1.00, it can be predicted with confidence that a person who makes a high score on variable X will also make a high score on variable Y, and a person who makes a low score on X will also obtain a low score on Y. On the other hand, if the correlation is close to -1.00, what would your prediction be?
Simple Linear Regression- a procedure for determining the algebraic equation of the best-fitting line for predicting scores on a dependent variable from an independent variable. The product-moment correlation coefficient, which is a measure of the linear relationship between two variables, is actually a by-product of the statistical procedure for finding the equation of the straight line that best fits the set of points representing the paired X-Y values.
TEST RELIABILITY
Reliability refers to the consistency of scores obtained by the same person when retested with the same test or with an equivalent form of the test on different occasions.
(Item analysis, taken up later in this module, is the process of statistically examining the qualities of each item of the test; it includes the item difficulty index and the discrimination index.)
a. Test-Retest- The same test is given to the same group on two occasions separated by a time interval. It yields a coefficient of stability. Source of error: time sampling.
b. Alternate-Form or Parallel-Form- Equivalent tests are given with a time interval between testings: one form of the test on the first testing and another, comparable form on the second. It yields a coefficient of equivalence and a coefficient of stability, reflecting consistency of response to different item samples. Limitations: it is hard to develop two truly equivalent tests, and in developing alternate forms we need to ensure that they are truly parallel; the coefficient may reflect change in behavior over time; and practice effects may tend to reduce the correlation between the two test forms, to the degree that the nature of the test changes with repetition. Source of error: item sampling (and time sampling).
c. Inter-Rater or Inter-Scorer Reliability- Different scorers or observers rate the items or responses independently; used for free responses. It reflects consistency of ratings. Source of error: observer differences.
d. Internal Consistency (e.g., Split-Half)- One test is given at a single session; the test is split into comparable halves (in effect, shortened forms of the test), and scores on the halves are correlated. It yields a coefficient of internal consistency. Source of error: item (content) sampling.
Other things being equal, the longer the test, the more reliable it will be.
Lengthening a test, however, will only increase its consistency in terms of content
sampling, not its stability over time. The effect that lengthening or shortening a test will
have on its coefficient can be estimated by means of the Spearman-Brown formula.
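The Spearman-Brown estimate mentioned above can be sketched as follows; the reliability of .70 is an illustrative value, and k is the factor by which the test is lengthened (k = 2 doubles it, k = 0.5 halves it).

```python
# Sketch of the Spearman-Brown formula: estimated reliability of a test
# lengthened (or shortened) by a factor k, given current reliability r.
def spearman_brown(r: float, k: float) -> float:
    return (k * r) / (1 + (k - 1) * r)

# Doubling a test (k = 2) whose reliability coefficient is .70:
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```

As the text notes, the formula addresses only consistency in content sampling; it says nothing about stability over time.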
If a client took a test many (say, 100) times, the obtained scores would form a hypothetical distribution around the client's true score. The mean of this hypothetical score distribution is the person's true score on the test; no single obtained score can be assumed to equal it, because each obtained score is the true score plus some measurement error.
The formula for calculating the standard error of measurement (SEM) is:
SEM = s√(1 − r)
where s represents the standard deviation of the test scores and r is the reliability coefficient.
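The SEM formula above can be sketched directly; the standard deviation of 15 and reliability of .91 are illustrative values.

```python
# Sketch of the SEM formula given above: SEM = s * sqrt(1 - r).
import math

def sem(s: float, r: float) -> float:
    return s * math.sqrt(1 - r)

# Illustrative values: SD of 15, reliability coefficient of .91:
print(round(sem(15, 0.91), 2))  # 4.5
```

Note that as reliability r approaches 1.00, the SEM shrinks toward zero: a perfectly reliable test would measure without error.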
Question:
Given this information, how would you help Anne, if you are the counsellor?
If Anne is applying to a graduate program that only admits students with GRE-V scores
of 600 or higher, what are her chances of being admitted?
As a Psychologist or Counselor, one might assist Anne in examining her GRE scores and
considering other options or other graduate programs.
TEST VALIDITY
The degree to which a test measures what it purports (i.e., what it is supposed) to measure, when compared with accepted criteria (Anastasi & Urbina, 1997).
TYPES OF VALIDITY

CONTENT VALIDITY
Purpose: to determine whether the test items adequately represent the domain of content the test is intended to cover.
Procedure: compare the test content with the content domain; utilize systematic observation of behavior (observe the skills and competencies needed to perform a given task).
Types of tests: surveys, achievement tests.

CRITERION-RELATED VALIDITY
Purpose: to predict performance on another measure, or to predict an individual's behavior in specified situations, using a rating, observation, or another test as the criterion.
Concurrent- the criterion measure (e.g., ratings of a worker's performance) is obtained at the same time as the test scores. Types of tests: aptitude tests, ability tests.
Predictive- the criterion measure is to be obtained in the future: test scores are correlated with a criterion measure obtained after a period of time. The goal is to have test scores accurately predict the identified criterion performance (e.g., the predictive validities of admission tests). Types of tests: scholastic aptitude tests, general aptitude batteries, prognostic tests, readiness tests, intelligence tests.

CONSTRUCT VALIDITY
A construct is not directly observable but is usually derived from theory, research, or observation.
Purpose: to determine whether a construct exists and to understand the traits or concepts that make up the set of scores or items; the extent to which a test measures a theoretical construct or trait, such as intelligence, mechanical comprehension, or anxiety. It involves the gradual accumulation of evidence.
Procedure: conduct multivariate statistical analyses such as factor analysis, discriminant analysis, and multivariate analysis of variance. It requires evidence that supports the interpretation of test scores in line with the theoretical implications associated with the construct label. The authors should precisely define each construct and distinguish it from other constructs.
Types of tests: intelligence tests, aptitude tests, personality tests.
Validity Coefficient- the correlation between the scores on an instrument and the criterion measure.
ITEM ANALYSIS
A general term for procedures designed to assess the utility or validity of a set of
test items.
Validity concerns the entire instrument, while item analysis examines the
qualities of each item.
It is done during test construction and revision; provides information that can be
used to revise or edit problematic items or eliminate faulty items.
Item Difficulty Index
• It reflects the proportion of people getting the item correct, calculated by dividing the number of individuals who answered the item correctly by the total number of people.
• The item difficulty index can range from .00 (meaning no one got the item correct) to 1.00 (meaning everyone got the item correct).
• Item difficulty actually indicates how easy the item is, because it provides the proportion of individuals who got the item correct.
Example: in a test where 15 of the students in a class of 25 got the first item on the test correct:
p = 15/25 = .60
• The desired item difficulty depends on the purpose of the assessment, the group taking the instrument, and the format of the item.
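The worked example above (15 of 25 students correct) can be sketched as:

```python
# Sketch: item difficulty index p = number correct / total examinees,
# using the worked example above (15 of 25 students correct).
def item_difficulty(num_correct: int, num_examinees: int) -> float:
    return num_correct / num_examinees

p = item_difficulty(15, 25)
print(p)  # 0.6
```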
Item Discrimination Index
• Calculated by subtracting the proportion of examinees in the lower group who got the item correct (or endorsed it in the expected manner) from the proportion of examinees in the upper group who did so.
• Item discrimination indices can range from +1.00 (all of the upper group got the item right and none of the lower group did) to −1.00 (none of the upper group got it right and all of the lower group did).
• The determination of the upper and lower groups depends on the distribution of scores. For a normal distribution, use the upper 27% for the upper group and the lower 27% for the lower group (Kelley, 1939). For small groups, Anastasi and Urbina (1997) suggest upper and lower groups in the range of 25% to 33%.
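The subtraction described above can be sketched as follows; the upper- and lower-group proportions are hypothetical.

```python
# Sketch: item discrimination index D = proportion correct in the upper
# group minus proportion correct in the lower group.
def discrimination_index(p_upper: float, p_lower: float) -> float:
    return p_upper - p_lower

# Hypothetical item: 80% of the upper group and 30% of the lower group
# answered correctly.
print(round(discrimination_index(0.80, 0.30), 2))  # 0.5
```

A positive D means the item separates high scorers from low scorers in the expected direction; a negative D flags a faulty item.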
ITEM RESPONSE THEORY (IRT)
• A theory of testing in which item scores are expressed in terms of estimated scores on a latent-ability continuum.
• It rests on the assumption that an examinee's performance on a test item can be predicted by a set of factors called traits, latent traits, or abilities.
• Using IRT, we get an indication of an individual's performance based not on the total score, but on the precise items the person answers correctly.
• It suggests that the relationship between examinees' item performance and the underlying trait being measured can be described by an item characteristic curve.
Rasch Model- a one-parameter (item difficulty) model for scaling test items for purposes of item analysis and test standardization. The model is based on the assumption that guessing and item discrimination are negligible parameters. As with other latent-trait models, the Rasch model relates examinees' performance on test items (percentage passing) to their estimated standings on a hypothetical latent-ability trait or continuum.
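One common logistic form of the one-parameter (Rasch) model can be sketched as follows: the probability of a correct response depends only on the difference between the examinee's latent ability (theta) and the item's difficulty (b); the example theta and b values are hypothetical.

```python
# Sketch of the Rasch (one-parameter logistic) model: the probability
# that an examinee with latent ability theta answers an item of
# difficulty b correctly.
import math

def rasch_probability(theta: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(rasch_probability(0.0, 0.0))            # 0.5 (ability equals difficulty)
print(round(rasch_probability(1.0, 0.0), 2))  # 0.73
```

Plotting this probability against theta for a fixed b traces the item characteristic curve referred to above.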
References
Anastasi, A., & Urbina, S. (1997). Psychological Testing (7th ed.). New York: Macmillan Publishing.
Aiken, L. R. (2000). Psychological Testing and Assessment. Boston: Allyn and Bacon.
Cohen, R. J., & Swerdlik, M. E. (2018). Psychological Testing and Assessment (9th ed.). New York: McGraw-Hill.
Del Pilar, G. H. (2015). Scale Construction: Principles and Procedures. Workshop PowerPoint presentation, AASP-PAP, Cebu City.
Kaplan, R. M., & Saccuzzo, D. P. (1997). Psychological Testing: Principles, Applications, and Issues (4th ed.). California: Brooks/Cole Publishing.
Orense, C., & Parena, J. (2014). Lecture in Psychological Assessment. Review Manual in RGC Licensure Examination, Assumption College, Makati.
Walsh, W. B., & Betz, N. E. (1995). Tests and Assessment. New Jersey: Prentice Hall.
Morrison, J. (2014). DSM-5 Made Easy: The Clinician's Guide to Diagnosis. New York: The Guilford Press.
Others:
Manual of psychological tests
Psychological Resources Center – test brochures and test descriptions.
www.AssessmentPsychology.com
Ethical Guidelines of the Psychological Association of the Philippines (pap.ph)