EDUC 203 - Assessment in Learning 2 Module (Midterm)


Reaccredited Level IV by the Accrediting Agency of Chartered Colleges and Universities of the Philippines

Assessment in Learning 2
EDUC 203

Mark P. Castillo, LPT


BEEd Generalist Graduate
Board Topnotcher
1st place, September 2016 LET
[email protected]

Ezra Gil S. Lagman, LPT


BSECE Graduate with 18 units of Professional Education
Civil Service Examination Professional Level Passer
Board Topnotcher
9th place, September 2018 LET
[email protected]

Stephen John B. Caducoy, LPT


BSEd Mathematics Graduate (Magna Cum Laude)
Board Topnotcher
10th place, September 2018 LET
[email protected]
Course Description
This is a course that focuses on the principles, development and utilization of alternative
forms of assessment in measuring authentic learning. It emphasizes how to assess
process- and product-oriented learning outcomes as well as affective learning. Students
will experience how to develop rubrics and other assessment tools for performance-based
and product-based assessment.

Course Outline
I. Basic Concepts in Assessment
II. Principles of High Quality Assessment
III. Measures of Central Tendency and Variability
IV. Performance-based Assessment
V. Assessment in the Affective Domain
VI. Portfolio Assessment Methods
VII. Educational Evaluation
VIII. Grading and Reporting

Purpose and Rationale


The College of Teacher Education, as part of its commitment to supporting equity of access to
Higher Education for all students, has developed this module for use by both teachers and
students to support them in building the skills needed to access quality education.

The purpose of this module is to develop an understanding of the principles, development
and utilization of alternative forms of assessment in measuring authentic learning.

Through this instructional module, the students will be able to:


1. Recall the basic concepts in assessment such as evaluation, assessment, measurement and test.
2. Discuss the different learning domains and distinguish among validity, reliability, practicability and
other properties of assessment methods.
3. Compute for the values of the measures of central tendency and of variability.
4. Explain the concepts of process- and product-oriented performance-based assessment and
construct scoring rubrics for assessment.
5. Clarify the different learning competencies in the affective domain and develop assessment tools
for the affective domain.
6. Discuss the different types of portfolios and the methods on how to assess portfolios.
7. Delineate evaluation and discuss the different approaches and methods of evaluation.
8. Apply the principles in assigning grades and implementing grading systems.

Chapter 1 – Basic Concepts in Assessment
At the end of this chapter, the students will be able to:

1. Distinguish among test, measurement, evaluation and assessment.


2. Explain the meaning of assessment FOR, OF, and AS learning.

1.1 Basic Concepts


a. Test is defined as an instrument, tool or technique used to obtain a sample of an
individual's behaviour using standardized procedures.
b. Measurement is a set of rules for assigning numbers to represent objects, traits,
attributes, or behaviors.
c. Evaluation is the process of making judgments based on criteria and evidence,
and determining the extent to which instructional objectives are attained.
d. Assessment is the process of describing, collecting (gathering/ documenting),
recording, scoring, and interpreting information about learning.

1.2 Purposes of Assessment


a. Assessment FOR learning
The preposition "for" in assessment for learning implies that assessment is done
to improve and ensure learning. This is referred to as FORmative assessment,
assessment that is given while the teacher is in the process of student formation.
It ensures that learning is going on while the teacher is in the process of teaching.

b. Assessment OF learning
It is usually given at the end of a unit, grading period or a term like a semester. It
is meant to assess learning for grading purposes, thus the term assessment of
learning.

c. Assessment AS learning
It is associated with self-assessment. As the term implies, assessment by itself is
already a form of learning for the students.
As students assess their own work (e.g. a paragraph) and/or with their peers with
the use of scoring rubrics, they learn on their own what a good paragraph is. At
the same time, as they are engaged in self-assessment, they learn about themselves
as learners and become aware of how they learn. In short, in assessment AS learning,
students set their targets, actively monitor and evaluate their own learning in
relation to their set target. As a consequence, they become self-directed or
independent learners. By assessing their own learning, they are learning at the
same time.

[Figure: Various Approaches to Assessment – assessment FOR learning (placement,
diagnostic and formative assessment), assessment OF learning (summative assessment),
and assessment AS learning (self-assessment).]

Other terms in assessment include:

• Placement assessment – used to place students, according to prior achievement
or personal characteristics, at the most appropriate point in an instructional
sequence, in a unique instructional strategy, or with a suitable teacher.
• Diagnostic assessment – used to identify the strengths and weaknesses of the
students.
• Summative assessment – generally carried out at the end of a course or
project. In an educational setting, summative assessments are typically used to
assign students a course grade. Summative assessments are evaluative: they are
made to summarize what the students have learned and to determine whether they
understand the subject matter well.

Exercises
A. Determine whether the following statements are test, measurement, assessment
or evaluation.
1. Over-all goal is to provide information regarding the extent of attainment of
student learning outcomes.
2. Uses such instruments as ruler, scale, or thermometer.
3. Process designed to aid educators make judgment and indicates solutions to
academic situations.
4. Results show the more permanent learning and clear picture of student's
ability.
5. Instrument to gather data
B. "All tests are forms of assessment, but not all assessments are tests." Do you
agree? Why or why not?
C. Assessment for learning is "when the cook tastes the food" while assessment of
learning is "when the guest tastes the food." Do you agree? Why or why not?
D. List down three (3) activities or processes involved in each of the following:
1. Measurement
(a)
(b)
(c)
2. Assessment
(a)
(b)
(c)
3. Evaluation
(a)
(b)
(c)

Chapter 2 – Principles of High Quality Assessment
At the end of this chapter, the students will be able to:

1. Discuss the different learning domains.


2. Distinguish among validity, reliability, practicability and other properties of assessment
methods.

2.1 Clarity of Learning Targets


Assessment can be made precise, accurate and dependable only if what is to be
achieved is clearly stated and feasible. To this end, we consider learning targets
involving knowledge, reasoning skills, products and effects. Learning targets need to be
stated in behavioral terms or terms that denote something which can be observed
through the behavior of the students.

2.1.1 Cognitive Targets


As early as the 1950s, Bloom (1956) proposed a hierarchy of educational objectives at
the cognitive level. These are:

Level 1. Knowledge refers to the acquisition of facts, concepts and theories.

Level 2. Comprehension refers to the same concept as "understanding". It is a step higher


than mere acquisition of facts and involves a cognition or awareness of the interrelationships
of facts and concepts.

Level 3. Application refers to the transfer of knowledge from one field of study to another or
from one concept to another concept in the same discipline.

Level 4. Analysis refers to the breaking down of a concept or idea into its components and
explaining the concept as a composition of these components.

Level 5. Synthesis refers to the opposite of analysis and entails putting together the
components in order to summarize the concept.

Level 6. Evaluation refers to valuing and judgment or putting worth to a concept or principle.

2.1.2 Skills, Competencies and Abilities Targets
Skills refer to specific activities or tasks that a student can proficiently do. Skills can be
clustered together to form specific competencies. Related competencies characterize a
student's ability. It is important to recognize a student's ability in order that the program of
study can be so designed as to optimize his/her innate abilities.

Abilities can be roughly categorized into: cognitive, psychomotor and affective abilities. For
instance, the ability to work well with others and to be trusted by every classmate (affective
ability) is an indication that the student can most likely succeed in work that requires
leadership abilities. On the other hand, other students are better at doing things alone like
programming and web designing (cognitive ability) and, therefore, they would be good at
highly technical individualized work.

2.1.3 Products, Outputs and Projects Targets


Products, outputs and projects are tangible and concrete evidence of a student's ability. A
clear target for products and projects needs to clearly specify the level of workmanship of
such projects. For instance, an expert output may be characterized by the indicator "at
most two (2) imperfections noted" while a skilled-level output can be characterized by the
indicator "at most four (4) imperfections noted," etc.

2.2 Appropriateness of Assessment Methods


Once the learning targets are clearly set, it is now necessary to determine an appropriate
assessment procedure or method. We discuss the general categories of assessment
methods or instruments below.

2.2.1 Written-Response Instruments


Written-response instruments include objective tests (multiple choice, true-false,
matching or short answer tests), essays and checklists. Objective tests are appropriate
for assessing the various levels of hierarchy of educational objectives. Multiple choice tests
in particular can be constructed in such a way as to test higher order thinking skills.
Essays, when properly planned, can test the student's grasp of the higher level of cognitive
skills. However, when the essay question is not sufficiently precise and when the
parameters are not properly defined, there is a tendency for the students to write irrelevant
and unnecessary things just to fill in blank spaces. When this happens, both the teacher
and the students will experience difficulty and frustration.
2.2.2 Product Rating Scales
A teacher is often tasked to rate products. Examples of products that are frequently rated
in education are book reports, maps, charts, diagrams, notebooks, essays and creative
endeavors of all sorts. An example of a product rating scale is the classic 'handwriting'
scale used in the California Achievement Test, Form W (1957), which provides prototype
handwriting specimens of pupils and students. The sample handwriting of a student is then
moved along the scale until its quality is most similar to that of a prototype specimen. To
rate products in education in this way, the teacher must possess prototype products gathered
over his/her years of experience.

2.2.3 Performance Tests


One of the most frequently used measurement instruments is the checklist. A
performance checklist consists of a list of behaviors that make up a certain type of
performance. It is used to determine whether or not an individual behaves in a certain way
when asked to complete a particular task. If a particular behavior is present when an
individual is observed, the teacher places a check opposite it on the list.

2.2.4 Oral Questioning


The ancient Greeks used oral questioning extensively as an assessment method.
Socrates himself, considered the epitome of a teacher, was said to have handled his
classes solely based on questioning and oral interactions.

Oral questioning is an appropriate assessment method when the objectives are:


(a) To assess the student's stock knowledge
(b) To determine the student's ability to communicate ideas in coherent verbal
sentences.
While oral questioning is indeed an option for assessment, several factors need to be
considered when using this option. Of particular significance are the student's state of
mind and feelings, anxiety and nervousness in making oral presentations which could
mask the student's true ability.

2.2.5 Observation and Self-Reports


A tally sheet is a device often used by teachers to record the frequency of student
behaviors, activities or remarks. How many high school students follow instructions during
a fire drill, for example? How many instances of aggression or helpfulness are observed
when elementary students are observed in the playground? Observational tally sheets are
most useful in answering these kinds of questions.

A self-checklist is a list of several characteristics or activities presented to the subjects of
a study. The individuals are asked to study the list and then to place a mark opposite the
characteristics which they possess or the activities which they have engaged in for a
particular length of time. Self-checklists are often employed by teachers when they want to
diagnose or to appraise the performance of students from the point of view of the students
themselves.

Observation and self-reports are useful supplementary assessment methods when used in
conjunction with oral questioning and performance tests. Such methods can offset the
negative impact on the students brought about by their fears and anxieties during oral
questioning or when performing an actual task under observation. However, since there is a
tendency to overestimate one's capabilities, it may be useful to consider weighing self-
assessment and observational reports against the results of oral questioning and
performance tests.

2.3 Properties of Assessment Methods


The quality of the assessment instrument and method used in education is very important
since the evaluation and judgment that the teacher gives on a student are based on the
information he obtains using these instruments. Accordingly, teachers follow a number of
procedures to ensure that the entire assessment process is valid and reliable.

2.3.1 Validity

Validity is the extent to which a test measures what it is supposed to measure; it also
refers to the appropriateness, correctness, meaningfulness and usefulness of the
specific decisions a teacher makes based on the test results.
The first definition refers to the test itself while the second refers to the decisions made by
the teacher based on the test. A test is valid when it is aligned with the learning outcome.
A teacher who conducts test validation might want to gather different kinds of evidence.
There are essentially three (3) main types of evidence that may be collected:
a. Content-related evidence of validity refers to the content and format of the
instrument. How appropriate is the content? How comprehensive? Does it logically
get at the intended variable? How adequately does the sample of items or
questions represent the content to be assessed?
b. Criterion-related evidence of validity refers to the relationship between
scores obtained using the instrument and scores obtained using one or more other
tests (often called criterion). How strong is this relationship? How well do such
scores estimate present or predict future performance of a certain type?
c. Construct-related evidence of validity refers to the nature of the
psychological construct or characteristic being measured by the test. How well
does a measure of the construct explain differences in the behaviour of the
individuals or their performance on a certain task?
The usual procedure for determining content validity may be described as follows: The
teacher writes out the objectives of the test based on the Table of Specifications and then
gives these together with the test to at least two (2) experts along with a description of the
intended test takers. The experts look at the objectives, read over the items in the test and
place a check mark in front of each question or item that they feel does not measure one
or more objectives. They also place a check mark in front of each objective not assessed
by any item in the test. The teacher then rewrites any item checked and resubmits to the
experts and/or writes new items to cover those objectives not covered by the existing test.
This continues until the experts approve of all items and also until the experts agree that all
of the objectives are sufficiently covered by the test.
In order to obtain evidence of criterion-related validity, the teacher usually compares
scores on the test in question with the scores on some other independent criterion test
which presumably already has high validity. For example, if a test is designed to measure
mathematics ability of students and it correlates highly with a standardized mathematics
achievement test (external criterion), then we say we have high criterion-related evidence
of validity. In particular, this type of criterion-related validity is called its concurrent
validity. Another type of criterion-related validity is called predictive validity wherein the
test scores in the instrument are correlated with scores on a later performance (criterion
measure) of the students. For example, the mathematics ability test constructed by the
teacher may be correlated with their later performance in a division – wide mathematics
achievement test.
Another type of validity is face validity, which is the extent to which a test is
subjectively viewed as covering the concept it tries to measure.
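
To make criterion-related (concurrent) validity concrete, the sketch below correlates scores on a teacher-made test with scores on an external criterion test using the Pearson r. Both score lists and the helper function are illustrative assumptions, not data from this module.

```python
# Minimal sketch: concurrent validity as the correlation between a teacher-made
# test and an external criterion test. All scores below are made up.

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

teacher_test = [70, 72, 75, 77, 78, 80, 84, 87, 90, 92]   # hypothetical classroom test
criterion    = [68, 75, 73, 80, 79, 82, 85, 88, 93, 95]   # hypothetical standardized test

# A coefficient close to +1 would be taken as strong criterion-related evidence.
print(round(pearson_r(teacher_test, criterion), 2))
```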
2.3.2 Reliability
Reliability refers to the consistency of the scores obtained – how consistent they are for
each individual from one administration of an instrument to another and from one set of
items to another.

Reliability and validity are related concepts. If an instrument is unreliable, it cannot yield
valid outcomes. As reliability improves, validity may also improve (or it may not); however, if an
instrument is shown scientifically to be valid, then it is almost certain that it is also reliable.
Something reliable is something that works well and that you can trust.
A reliable test is a consistent measure of what it is supposed to measure.

The following table is a standard followed almost universally in educational testing and
measurement.

Reliability       Interpretation
0.90 and above    Excellent reliability; at the level of the best standardized tests
0.80 – 0.90       Very good for a classroom test
0.70 – 0.80       Good for a classroom test; in the range of most classroom tests. There
                  are probably a few items which could be improved
0.60 – 0.70       Somewhat low. This test needs to be supplemented by other
                  measures (more tests) to determine grades. There are probably
                  some items which could be improved
0.50 – 0.60       Suggests need for revision of the test, unless it is quite short (ten or
                  fewer items). The test definitely needs to be supplemented by
                  other measures (more tests) for grading
0.50 or below     Questionable reliability. This test should not contribute heavily to
                  the course grade and it needs revision

The Reliability Coefficient


The reliability coefficient is symbolized with the letter "r" and a subscript that contains two
of the same letters or numbers (e.g., rxx). The subscript indicates that the correlation
coefficient was calculated by correlating a test with itself rather than with some other
measure.
Note that a reliability coefficient does not provide any information about what is actually
being measured by a test!
A reliability coefficient only indicates whether the attribute measured by the test—
whatever it is—is being assessed in a consistent, precise way.
Methods for Estimating Reliability
The selection of a method for estimating reliability depends on the nature of the test. Each
method not only entails different procedures but is also affected by different sources of
error. For many tests, more than one method should be used.
a. Test – retest Reliability - The test-retest method for estimating reliability involves
administering the same test to the same group of examinees on two different occasions
and then correlating the two sets of scores.
When using this method, the reliability coefficient indicates the degree of stability
(consistency) of examinees' scores over time and is also known as the coefficient of
stability.
The primary sources of measurement error for test-retest reliability are any random factors
related to the time that passes between the two administrations of the test. These time
sampling factors include random fluctuations in examinees over time (e.g., changes in
anxiety or motivation) and random variations in the testing situation.
Memory and practice also contribute to error when they have random carryover effects;
i.e., when they affect many or all examinees but not in the same way.
Test-retest reliability is appropriate for determining the reliability of tests designed to
measure attributes that are relatively stable over time and that are not affected by repeated
measurement. It would be appropriate for a test of aptitude, which is a stable

characteristic, but not for a test of mood, since mood fluctuates over time, or a test of
creativity, which might be affected by previous exposure to test items.
b. Alternate (Equivalent, Parallel) Forms Reliability
To assess a test's alternate forms reliability, two equivalent forms of the test are
administered to the same group of examinees and the two sets of scores are correlated.
Alternate forms reliability indicates the consistency of responding to different item samples
(the two test forms) and, when the forms are administered at different times, the
consistency of responding over time.
The alternate forms reliability coefficient is also called the coefficient of equivalence when
the two forms are administered at about the same time, and the coefficient of equivalence and
stability when a relatively long period of time separates administration of the two forms.
The primary source of measurement error for alternate forms reliability is content sampling,
or error introduced by an interaction between different examinees' knowledge and the
different content assessed by the items included in the two forms (e.g.: Form A and Form
B). The items in Form A might be a better match of one examinee's knowledge than items
in Form B, while the opposite is true for another examinee.
In this situation, the two scores obtained by each examinee will differ, which will lower the
alternate forms reliability coefficient. When administration of the two forms is separated by
a period of time, time sampling factors also contribute to error.
Like test-retest reliability, alternate forms reliability is not appropriate when the attribute
measured by the test is likely to fluctuate over time (and the forms will be administered at
different times) or when scores are likely to be affected by repeated measurement.
If the same strategies required to solve problems on Form A are used to solve problems
on Form B, even if the problems on the two forms are not identical, there are likely to be
practice effects. When these effects differ for different examinees (i.e., are random),
practice will serve as a source of measurement error.
Although alternate forms reliability is considered by some experts to be the most rigorous
(and best) method for estimating reliability, it is not often assessed due to the difficulty in
developing forms that are truly equivalent.
c. Internal Consistency Reliability
Reliability can also be estimated by measuring the internal consistency of a test.
Split-half reliability and coefficient alpha are two methods for evaluating internal
consistency. Both involve administering the test once to a single group of examinees, and
both yield a reliability coefficient that is also known as the coefficient of internal
consistency.
To determine a test's split-half reliability, the test is split into equal halves so that each
examinee has two scores (one for each half of the test). Scores on the two halves are then
correlated. Tests can be split in several ways, but probably the most common way is to
divide the test on the basis of odd- versus even-numbered items.
A problem with the split-half method is that it produces a reliability coefficient that is based
on test scores that were derived from one-half of the entire length of the test. If a test
contains 30 items, each score is based on 15 items. Because reliability tends to decrease
as the length of a test decreases, the split-half reliability coefficient usually underestimates
a test's true reliability.
For this reason, the split-half reliability coefficient is ordinarily corrected using the
Spearman-Brown prophecy formula, which provides an estimate of what the reliability
coefficient would have been had it been based on the full length of the test.
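
As a rough illustration of the procedure just described, here is a minimal Python sketch, assuming a small hypothetical matrix of dichotomously scored items: it splits the test into odd- and even-numbered items, correlates the two half scores, and applies the Spearman-Brown correction.

```python
# Minimal sketch of split-half reliability with the Spearman-Brown correction.
# 'responses' is a hypothetical item-score matrix (rows = examinees, 1 = correct).

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],
]

# Half scores based on odd- versus even-numbered items.
odd_scores  = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]

r_half = pearson_r(odd_scores, even_scores)   # reliability of a half-length test
r_full = (2 * r_half) / (1 + r_half)          # Spearman-Brown correction to full length
print(round(r_half, 2), round(r_full, 2))
```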
Cronbach's coefficient alpha also involves administering the test once to a single group
of examinees. However, rather than splitting the test in half, a special formula is used to
determine the average degree of inter-item consistency.
One way to interpret coefficient alpha is as the average reliability that would be obtained
from all possible splits of the test. Coefficient alpha tends to be conservative and can be
considered the lower boundary of a test's reliability (Novick and Lewis, 1967).
When test items are scored dichotomously (right or wrong), a variation of coefficient alpha
known as the Kuder-Richardson Formula 20 (KR-20) can be used.
The Kuder-Richardson formulas are among the more frequently employed for determining internal
consistency, particularly KR20 and KR21. We present the latter formula (KR21) since
KR20 is more difficult to calculate and requires a computer program:

KR21 = [k / (k – 1)] × [1 – x̄(k – x̄) / (k·s²)]

where

k = the number of items on the test
x̄ = the mean of the test scores
s² = the variance of the test scores
Example:

A 30 item test was administered to a group of 30 students. The mean score was 25 while
the standard deviation was 3. Compute the KR21 index of reliability.

So, with k = 30, x̄ = 25 and s² = 3² = 9,

KR21 = (30 / 29) × [1 – 25(30 – 25) / (30 × 9)]
     = (1.034)(1 – 0.463)
     ≈ 0.56
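
A short sketch of the same KR21 computation, so the arithmetic above can be checked by machine; the function name is an illustrative choice, not a standard library call.

```python
# Minimal sketch of the KR21 computation for the example above
# (a 30-item test with mean 25 and standard deviation 3).

def kr21(k, mean, sd):
    """Kuder-Richardson Formula 21 from the number of items, the mean and the SD."""
    variance = sd ** 2
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))

print(round(kr21(k=30, mean=25, sd=3), 2))   # -> 0.56
```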

Content sampling is a source of error for both split-half reliability and coefficient alpha.
For split-half reliability, content sampling refers to the error resulting from differences
between the content of the two halves of the test (i.e., the items included in one half may
better fit the knowledge of some examinees than items in the other half);
For coefficient alpha, content (item) sampling refers to differences between individual test
items rather than between test halves. Coefficient alpha also has as a source of error, the
heterogeneity of the content domain. A test is heterogeneous with regard to content
domain when its items measure several different domains of knowledge or behavior.
The greater the heterogeneity of the content domain, the lower the inter-item correlations
and the lower the magnitude of coefficient alpha.
Coefficient alpha could be expected to be smaller for a 200-item test that contains items
assessing knowledge of test construction, statistics, ethics, epidemiology, environmental
health, social and behavioral sciences, rehabilitation counseling, etc. than for a 200-item
test that contains questions on test construction only.
The methods for assessing internal consistency reliability are useful when a test is
designed to measure a single characteristic, when the characteristic measured by the test
fluctuates over time, or when scores are likely to be affected by repeated exposure to the
test. They are not appropriate for assessing the reliability of speed tests because, for these
tests, they tend to produce spuriously high coefficients. (For speed tests, alternate forms
reliability is usually the best choice.)

d. Inter-Rater (Inter-scorer, Inter-Observer) Reliability
Inter-rater reliability is of concern whenever test scores depend on a rater's judgment.
A test constructor would want to make sure that an essay test, a behavioral observation
scale, or a projective personality test have adequate inter-rater reliability. This type of
reliability is assessed either by calculating a correlation coefficient (e.g., a kappa
coefficient or coefficient of concordance) or by determining the percent agreement
between two or more raters.
Although the latter technique is frequently used, it can lead to erroneous conclusions since
it does not take into account the level of agreement that would have occurred by chance
alone. This is a particular problem for behavioral observation scales that require raters to
record the frequency of a specific behavior.
In this situation, the degree of chance agreement is high whenever the behavior has a high
rate of occurrence, and percent agreement will provide an inflated estimate of the
measure's reliability.
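
To see why percent agreement can overstate reliability, the sketch below compares it with Cohen's kappa, which subtracts the agreement expected by chance; the two sets of yes/no ratings are invented purely for illustration.

```python
# Minimal sketch: percent agreement versus Cohen's kappa for two raters.
# The ratings are hypothetical yes/no judgments on the same ten students.

def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    n = len(r1)
    p_observed = percent_agreement(r1, r2)
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    categories = set(r1) | set(r2)
    p_chance = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

rater1 = ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no"]
rater2 = ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "yes"]

print(percent_agreement(rater1, rater2))        # 0.8 looks respectable...
print(round(cohens_kappa(rater1, rater2), 2))   # ...but kappa is only 0.38 once chance is removed
```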
Sources of error for inter-rater reliability include factors related to the raters such as lack of
motivation and rater biases and characteristics of the measuring device.
An inter-rater reliability coefficient is likely to be low, for instance, when rating categories
are not exhaustive (i.e., don't include all possible responses or behaviors) and/or are not
mutually exclusive.
The inter-rater reliability of a behavioral rating scale can also be affected by consensual
observer drift, which occurs when two (or more) observers working together influence each
other's ratings so that they both assign ratings in a similarly idiosyncratic way. (Observer
drift can also affect a single observer's ratings when he or she assigns ratings in a
consistently deviant way.) Unlike other sources of error, consensual observer drift tends to
artificially inflate inter-rater reliability.
The reliability (and validity) of ratings can be improved in several ways:
• Consensual observer drift can be eliminated by having raters work
independently or by alternating raters.
• Rating accuracy is also improved when raters are told that their ratings will
be checked.
• Overall, the best way to improve both inter- and intra-rater accuracy is to
provide raters with training that emphasizes the distinction between
observation and interpretation.
Factors that affect the Reliability Coefficient
The magnitude of the reliability coefficient is affected not only by the sources of error
discussed earlier, but also by the length of the test, the range of the test scores, and the
probability that the correct response to items can be selected by guessing.
a. Test Length - The larger the sample of the attribute being measured by a test, the less
the relative effects of measurement error and the more likely the sample will provide
dependable, consistent information.
Consequently, a general rule is that the longer the test, the larger the test's reliability
coefficient.
The Spearman-Brown prophecy formula is most associated with split-half reliability but can
actually be used whenever a test developer wants to estimate the effects of lengthening or
shortening a test on its reliability coefficient.
For instance, if a 100-item test has a reliability coefficient of .84, the Spearman-Brown
formula could be used to estimate the effects of increasing the number of items to 150 or
reducing the number to 50. A problem with the Spearman-Brown formula is that it does not
always yield an accurate estimate of reliability: In general, it tends to overestimate a test's
true reliability. This is most likely to be the case when the added items do not measure the
same content domain as the original items and/or are more susceptible to the effects of
measurement error.
Note that, when used to correct the split-half reliability coefficient, the situation is more
complex, and this generalization does not always apply: When the two halves are not
equivalent in terms of their means and standard deviations, the Spearman-Brown formula
may either over- or underestimate the test's actual reliability.

rkk = (k × r11) / [1 + (k – 1) × r11]

Where:
rkk = reliability of a test "k" times as long as the original test
r11 = reliability of the original test
k = factor by which the length of the test is changed. To find k, divide the number
of items on the new test by the number of items on the original. If you had 10 items
on the original and 20 on the new, k would be 20 / 10 = 2.

Example:
A test made up of 12 items has reliability (r11) of 0.68. If the number of items is doubled to
24, will the reliability of the test improve?
Solution:
r11 = 0.68
k = 24 / 12 = 2
So,

rkk = (2 × 0.68) / [1 + (2 – 1)(0.68)] = 1.36 / 1.68 ≈ 0.81

Doubling the test increases the reliability from .68 to .81.


Note: for the formula to work properly, the two tests must be equivalent in difficulty. If you
double a test and add only easy questions, the results will be invalid
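
A small sketch of the prophecy formula, reproducing the example above (doubling a 12-item test with r11 = 0.68) and the earlier 100-item case shortened to 50 items; the function name is an illustrative choice.

```python
# Minimal sketch of the Spearman-Brown prophecy formula.

def spearman_brown(r_original, k):
    """Projected reliability when a test is made k times as long as the original."""
    return (k * r_original) / (1 + (k - 1) * r_original)

print(round(spearman_brown(0.68, 2), 2))    # doubling a 12-item test: 0.68 -> 0.81
print(round(spearman_brown(0.84, 0.5), 2))  # halving a 100-item test: 0.84 -> 0.72
```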
b. Range of Test Scores
Since the reliability coefficient is a correlation coefficient, it is maximized when the range of
scores is unrestricted.
The range is directly affected by the degree of similarity of examinees with regard to the
attribute measured by the test.
When examinees are heterogeneous, the range of scores is maximized.
The range is also affected by the difficulty level of the test items. When all items are either
very difficult or very easy, all examinees will obtain either low or high scores, resulting in a
restricted range.
Therefore, the best strategy is to choose items so that the average difficulty level is in the
mid-range (difficulty index p = .50).
c. Guessing
A test's reliability coefficient is also affected by the probability that examinees can guess
the correct answers to test items. As the probability of correctly guessing answers
increases, the reliability coefficient decreases.
All other things being equal, a true/false test will have a lower reliability coefficient than a
four-alternative multiple-choice test which, in turn, will have a lower reliability coefficient
than a free recall test.

2.3.3 Fairness
An assessment procedure needs to be fair. This means many things:

First, students need to know exactly what the learning targets are and what method of
assessment will be used. If students do not know what they are supposed to be achieving,
then they could get lost in the maze of concepts being discussed in class. Likewise,
students have to be informed how their progress will be assessed in order to allow them to
strategize and optimize their performance.

Second, assessment has to be viewed as an opportunity to learn rather than an


opportunity to weed out poor and slow learners. The goal should be that of diagnosing the
learning process rather than the learning object.

Third, fairness also implies freedom from teacher-stereotyping. Some examples of


stereotyping include: boys are better than girls in Math or girls are better than boys in
language. Such stereotyped images and thinking could lead to unnecessary and unwanted
biases in the way that teachers assess their students.

2.3.4 Practicality and Efficiency


Another quality of a good assessment procedure is practicality and efficiency. An
assessment procedure should be practical in the sense that the teacher is familiar
with it, it does not require too much time, and it is, in fact, implementable. A complex
assessment procedure tends to be difficult to score and interpret, resulting in frequent
misdiagnosis or in too long a feedback period, which may render the test inefficient.

2.3.5 Ethics in Assessment


The term "ethics" refers to questions of right and wrong. When teachers think about ethics,
they need to ask themselves if it is right to assess a specific knowledge or investigate a
certain question. Are there some aspects of the teaching-learning situation that should not
be assessed? Here are some situations in which assessment may not be called for:
• Requiring students to answer a checklist of their sexual fantasies;
• Asking elementary pupils to answer sensitive questions without the consent of
their parents;
• Testing the mental abilities of pupils using an instrument whose validity and
reliability are unknown.
When a teacher thinks about ethics, the basic question to ask in this regard is "Will any
physical or psychological harm come to anyone as a result of the assessment or testing?"
Naturally, no teacher would want this to happen to any of his/her students.
Webster defines ethical (behavior) as 'conforming to the standards of conduct of a given
profession or group.' What teachers consider ethical is therefore largely a matter of
agreement among them. Perhaps, the most important ethical consideration of all is the
fundamental responsibility of a teacher to do all in his or her power to ensure that
participants in an assessment program are protected from physical or psychological harm,
discomfort or danger that may arise due to the testing procedure. For instance, a teacher
who wishes to test a student's physical endurance may ask students to climb a very steep
mountain, thus endangering them physically.

Test results and assessment results are confidential. Such results should be known only
by the student concerned and the teacher. Results should be communicated to the
students in such a way that other students would not be in possession of information
pertaining to any specific member of the class.

The third ethical issue in assessment is deception. Should students be deceived? There
are instances in which it is necessary to conceal the objective of the assessment from the
students in order to ensure fair and impartial results. When this is the case, the teacher
has a special responsibility to determine whether the use of such techniques is justified by
the educational value of the assessment, determine whether alternative procedures are
available that does not make use of concealment and ensure that students are provided
with sufficient explanation as soon as possible.

Finally, the temptation to assist certain individuals in class during assessment or testing is
ever present. In this case, it is best if the teacher does not administer the test himself if he
believes that such a concern may, at a later time, be considered unethical.

Exercises
A. Classify the cognitive objectives below in terms of Bloom’s taxonomy.
1. Identify the parts of a flower.
2. Enumerate the characteristics of a good test.
3. Determine the function of a predicate in a sentence.
4. Summarize the salient features of a good essay.
5. Use the concept of ratio and proportion in finding the height of a building.
6. Name the past presidents of the Philippines.
7. Determine the sufficiency of information given to solve a problem.
8. Identify the resulting product of a chemical reaction.
9. Select a course of action to be taken in the light of possible consequences.
10. Enumerate the parts of a cell.
B. A test may be reliable but not necessarily valid. Is it possible for a test to be valid
but not reliable? Discuss.
C. A 50 item test was administered to a group of 20 students. The mean score was 35
while standard deviation was 5.5. Compute the KR21 index of reliability.
D. Answer the following questions:
1. Ms. Plantilla developed an Achievement Test in Math for her grade three pupils. Before
she finalized the test she examined carefully if the test items were constructed based on
the competencies that have to be tested. What test of validity was she trying to establish?

a. Content-validity
b. Concurrent validity
c. Predictive validity
d. Construct validity
2. What type of validity does the Pre-board examination possess if its results can
explain how the students will likely perform in their licensure examination?
a. Concurrent
b. Predictive
c. Construct
d. Content
3. The students of Mrs. Valino are very noisy. To keep them busy, they were
given any test available in the classroom and then the results were graded as a
way to punish them. Which statement best explains if the practice is acceptable or
not?
a. The practice is acceptable because the students behaved well when
they were given a test.
b. The practice is not acceptable because it violates the principle of
reliability.
c. The practice is not acceptable because it violates the principle of
validity.
d. The practice is acceptable since the test results are graded.
4. Mr. Gringo tried to correlate the scores of his pupils in the Social studies test
with their grades in the same subject last 3rd quarter. What test validity is he trying
to establish?
a. Content validity
b. Construct validity
c. Concurrent validity
d. Criterion related validity
5. Which of the following situations may lower the validity of test?
a. Mrs. Josea increases the number of items measuring each specific skill
from three to five.
b. Mr. Santosa simplifies the language in the directions for the test.
c. Ms. Lopeza removes the items in the achievement test that everyone
would be able to answer correctly.
d. None of the above.

Chapter 3 – Measures of Central Tendency and Variability
At the end of this chapter, the students will be able to:

1. Explain the meaning and function of the measures of central tendency and
measures of dispersion/variability.
2. Distinguish among the measures of central tendency and measures of
variability/dispersion.
3. Explain the meaning of normal and skewed score distribution
4. Compute for the values of the different measures of central tendency and
measures of variability

3.1 Introduction
A measure of central tendency is a single value that attempts to describe a set of data
(like scores) by identifying the central position within that set of data or scores. As such,
measures of central tendency are sometimes called measures of central location.
Central Tendency refers to the center of a distribution of observations. Where do scores
tend to congregate? In a test of 100 items, where are most of the scores? Do they tend to
group around the mean score of 50 or 80?
There are three measures of central tendency – the mean, median and the mode.
Perhaps you are most familiar with the mean (often called the average). But there are two
other measures of central tendency, namely the median and the mode. Is there such a
thing as the best measure of central tendency?
If the measures of central tendency indicate where scores congregate, the measures of
variability indicate how spread out a group of scores is, how varied the scores are, or
how far they are from the mean. Common measures of dispersion or variability are range,
variance and standard deviation.

3.2 Measures of Central Tendency


3.2.1 Ungrouped data

The mean, median and mode are all valid measures of central tendency, but under different
conditions one measure becomes more appropriate than the others. For example, if the
data contain extremely high or extremely low scores, the median is a better measure of central
tendency since the mean is affected by such extreme scores.

Mean

The mean or average or arithmetic mean is the most popular and most well-known
measure of central tendency. The mean is equal to the sum of all the values in the data set
divided by the number of values in the data set. For example, 10 students in a Graduate
School class got the following scores in a 100-item test: 70, 72, 75, 77, 78, 80, 84, 87, 90
and 92.

The mean score of the group of 10 students is the sum of all their scores divided by 10.
The mean, therefore, is 805/10 equals 80.5.

80.5 is the average score of the group. There are 6 scores below the average score
(mean) of the group (70, 72, 75, 77, 78 and 80) and there are 4 scores above the mean of
the group (84, 87, 90 and 92).

The symbol we use for the mean is x̄ (read as x-bar).

When not to use the mean

The mean has one main disadvantage. It is particularly susceptible to the influence of
outliers. These are values that are unusual compared to the rest of the data set by being
especially small or large in numerical value.

For example, consider the scores of 10 Grade 12 students in a 100-item Statistics test
below:

5 38 56 60 67 70 73 78 79 95

The mean score for these ten Grade 12 students is 62.1. However, inspecting the raw data
suggests that this mean may not accurately reflect the score of the
typical Grade 12 student, as most of the students have scores in the 56 to 79 range. The mean is
being skewed by the extremely low and extremely high scores. Therefore, in this situation,
we would like to have a better measure of central tendency. As we will find out later, taking
the median would be a better measure of central tendency in this situation.
Median

The median is the middle score for a set of scores arranged from lowest to highest. The
median is less affected by extremely low and extremely high scores than the mean.

The symbol for the median is x̃ (read as x-tilde).

How do we find the median? Suppose we have the following data:

65 55 89 56 35 14 56 55 87 45 92

To determine the median, first we have to rearrange the scores into order of magnitude
(from smallest to largest).

14 35 45 55 55 56 56 65 87 89 92

Our median is the score at the middle of the distribution. In this case 56 is the middle
score. There are 5 scores before it and 5 scores after it. This works fine when you have an
odd number of scores, but what happens when you have an even number of scores? What
if you have 10 scores like the scores below?

65 55 89 56 35 14 56 55 87 45

Arrange that data according to order of magnitude (from smallest to largest) then take the
two middle scores (55 and 56) and compute the average of the two scores. The median is
55.5. This gives us a more reliable picture of the tendency of the scores.

Mode

This is the simplest both in concept and in application. By definition, the mode is referred
to as the most frequent value in the distribution. We shall use the symbol x̂ (read as
x-hat) to represent the mode.

Study the score distribution below:

14 35 45 55 55 56 56 65 84 89

There are two most frequent scores 55 and 56 so we have a score distribution with two
modes, hence a bimodal distribution.
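
For readers who want to check the ungrouped examples above by machine, here is a minimal sketch using Python's built-in statistics module on the same score sets.

```python
# Minimal sketch: mean, median and mode of the ungrouped examples above.
import statistics

scores = [70, 72, 75, 77, 78, 80, 84, 87, 90, 92]
print(statistics.mean(scores))          # 80.5

odd_count = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
print(statistics.median(odd_count))     # 56   (middle value of the 11 sorted scores)

even_count = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]
print(statistics.median(even_count))    # 55.5 (average of the two middle scores)

bimodal = [14, 35, 45, 55, 55, 56, 56, 65, 84, 89]
print(statistics.multimode(bimodal))    # [55, 56] -> two values tie, a bimodal distribution
```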

3.2.2 Grouped data

Mean

To compute the value of the mean of data presented in a frequency distribution, we will
consider the Midpoint Method.
In using the Midpoint Method, the midpoint of each class interval is taken as the
representative of each class. These midpoints are multiplied by their corresponding
frequencies. The products are added and the sum is divided by the total number of
frequencies. The value obtained is considered the mean of the grouped data. The
formula is:

x̄ = Σ(f·X) / n

where

f – represents the frequency of each class
X – the midpoint of each class
n – the total number of frequencies or sample size

To be able to apply the formula for the mean of grouped data, we shall follow the steps
below:

Step 1. Get the midpoint of each class


Step 2. Multiply each midpoint by its corresponding frequency
Step 3. Get the sum of the products in Step 2
Step 4. Divide the sum obtained in Step 3 by the total number of frequencies. The result shall
be rounded off to two decimal places.

Median

Just like the mean, the computation of the value of the median is done through
interpolation. The procedure requires the construction of the less than cumulative
frequency column (<cf). The first step in finding the value of the median is to divide the
total number of frequencies by 2. This is consistent with the definition of the median. The
value n/2 shall be used to determine the cumulative frequency before the median class,
denoted by cfb. cfb refers to the highest value under the <cf column that is less than
n/2. The median class refers to the interval that contains the median, that is, where the
n/2-th value is located. Hence, among the entries under the <cf column which are greater
than n/2, the smallest one belongs to the median class. If a distribution contains an
interval where the cumulative frequency is exactly n/2, the upper boundary of that class will
be the median and no interpolation is needed.
After identifying the median class, we shall approximate the position of the median within
the median class. This approximation shall be done by subtracting the value of cfb from
n/2. Then, the difference is divided by the frequency of the median class and multiplied by
the size of the class interval. The result is then added to the lower boundary of the median
class to get the median of the distribution.

The computing formula for the median for grouped data is given below.

x̃ = Lb + [(n/2 – cfb) / fm] × i

where

Lb – refers to the lower boundary of the median class
fm – the frequency of the median class
cfb – the less than cumulative frequency before the median class
i – the size of the class interval
n – the total number of frequencies or sample size

To be able to apply the formula for the median for grouped data, we shall follow the steps
below:
Step 1. Get n/2.
Step 2. Determine the value of cfb.
Step 3. Determine the median class.
Step 4. Determine the lower boundary and the frequency of the median class and the size
of the class interval.
Step 5. Substitute the values obtained in Steps 1 – 4 into the formula. Round off the final
result to two decimal places.

Mode

In the computation of the value of the mode for grouped data, it is necessary to identify the
class interval that contains the mode. This interval, called the modal class, contains the
highest frequency in the distribution. The next step after getting the modal class is to
determine the mode within the class. This value may be approximated by getting the
differences of the frequency of the modal class to the frequency before and to the frequency
after the modal class. If we let Δ1 be the difference of the frequency of the modal class and
the frequency of the interval preceding the modal class, and Δ2 be the difference of the
frequency of the modal class and the frequency of the interval after the modal class, then
the mode within the class shall be approximated using the expression:

[Δ1 / (Δ1 + Δ2)] × i

If this expression is added to the lower boundary of the modal class, then we can come up
with the computing formula for the value of the mode for grouped data. The formula is:

x̂ = Lb + [Δ1 / (Δ1 + Δ2)] × i

where Lb is the lower boundary of the modal class and i is the size of the class interval.

To be able to apply the formula for the mode for grouped data, we shall consider the
following steps:

Step 1. Determine the modal class
Step 2. Get the value of Δ1
Step 3. Get the value of Δ2
Step 4. Get the lower boundary of the modal class
Step 5. Apply the formula by substituting the values obtained in the preceding steps

Try this!

Find the mean, median and mode of the frequency table below.

Scores    f
11–22     3
23–34     5
35–46     11
47–58     19
59–70     14
71–82     6
83–94     2

To be able to compute the value of the mean, we shall follow the steps discussed earlier.

Step 1. Get the midpoint of each class. The midpoints (X) are shown in the 3rd column.

Scores    f    X
11–22     3    16.5
23–34     5    28.5
35–46     11   40.5
47–58     19   52.5
59–70     14   64.5
71–82     6    76.5
83–94     2    88.5

Step 2. Multiply each midpoint by its corresponding frequency. The products (f·X) are shown in
the 4th column.

Scores    f    X      f·X
11–22     3    16.5   49.5
23–34     5    28.5   142.5
35–46     11   40.5   445.5
47–58     19   52.5   997.5
59–70     14   64.5   903
71–82     6    76.5   459
83–94     2    88.5   177

Step 3. Get the sum of the products in Step 2.

Scores    f    X      f·X
11–22     3    16.5   49.5
23–34     5    28.5   142.5
35–46     11   40.5   445.5
47–58     19   52.5   997.5
59–70     14   64.5   903
71–82     6    76.5   459
83–94     2    88.5   177
          n = 60      Σ(f·X) = 3174

Step 4. Divide the result in Step 3 by the sample size. The result is the mean of the
distribution. Hence,

x̄ = Σ(f·X) / n = 3174 / 60 = 52.90

To compute for the median, we shall construct the less than cumulative frequency column.
We can use the same table we constructed when we solved for the mean.

Scores    f    X      f·X     <cf
11–22     3    16.5   49.5    3
23–34     5    28.5   142.5   8
35–46     11   40.5   445.5   19
47–58     19   52.5   997.5   38   Median class
59–70     14   64.5   903     52
71–82     6    76.5   459     58
83–94     2    88.5   177     60

Step 1. n/2 = 60/2 = 30

Step 2. cfb = 19

Step 3. Class interval i = 12

Step 4. Median class: 47–58, so Lb = 46.5 and fm = 19

Step 5.

x̃ = Lb + [(n/2 – cfb) / fm] × i

x̃ = 46.5 + [(30 – 19) / 19] × 12 = 46.5 + 6.95 = 53.45

To compute for the mode, we can still use the existing table.

Scores    f    X      f·X     <cf
11–22     3    16.5   49.5    3
23–34     5    28.5   142.5   8
35–46     11   40.5   445.5   19
47–58     19   52.5   997.5   38   Modal class
59–70     14   64.5   903     52
71–82     6    76.5   459     58
83–94     2    88.5   177     60

To get the values of Δ1 and Δ2, we have:

Δ1 = 19 – 11 = 8 and Δ2 = 19 – 14 = 5

Substituting these values into the formula, we have

x̂ = Lb + [Δ1 / (Δ1 + Δ2)] × i

x̂ = 46.5 + [8 / (8 + 5)] × 12 = 46.5 + 7.38 = 53.88
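
The whole grouped-data computation can be checked with a short script such as the one below, which encodes the frequency table above and applies the midpoint, median-interpolation and mode-interpolation formulas; the variable names are illustrative choices.

```python
# Minimal sketch: mean, median and mode of the grouped frequency table above.

classes = [(11, 22, 3), (23, 34, 5), (35, 46, 11), (47, 58, 19),
           (59, 70, 14), (71, 82, 6), (83, 94, 2)]   # (lower limit, upper limit, frequency)
n = sum(f for _, _, f in classes)
i = 12                                               # class interval size

# Mean: sum of (midpoint x frequency) divided by n.
mean = sum(((lo + hi) / 2) * f for lo, hi, f in classes) / n

# Median: interpolate within the class containing the n/2-th value.
cum = 0
for lo, hi, f in classes:
    if cum + f >= n / 2:
        median = (lo - 0.5) + ((n / 2 - cum) / f) * i   # lo - 0.5 is the lower class boundary
        break
    cum += f

# Mode: interpolate within the class with the highest frequency.
freqs = [f for _, _, f in classes]
m = freqs.index(max(freqs))
d1 = freqs[m] - (freqs[m - 1] if m > 0 else 0)
d2 = freqs[m] - (freqs[m + 1] if m + 1 < len(freqs) else 0)
mode = (classes[m][0] - 0.5) + (d1 / (d1 + d2)) * i

print(round(mean, 2), round(median, 2), round(mode, 2))   # 52.9 53.45 53.88
```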

3.2.3 Comparison

Although there are many types of averages, the three measures that were discussed are
considered the simplest and the most important of all.
In the case of the mean, the following are some of the observations that can be made.
a) The mean always exists in any distribution. This implies that for any set of data,
the mean can always be computed
b) The value of the mean in any distribution is unique. This implies that for any
distribution, there is only one possible value of the mean
c) In the computation for this measure, it takes into consideration all the values in
the distribution
In the case of the median, we have the following observations.
a) Like the mean, the median also exists in any distribution
b) The value of the median is also unique
c) This is a positional measure
For the third measure, the mode has the following characteristics.
a) It does not always exist
b) If the mode exists, it is not always unique
c) In determining the value of the mode, it does not take into account all the values
in the distribution
Skewness

Of the three measures of central tendency, the mean is considered the most important.
Since all values are considered in the computation, it can be used in higher statistical
treatment.
There are some instances, however, when the mean is not a good representative of a set
of data. This happens when a set of data contains extreme values either to the left or to
the right of the average. In this situation, the value of the mean is pulled to the direction of
these extreme values. Thus, the median should be used instead.
When a set of data is symmetric or normally distributed, the three measures are
identical or approximately equal. When the distribution is skewed, that is, either negatively
or positively skewed, the three averages diverge. In any case, however, the value of the
median will always be between the mode and the mean.
A set of data is said to be positively skewed when the graph of the distribution has a
longer tail to the right. The data is said to be negatively skewed when the longer tail is
at the left.

3.3 Measures of Variability
The measures of central tendency discussed earlier simply approximate the central value
of the distribution but such descriptions are not enough to be able to adequately describe
the characteristics of a set of data. Hence, there is a need to consider how the values are
scattered on either side of the center. Values used to determine the scatter of values in a
distribution are called measures of variation. We will discuss in this part the range, the
variance and the standard deviation.
3.3.1 Range
Among the measures of variation, the range is considered the simplest. The range is defined
as the difference between the highest and the lowest value in the distribution.
For example, if the lowest value in the distribution is 12 and the highest value is 125, then
the range is the difference between 125 and 12 which is 113. In symbols, if we let R be the
range, then
R=H–L
Where H – represents the highest value
L – represents the lowest value
In the case of grouped data, the difference between the highest upper class boundary and
the lowest lower class boundary is considered the range. The rationale is that the class
boundaries are considered the true limits.
The range, of course has some disadvantages. First, this value is always affected by
extreme values. Second, in the process of computing the value of the range, not all values
are considered. Thus, the range does not consider the variation of the items relative to the
central value of the distribution.
3.3.2 Variance
Variability can also be defined in terms of how close the scores in the distribution are to the
middle of the distribution. Using the mean as the measure of the middle of the distribution,
the variance is defined as the average squared difference of the scores from the mean.
The formula for the variance (σ²) of grouped data is given below:

σ² = Σ f(X – x̄)² / n

where

X – midpoint of each class interval
x̄ – mean
f – frequency of each class
n – sample size

To be able to apply the formula for the variance, we shall consider the steps below:
Step 1. Compute the value of the mean
Step 2. Determine the deviation (X – x̄) by subtracting the mean from the midpoint
of each class interval
Step 3. Square the deviations obtained in Step 2
Step 4. Multiply the frequencies by their corresponding squared deviations
Step 5. Add the results in Step 4
Step 6. Divide the result in Step 5 by the sample size

3.3.3 Standard Deviation


We are now going to consider one of the most important measures of variation – the
standard deviation. Recall that, in the computation of the variance, the deviation x – x̄ was
squared. This implies that the variance is expressed in square units. Extracting the square
root of the value of the variance will give the value of the standard deviation.
If we let σ (sigma) be the standard deviation, then

σ = √( Σ f(x – x̄)² / n )

or simply, the standard deviation is just the square root of the variance.

Try this!
Compute the Range, Variance and Standard Deviation of the example given earlier
(Computation of Measures of Central Tendency).

Range
R = H – L = 94 – 11 = 83

Variance
First, we will reproduce the frequency distribution. Applying the steps stated before, we
have

Scores  |  f |    x |     fx |  x – x̄ | (x – x̄)² | f(x – x̄)²
11 – 22 |  3 | 16.5 |   49.5 |  -36.4 |  1324.96 |   3974.88
23 – 34 |  5 | 28.5 |  142.5 |  -24.4 |   595.36 |   2976.80
35 – 46 | 11 | 40.5 |  445.5 |  -12.4 |   153.76 |   1691.36
47 – 58 | 19 | 52.5 |  997.5 |   -0.4 |     0.16 |      3.04
59 – 70 | 14 | 64.5 |  903.0 |   11.6 |   134.56 |   1883.84
71 – 82 |  6 | 76.5 |  459.0 |   23.6 |   556.96 |   3341.76
83 – 94 |  2 | 88.5 |  177.0 |   35.6 |  1267.36 |   2534.72
Total   | 60 |      | 3174.0 |        |          |  16406.40

Here x̄ = Σfx / n = 3174 / 60 = 52.9, so

s² = Σ f(x – x̄)² / n = 16406.40 / 60 = 273.44
Standard Deviation
It is just the square root of the variance, so

σ = √s² = √273.44 ≈ 16.54
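As a cross-check, the same frequency table can be fed to the grouped_variance sketch given after the six steps (again only an illustration, assuming that helper is available):

```python
# Reproducing the worked example: midpoints and frequencies from the table above.
import math

midpoints   = [16.5, 28.5, 40.5, 52.5, 64.5, 76.5, 88.5]
frequencies = [3, 5, 11, 19, 14, 6, 2]

variance = grouped_variance(midpoints, frequencies)
print(round(variance, 2))             # 273.44
print(round(math.sqrt(variance), 2))  # 16.54
```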

3.3.4 Sample Variance and Sample Standard Deviation
Sometimes, our data are only a sample of the whole population.
Example: Sam has 20 rose bushes, but only counted the flowers on 6 of them.
The population is all 20 rose bushes, and the sample is the 6 bushes that Sam counted
among the 20. If Sam's flower counts are 9, 4, 6, 13, 18, and 13, we can still
estimate the variance and standard deviation.
When we use the sample as an estimate of the whole population, the formula for the
variance changes to:

s² = Σ(x – x̄)² / (n – 1)

and the standard deviation formula becomes

s = √( Σ(x – x̄)² / (n – 1) )

Just remember that Standard Deviation will always be the square root of the Variance.
The important change in the formula is "n – 1" instead of "n" (which is called Bessel's
correction); the rest of the computation is carried out in the same way. The symbol also
changes to reflect that we are working on a sample instead of the whole population (σ is
replaced by s when using the sample standard deviation).
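As an illustration, Sam's counts can be checked with Python's built-in statistics module: variance() and stdev() already apply Bessel's correction, while pvariance() and pstdev() divide by n instead:

```python
# Sample vs. population formulas on Sam's flower counts.
import statistics

counts = [9, 4, 6, 13, 18, 13]

print(statistics.variance(counts))    # 26.7   (divides by n - 1)
print(statistics.stdev(counts))       # ~5.17
print(statistics.pvariance(counts))   # 22.25  (divides by n)
print(statistics.pstdev(counts))      # ~4.72
```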
Why take a sample?
Mostly because it is easier and cheaper. Imagine you want to know what the whole
university thinks. You cannot ask thousands of people, so instead you might ask only
about 300 people. Samuel Johnson once said, "You don't have to eat the whole ox to know
that the meat is tough."
More notes on Standard Deviation
The Standard Deviation is simply the square root of the variance. It is an especially useful
measure of variability when the distribution is normal or approximately normal because the
proportion of the distribution within a given number of standard deviations from the mean
can be calculated.
For example, approximately 68% of the distribution is within one standard deviation of the
mean and approximately 95% of the distribution is within two standard deviations of the mean.
Therefore, if you have a normal distribution with a mean of 50 and a standard deviation of
10, then 68% of the distribution would be between 50 – 10 = 40 and 50 + 10 = 60.
Similarly, about 95% of the distribution would be between 50 – (2 x 10) = 30 and 50 + (2 x
10) = 70. The symbol for the population standard deviation is σ.
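These proportions can also be verified numerically. The short sketch below (an illustration only, not part of the module) uses the error function, which gives the share of a normal distribution lying within k standard deviations of the mean:

```python
# Proportion of a normal distribution within k standard deviations: erf(k / sqrt(2)).
import math

def proportion_within(k):
    return math.erf(k / math.sqrt(2))

mean, sd = 50, 10
for k in (1, 2):
    low, high = mean - k * sd, mean + k * sd
    print(f"within {k} SD: {low} to {high}, about {proportion_within(k):.1%}")
# within 1 SD: 40 to 60, about 68.3%
# within 2 SD: 30 to 70, about 95.4%
```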
Standard deviation is a measure of dispersion: the more dispersed the data, the less
consistent they are. A lower standard deviation means that the data are more clustered
around the mean and hence the data set is more consistent.
Exercises
Find the mean, median, mode, range and standard deviation of the table below. Determine
also whether the distribution is normal, positively skewed or negatively skewed.

Scores

Chapter 4 – Performance-based Assessment
At the end of this chapter, the students will be able to:

1. Recall the stages of the psychomotor domain.


2. Describe process-oriented and product-oriented performance based assessment.
3. Write learning competencies based on a given task or topic.
4. Design a task.
5. Develop a scoring rubric for process-oriented and product-oriented performance-based
assessment.
6. Explain the GRASPS model.
4.1 Stages of the Psychomotor Domain

The psychomotor domain is characterized by progressive levels of behaviors from


observation to mastery of a physical skill. Several different taxonomies exist.

Simpson (1972) built this taxonomy on the work of Bloom and others:

 Perception - Sensory cues guide motor activity.


 Set - Mental, physical, and emotional dispositions that make one respond in a
certain way to a situation.
 Guided Response - First attempts at a physical skill. Trial and error coupled with
practice lead to better performance.
 Mechanism - The intermediate stage in learning a physical skill. Responses are
habitual with a medium level of assurance and proficiency.
 Complex Overt Response - Complex movements are possible with a minimum of
wasted effort and a high level of assurance they will be successful.
 Adaptation - Movements can be modified for special situations.
 Origination - New movements can be created for special situations.

Dave (1970) developed this taxonomy:

 Imitation - Observing and copying someone else.


 Manipulation - Guided via instruction to perform a skill.
 Precision - Accuracy, proportion and exactness exist in the skill performance
without the presence of the original source.

 Articulation - Two or more skills combined, sequenced, and performed
consistently.
 Naturalization - Two or more skills combined, sequenced, and performed
consistently and with ease. The performance is automatic with little physical or
mental exertion.

4.2 Process-Oriented Performance based Assessment


4.2.1 Process-Oriented Learning Competencies

Information about outcomes is of high importance; where students ―end up‖ matters
greatly. But to improve outcomes, we need to know about student experience along the
way – about the curricula, teaching, and kind of student effort that lead to particular
outcomes.

Assessment can help us understand which students learn best under what conditions; with
such knowledge comes the capacity to improve the whole of their learning. Process-
oriented performance-based assessment is concerned with the actual task
performance rather than the output or product of the activity.

Learning Competencies

The learning objectives in process-oriented performance based assessment are stated in


directly observable behaviors of the students. Competencies are defined as groups or
clusters of skills and abilities needed for a particular task. The objectives generally
focus on those behaviors which exemplify a ―best practice‖ for the particular task. Such
behaviors range from a ―beginner‖ or novice level up to the level of an expert. An example
of learning competencies for a process-oriented performance based assessment is given
below:

Task: Recite a Poem by Edgar Allan Poe, ―The Raven‖

Objectives: The activity aims to enable the students to recite the poem "The Raven"
by Edgar Allan Poe. Specifically, the students should be able to:

1. Recite the poem from memory without referring to notes;


2. Use appropriate hand and body gestures in delivering the piece;
3. Maintain eye contact with the audience while reciting the poem;
4. Create the ambiance of the poem through appropriate rising and falling intonation;
5. Pronounce the words clearly and with proper diction.

Notice that the objective starts with a general statement of what is expected of the student
from the task and then breaks down the general objective into easily observable behaviors
when reciting a poem. The specific objectives identified constituted the learning
competencies for this particular task. As in the statement of objectives using Bloom‘s
taxonomy, the specific objectives also range from simple observable processes to more
complex observable processes e.g. creating an ambiance of the poem through appropriate
rising and falling intonation. A competency is said to be more complex when it consists of
two or more skills.

The following competencies are simple competencies:

 Speak with a well-modulated voice;


 Draw a straight line from one point to another point;
 Color a leaf with a green crayon.

The following competencies are more complex competencies:

 Recite a poem with feeling using appropriate voice quality, facial expressions and
hand gestures;
 Construct an equilateral triangle given three non-collinear points;
 Draw and color a leaf with green crayon.

4.2.2 Task Designing

Learning tasks need to be carefully planned. In particular, the teacher must ensure that the
particular learning process to be observed contributes to the overall understanding of the
subject or course. Some generally accepted standards for designing a task include:

 Identifying an activity that would highlight the competencies to be evaluated e.g.


reciting a poem, writing an essay, manipulating the microscope.

 Identifying an activity that would entail more or less the same sets of
competencies. If an activity would result in too many possible competencies then
the teacher would have difficulty assessing the student‘s competency on the task.
 Finding a task that would be interesting and enjoyable for the students. Tasks
such as writing an essay are often boring and cumbersome for the students.

For example:

Topic: Understanding Biological Diversity

Possible Task Design:

Bring the students to a pond or creek. Ask them to find as many living organisms as they
can near the pond or creek. Also, bring them to the school playground to find as many
living organisms as they can. Observe how the students will develop a system for finding
such organisms, classifying the organisms and concluding the differences in biological
diversity of the two sites.

4.2.3 Scoring Rubrics

A rubric is a scoring scale used to assess student performance along a task-specific set of
criteria. Authentic assessments are typically criterion-referenced measures, that is, a
student‘s aptitude on a task is determined by matching the student‘s performance against
a set of criteria to determine the degree to which the student‘s performance meets the
criteria for the task. To measure student performance against a pre-determined set of
criteria, a rubric, or scoring scale which contains the essential criteria for the task and
appropriate levels of performance for each criterion is typically created. For example, the
following rubric covers the recitation portion of a task in English.

Recitation Rubric

Criteria | 1 | 2 | 3
Number of appropriate hand gestures | 1 – 4 | 5 – 9 | 10 – 12
Appropriate facial expression | Lots of inappropriate facial expressions | Few inappropriate facial expressions | No apparent inappropriate facial expressions
Voice inflection | Monotone voice used | Can vary voice inflection with difficulty | Can easily vary voice inflection
Incorporates proper ambiance through feelings in the voice | Recitation contains very little feeling | Recitation has some feelings | Recitation fully captures the ambiance through feelings in the voice

As in the above example, a rubric comprises two components: criteria and levels of
performance. Each rubric has at least two criteria and at least two levels of performance.
The criteria, characteristics of good performance on a task, are listed in the left column in
the rubric above. Actually, as is common in rubrics, a shorthand is used for each criterion
to make it fit easily into the table. The full criteria are statements of performance such as
―include a sufficient number of hand gestures‖ and ―recitation captures the ambiance
through appropriate feelings and tone in the voice‖.

For each criterion, the evaluator applying the rubric can determine to what degree the
student has met the criterion, i.e., the level of performance. In the above rubric, there are
three levels of performance for each criterion. For example, the recitation can contain lots
of inappropriate, few inappropriate or no inappropriate hand gestures.

Finally, the rubric above contains a mechanism for assigning a score to each performance:
a weight is assigned to each criterion. Students can receive 1, 2, or 3 points for "number of
appropriate hand gestures." But appropriate ambiance, more important in the teacher's mind,
is weighted three times (x3) as heavily. So, students can receive 3, 6, or 9 points (i.e., 1, 2,
or 3 times 3) for the level of appropriateness in this task.
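As a sketch only (the weights for facial expression and voice inflection below are assumptions chosen for illustration; only the hand-gesture and ambiance weights come from the discussion above), a weighted analytic total can be computed like this:

```python
# Computing a weighted analytic rubric score: each criterion earns a level of
# 1, 2, or 3, which is multiplied by the criterion's weight and summed.

WEIGHTS = {
    "hand gestures": 1,       # 1, 2, or 3 points
    "facial expression": 1,   # assumed weight
    "voice inflection": 1,    # assumed weight
    "ambiance": 3,            # weighted three times as heavily: 3, 6, or 9 points
}

def rubric_total(levels):
    """levels maps each criterion to the performance level earned (1-3)."""
    return sum(WEIGHTS[criterion] * level for criterion, level in levels.items())

student = {"hand gestures": 3, "facial expression": 2,
           "voice inflection": 2, "ambiance": 3}
print(rubric_total(student))  # 1*3 + 1*2 + 1*2 + 3*3 = 16
```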

Descriptors

The above rubric includes another common, but not a necessary, component of rubrics —
descriptors. Descriptors spell out what is expected of students at each level of
performance for each criterion. In the above example, "lots of inappropriate facial
expressions" and "monotone voice used" are descriptors. A descriptor tells students more
precisely what performance looks like at each level and how their work may be
distinguished from the work of others for each criterion. Similarly, the descriptors help the
teacher more precisely and consistently distinguish between student works.

Why do we need to include levels of performance?

1. Clearer expectations
2. More consistent and objective assessment
3. Better feedback

Analytic vs. Holistic Rubrics

For a particular task you assign students, do you want to be able to assess how well the
students perform on each criterion, or do you want to get a more global picture of the
students‘ performance on the entire task? The answer to that question is likely to
determine the type of rubric you choose to create or use — analytic or holistic.

Analytic rubric
 Articulates levels of performance for each criterion so the teacher can assess
student performance on each criterion.
 For performances that involve a larger number of criteria.

Holistic rubric
 Does not list separate levels of performance for each criterion.
 Assigns a level of performance by assessing performance across multiple criteria
as a whole.
 For gross or quick judgment.

Below is an example of a holistic rubric:

Recitation Rubric

3 – Excellent Speaker

 Included 10-12 changes in hand gestures


 No apparent inappropriate facial expressions
 Utilizes proper voice inflection
 Can create proper ambiance for the poem
2 – Good Speaker

 Included 5-9 changes in hand gestures


 Few inappropriate facial expressions
 Have some inappropriate voice inflection
changes
 Almost creating proper ambiance
1 – Poor Speaker

 Included 1-4 changes in hand gestures


 Lots of inappropriate facial expressions
 Uses monotone voice
 Cannot create proper ambiance

How many levels of performance should a teacher include in his/her rubric?

There is no specific number of levels a rubric should or should not possess. It will vary
depending on the task and your needs. A rubric can have as few as two levels of
performance as long as it is appropriate. Also, it is not true that there must be an even
number or an odd number of levels. Again, that will depend on the situation.

Generally, it is better to start with a smaller number of levels of performance for a criterion
and then expand, if necessary. Making distinctions in student performance across two or
three broad categories is difficult enough. As the number of levels increases and those
judgments become finer and finer, the likelihood of error increases. Thus, start small. For
example, in an oral presentation rubric, amount of eye contact might be an important
criterion. Performance on that criterion could be judged along three levels of performance:
never, sometimes, always.

Makes eye contact with audience:    never    |    sometimes    |    always

Although these three levels may not capture all the variation in student performance on the
criterion, it may be sufficient discrimination for your purposes. Or, at the least, it is a place
to start. Upon applying the three levels of performance, you might discover that you can
effectively group your students‘ performance in these three categories. Furthermore, you
might discover that the labels of never, sometimes and always sufficiently communicate to
your students the degree to which they can improve on making eye contact.

On the other hand, after applying the rubric you might discover that you cannot effectively
discriminate among student performance with just three levels of performance. Perhaps, in
your view, many students fall in between never and sometimes, or between sometimes
and always, or neither label accurately captures their performance. So, at this point, you
may decide to expand the number of levels of performance to include never, rarely,
sometimes, usually and always.

Makes eye contact:    never    |    rarely    |    sometimes    |    usually    |    always

There is no ―right‖ answer as to how many levels of performance there should be for a
criterion in an analytic rubric; that will depend on the nature of the task assigned, the
criteria being evaluated, the students involved and your purposes and preferences. For
example, another teacher might decide to leave off the ―always‖ level in the above rubric
because ―usually‖ is as much as normally can be expected or even wanted in some
instances. Thus, the ―makes eye contact‖ portion of the rubric for that teacher might be:

Makes eye contact:    never    |    rarely    |    sometimes    |    usually

It is recommended that fewer levels of performance be included initially because a
smaller rubric is:

 Easier and quicker to administer


 Easier to explain to students (and others)
 Easier to expand than larger rubrics are to shrink

Exercises 4.2

A. For each of the following tasks, identify at least three (3) process-oriented learning
competencies:
1. Constructing an angle bisector using a straight edge and a compass
2. Constructing three-dimensional models of solids from cardboards
3. Role playing to illustrate the concept of Filipino family values
B. Choose any five activities below and then construct your own scoring rubrics.
1. Use evidence to solve a mystery.
2. Devise a game.
3. Participate in a debate.
4. Infer the main idea of a written piece.
5. Draw a picture that illustrates what‘s described in a story or article. Explain
what you have drawn, using details from the story or article.
6. Write a research paper.
7. Apply a scoring rubric to a real or simulated piece of student work.
8. Write an outline of a text or oral report.
9. Propose and justify a way to resolve a problem.
10. Design a museum exhibit.
11. Develop a classification scheme for something and explain and justify the
categories.
12. Justify one point of view on an issue and then justify the opposing view.
13. Given background information, predict what will happen if ____________.
14. Evaluate the quality of a writer‘s arguments.
15. Draw conclusions from a text.

4.3 Product-Oriented Performance based Assessment


The role of assessment in teaching happens to be a hot issue in education today.
This has led to an increasing interest in ―performance-based education.‖ Performance-
based education poses a challenge for teachers to design instruction that is task-oriented.
The trend is based on the premise that learning needs to be connected to the lives of the
students through relevant tasks that focus on student‘s ability to use their knowledge and
skills in meaningful ways. In this case, performance-based tasks require performance-
based assessments, such as a completed project or work that demonstrates levels
of task achievement. At times, performance-based assessment has been used
interchangeably with ―authentic assessment‖ and ―alternative assessment.‖ In all cases,
performance-based assessment has led to the use of a variety of alternative ways of
evaluating student progress (journals, checklists, portfolios, projects, rubrics, etc.) as
compared to more traditional methods of measurement (paper-and-pencil testing).

4.3.1 Product-Oriented Learning Competencies

Product-oriented performance-based assessment is a kind of assessment
wherein the assessor views and scores the final product made, not the actual
process of making that product. It is concerned with the product rather than the
process, and it focuses on the achievement of the learner.

Student performances can be defined as targeted tasks that lead to a product or


overall learning outcome. Products can include a wide range of student works that target
specific skills. Some examples include communication skills such as those demonstrated
in reading, writing, speaking, and listening, or psychomotor skills requiring physical abilities
to perform a given task. Target tasks can also include behavior expectations targeting
complex tasks that students are expected to achieve. Using rubrics is one way that
teachers can evaluate or assess student performance or proficiency in any given task as it
relates to a final product or learning outcome. Thus, rubrics can provide valuable
information about the degree to which a student has achieved a defined learning outcome
based on specific criteria that defined the framework for evaluation.

The learning competencies associated with products or outputs are linked with an
assessment of the level of ―expertise‖ manifested by the product. Thus, product-oriented
learning competencies target at least three (3) levels: novice or beginner‘s level, skilled
level, and expert level. Such levels correspond to Bloom‘s taxonomy in the cognitive
domain in that they represent progressively higher levels of complexity in the thinking
processes.

There are other ways to state product-oriented learning competencies. For


instance, we can define learning competencies for products or outputs in the following
way:

 Level 1: Does the finished product or project illustrate the minimum expected parts
or functions? (Beginner)
 Level 2: Does the finished product or project contain additional parts and functions
on top of the minimum requirements which tend to enhance the final output?
(Skilled level)
 Level 3: Does the finished product contain the basic minimum parts and functions,
have additional features on top of the minimum, and is aesthetically pleasing?
(Expert level)

Examples:

The desired product is a representation of a cubic prism made out of cardboard in an


elementary geometry class.

Learning Competencies: The final product submitted by the students must:

1. Possess the correct dimensions (5‖ x 5‖ x 5‖) – (minimum specifications)


2. Be sturdy, made of durable cardboard and properly fastened together – (skilled
specifications)
3. Be pleasing to the observer, preferably properly colored for aesthetic purposes –
(expert level)

The desired product is a scrapbook illustrating the historical event called EDSA I People
Power

Learning Competencies: The scrapbook presented by the students must:

1. Contain pictures, newspaper clippings and other illustrations for the main
characters of EDSA I People Power namely: Corazon C. Aquino, Fidel V. Ramos,
Juan Ponce Enrile, Ferdinand E. Marcos, Cardinal Sin. – (minimum specifications)
2. Contain remarks and captions for the illustrations made by the student himself for
the roles played by the characters of EDSA 1 People Power – (skilled level)
3. Be presentable, complete, informative and pleasing to the reader of the
scrapbook. – (expert level)

Performance-based assessment for products and projects can also be used for
assessing outputs of short-term tasks such as the one illustrated below for outputs in a
typing class:

The desired output consists of the output in a typing class

Learning Competencies: The final typing outputs of the students must:

1. Possess no more than five (5) errors in spelling – (minimum specifications)


2. Possess no more than 5 errors in spelling while observing proper format based on
the document to be typewritten – (skilled level)
3. Possess no more than 5 errors in spelling, has the proper format, and is readable
and presentable – (expert level).

Notice that in all of the above examples, product-oriented performance-based


learning competencies are evidence-based. The teacher needs concrete evidence that
the student has achieved a certain level of competence based on submitted products and
projects.

4.3.2 Task Designing

How should a teacher design a task for product-oriented performance based


assessment? The design of the task in this context depends on what the teacher desires to
observe as outputs of the students. The concepts that may be associated with task
designing include:

a. Complexity. The level of complexity of the project needs to be within the range of
ability of the students. Projects that are too simple tend to be uninteresting for the
students while projects that are too complicated will most likely frustrate them.
b. Appeal. The project or activity must be appealing to the students. It should be
interesting enough so that students are encouraged to pursue the task to
completion. It should lead to self-discovery of information by the students.
c. Creativity. The project needs to encourage students to exercise creativity and
divergent thinking. Given the same set of materials and project inputs, how does
one best present the project? It should lead the students into exploring the various
possible ways of presenting the final output.
d. Goal-Based. Finally, the teacher must bear in mind that the project is produced in
order to attain a learning objective. Thus, projects are assigned to students not just
for the sake of producing something but for the purpose of reinforcing learning.

Example: Paper folding is a traditional Japanese art. However, it can be used as an


activity to teach the concept of plane and solid figures in geometry. Provide the
students with a given number of colored papers and ask them to construct as many
plane and solid figures from these papers without cutting them (by paper folding only).

4.3.3 Scoring Rubrics

Scoring rubrics are descriptive scoring schemes that are developed by teachers
or other evaluators to guide the analysis of the products or processes of students‘ efforts
(Brookhart, 1999). Scoring rubrics are typically employed when a judgment of quality is
required and may be used to evaluate a broad range of subjects and activities. For
instance, scoring rubrics can be most useful in grading essays or in evaluating projects
such as scrapbooks. Judgments concerning the quality of a given writing sample may vary
depending upon the criteria established by the individual evaluator. One evaluator may
heavily weigh the evaluation process upon the linguistic structure, while another evaluator
may be more interested in the persuasiveness of the argument. A high quality essay is
likely to have a combination of these and other factors. By developing a pre-defined
scheme for the evaluation process, the evaluation of an essay becomes less subjective
and more objective.

Criteria Setting

The criteria for a scoring rubric are statements that identify "what really counts" in
the final output. The following are the most often used major criteria for product
assessment:

 Quality
 Creativity
 Comprehensiveness
 Accuracy
 Aesthetics

From the major criteria, the next task is to identify substatements that would make the
major criteria more focused and objective. For instance, if we were scoring an essay on:
―Three Hundred Years of Spanish Rule in the Philippines‖, the major criterion ―Quality‖ may
possess the following substatements:

 Interrelates the chronological events in an interesting manner


 Identifies the key players in each period of the Spanish rule and the roles that
they played
 Succeeds in relating the history of Philippine Spanish rule (rated as
Professional, Not quite professional, and Novice)

The example below displays a scoring rubric that was developed to aid in the evaluation of
essays written by college students in the classroom (based loosely on Leydens &
Thompson, 1997). The scoring rubrics in this particular example exemplify what is called a
―holistic scoring rubric‖. It will be noted that each score category describes the
characteristics of a response that would receive the respective score. Describing the
characteristics of responses for each score category makes it more likely that different
evaluators will assign the same score to a given response. In effect, this increases the
objectivity of the assessment procedure using
rubrics. In the language of test and measurement, we are actually increasing the ―inter-
rater reliability‖.

Example of a scoring rubric designed to evaluate college writing samples.

 Major Criterion: Meets Expectations for a first Draft of a Professional report


Substatements:

 The document can be easily followed. A combination of the following are apparent
in the document:
1. Effective transitions are used throughout.
2. A professional format is used.
3. The graphics are descriptive and clearly support the document‘s purpose.
 The document is clear and concise and appropriate grammar is used throughout.
*Adequate
 The document can be easily followed. A combination of the following are apparent
in the document:
1. Basic transitions are used.
2. A structured format is used.
3. Some supporting graphics are provided, but are not clearly explained
 The document contains minimal distractions that appear in a combination of the
following forms:
1. Flow in thought
2. Graphical presentations
3. Grammar/mechanics
*Needs Improvement

 Organization of document is difficult to follow due to a combination of following:


1. Inadequate transitions
2. Rambling format
3. Insufficient or irrelevant information
4. Ambiguous graphics
 The document contains numerous distractions that appear in the combination of
the following forms:
1. Flow in thought
2. Graphical presentations
3. Grammar/mechanics
*Inadequate

 There appears to be no organization of the document‘s contents.


 Sentences are difficult to read and understand.

When are scoring rubrics an appropriate evaluation technique?

Grading essays is just one example of performances that may be evaluated using
scoring rubrics. There are many other instances in which scoring rubrics may be used
successfully: evaluating group activities, extended projects, and oral presentations. Also,
rubric scoring cuts across disciplines and subject matter, for it is equally appropriate
to the English, Mathematics, and Science classrooms. Where and when a scoring rubric is
used does not depend on the grade level or subject, but rather on the purpose of the
assessment.

Other Methods

Authentic assessment schemes apart from scoring rubrics exist in the arsenal of a
teacher. For example, checklists may be used rather than scoring rubrics in the evaluation
of essays. Checklists enumerate a set of desirable characteristics which are actually
observed. As such, checklists are an appropriate choice for evaluation when the
information that is sought is limited to the determination of whether specific criteria have
been met. On the other hand, scoring rubrics are based on descriptive scales and support
the evaluation of the extent to which criteria have been met.

The ultimate consideration in using a scoring rubric for assessment is really the
―purpose of the assessment.‖ Scoring rubrics provide at least two benefits in the evaluation
process. First, they support the examination of the extent to which the specified criteria
have been reached. Second, they provide feedback to students concerning how to
improve their performances. If these benefits are consistent with the purpose of the
assessment, then a scoring rubric is likely to be an appropriate evaluation technique.

General versus Task-Specific

In the development of scoring rubrics, it is well to bear in mind that it can be used
to assess or evaluate specific tasks or general or broad category of tasks. For instance,
suppose that we are interested in assessing the student‘s oral communication skills. Then,
a general scoring rubric may be developed and used to evaluate each of the oral
presentations given by that student. After each such oral presentation of the students, the
general scoring rubrics is shown to the students which then allows them to improve on
their previous performances. Scoring rubrics have this advantage of instantaneously
providing a mechanism for immediate feedback.

In contrast, suppose now that the main purpose of the oral presentation is to determine the
students' knowledge of the facts surrounding the EDSA I revolution, then perhaps a
specific scoring rubric would be necessary. A general scoring rubric for evaluating a
sequence of presentations may not be adequate since, in general, events such as EDSA I
(and EDSA II) differ on the surrounding factors (what caused the revolutions) and the
ultimate outcomes of these events. Thus, to evaluate the students' knowledge of these
events, it will be necessary to develop a specific scoring rubric for each presentation.

Process of Developing Scoring Rubrics

The development of scoring rubrics goes through a process. The first step in the
process entails the identification of the qualities and attributes that the teacher wishes
to observe in the students‘ outputs that would demonstrate their level of proficiency
(Brookhart, 1999). These qualities and attributes form the top level of the scoring criteria
for the rubrics. Once done, a decision has to be made whether a holistic or an analytic
rubric would be more appropriate. In an analytic scoring rubric, each criterion is
considered one by one and the descriptions of the scoring levels are made separately.
This will then result in separate descriptive scoring schemes for each of the criterion or
scoring factor. On the other hand, for holistic scoring rubrics, the collection of criteria is
considered throughout the construction of each level of the scoring rubric and the result is
a single descriptive scoring scheme.

The next step after defining the criteria for the top level of performance is the
identification and definition of the criteria for lowest level of performance. In other
words, the teacher is asked to determine the type of performance that would constitute the
worst performance or a performance which would indicate lack of understanding of the
concepts being measured. The underlying reason for this step is for the teacher to capture
the criteria that would suit a middle level performance for the concept being measured. In
particular, therefore, the approach suggested would result in at least three levels of
performance.

It is of course possible to make greater and greater distinctions between


performances. For instance, we can compare the middle level performance expectations
with the best performance criteria and come up with an above average performance
criteria; between the middle level performance expectations and the worst level of
performance to come up with a slightly below average performance criteria and so on. This
comparison process can be used until the desired number of score levels is reached or
until no further distinctions can be made. If meaningful distinctions between the score
categories cannot be made, then additional score categories should not be created
(Brookhart, 1999). It is better to have a few meaningful score categories than to have
many score categories that are difficult or impossible to distinguish.

A note of caution: it is suggested that each score category should be defined using
descriptors of the work rather than value-judgment about the work (Brookhart, 1999). For
example, ―Student‘s sentences contain no errors in subject-verb agreements,‖ is preferable
over, ―Student‘s sentences are good.‖ The phrase ―are good‖ requires the evaluator to
make a judgment whereas the phrase ―no errors‖ is quantifiable. Finally, we can test
whether our scoring rubric is ―reliable‖ by asking two or more teachers to score the
same set of projects or outputs and correlate their individual assessments. High
correlations between the raters imply high interrater reliability. If scores assigned by
teachers differ greatly, then such would suggest a way to refine the scoring rubrics so that
they would mean the same thing to different scorers.
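As a rough sketch of that reliability check (the scores below are invented for illustration), two teachers' ratings of the same six outputs can be correlated; a Pearson r close to 1 indicates high inter-rater reliability. Note that statistics.correlation requires Python 3.10 or later:

```python
# Correlating two raters' scores on the same set of outputs.
import statistics

teacher_a = [18, 15, 12, 20, 9, 16]
teacher_b = [17, 14, 13, 19, 10, 15]

r = statistics.correlation(teacher_a, teacher_b)
print(round(r, 2))  # about 0.99 -> the two raters rank the outputs very similarly
```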

Exercises 4.3

A. Design a project or task for each of the following learning objectives:


1. Analyze the events leading to Rizal‘s martyrdom.
2. Differentiate between monocotyledon and dicotyledon.
3. Find an approximate value of the gravitational constant .
4. Illustrate the concept of ―diffusion‖.
5. Illustrate the concept of ―osmosis‖.
6. Illustrate the cultural diversity in the Philippines.
7. Identify similarities and differences of at least two major dialects in the
Philippines.
B. Differentiate process-oriented and product-oriented performance based
assessment.
C. Differentiate general and specific task oriented scoring rubrics.
D. What factors determine the use of a scoring rubric over other authentic
assessment procedures? Discuss.
E. Identify and describe the process of developing scoring rubrics for product-
oriented performance-based assessment.
F. For each of the following, develop a scoring rubric:
1. Essay on ―Why Jose Rizal should be the national hero‖
2. Essay on ―Should the power industry be deregulated?‖
3. Oral presentation of the piece ―Land of Bondage, Land of the Free‖
4. Oral presentation of the piece ―Rhyme of the Ancient Mariner‖
5. Scrapbook on ―EDSA I revolution‖
6. Group activity on ―geometric Shapes through Paper Folding‖
7. Specimen preservation on a biological diversity class
8. Evaluating an output of a typing class
9. Writing a short computer program on ―Roots of a quadratic equation‖
10. Evaluating kinder piano performance

4.4 GRASPS Model


Why do we give a performance task to our students?

Performance tasks address students' need to work more independently and encourage
them to pay attention to the quality of their work. They also enable the teacher to efficiently
provide students with information on the strengths and weaknesses of their work.

What makes a performance task to be AUTHENTIC?

According to McTighe and Wiggins (2004), a performance task is authentic if it "reflects the
way in which people in the world outside of school must use knowledge and skills to
address various situations where expertise is challenged."

Descriptors of Authentic performance task

 What is done in the world


 Address realistic problems
 Have realistic options
 A genuine purpose

Designing and constructing authentic performance tasks can be tricky, but Wiggins and
McTighe's GRASPS model is an excellent starting point.

GRASPS Model

The GRASPS Model is an authentic assessment design model that helps you develop
authentic performance tasks, project units and/or inquiry lessons.

There are six parts to the G.R.A.S.P.S model:

a. Goal – the goal provides the student with the outcome of the learning experience
and the contextual purpose of the experience and product creation.
b. Role – the role is meant to provide the student with the position or individual
persona that they will become to accomplish the goal of the performance task. The
majority of roles found within the tasks provide opportunities for students to
complete real-world applications of standards-based content.
c. Audience – the audience is the individual(s) who are interested in the findings and
products that have been created. These people will make a decision based upon
the products and presentations created by the individual(s) assuming the role
within the performance task.
d. Situation – the situation provides the participants with a contextual background for
the task. Students will learn about the real-world application for the performance
task
e. Performance or Product – the products within each task are designed using the
multiple intelligences. The products provide various opportunities for students to
demonstrate understanding. Based upon each individual learner and/or individual
class, the educator can make appropriate instructional decisions for product
development.
f. Standard or Expectation – provides the student with a clear picture of success,
identifies specific standards of success, and issues rubrics to the students or develops
them with the students.
These six parts come together to form an authentic assessment that includes an essential
question to share with the student.

Example:

You are a member of a team of scientists investigating deforestation of the Papua New
Guinean rainforests. You are responsible for gathering scientific data (including visual
evidence such as photos) and producing a scientific report in which you summarize current
conditions, possible future trends and the implications for both Papua New Guinea and its
broader influence on our planet. Your report, which you will present to a United Nations
subcommittee, should include detailed and fully supported recommendations for an action
plan that are clear and complete.

G – The goal (within the scenario) is to determine current deforestation conditions and
possible future trends
R – Student is a member of a team of investigative scientists

A – The target audience is the United Nations subcommittee

S – The scenario: inform the United Nations subcommittee of the effects of deforestation
on the Papua New Guinean rain forest and convince them to follow the recommended
action plan.

P – The product is a clear and complete action plan

S – The standards by which the product will be judged are detailed and fully supported
recommendations in an action plan that is both clear and complete.

Exercises 4.4
A. Explain the GRASPS model.
B. Use one of the sentence starters from each letter to help you write your task. Once
you have your sentences, then write it up as a task.

References:
Bolaños, A. B. (1997). Probability and Statistical Concepts : An Introduction. Manila: REX
Book Store

Brookhart, S. M. (1999). The Art and Science of Classroom Assessment: The Missing Part
of Pedagogy. ASHE-ERIC Higher Education Report (Vol. 27, No. 1). Washington, DC: The
George Washington University, Graduate School of Education and Human Development.

De Guzman-Santos, R. (2007). Advanced Methods in Educational Assessment and
Evaluation: Assessment of Learning 2. Quezon City, Metro Manila: Lorimar Publishing, Inc.

Navarro, R. L., et al. (2017). Assessment of Learning 1. Quezon City, Metro Manila:
Lorimar Publishing, Inc.
