Assessment in Mathematics
(COURSE OUTLINE)
COURSE CODE:
MATE 464
COURSE TITLE:
ASSESSMENT IN MATHEMATICS
COURSE OUTLINE
LECTURER:
Isaac Owusu-Darko
[IOD]
2021
EDST 464: ASSESSMENT IN MATHEMATICS
COURSE TITLE: ASSESSMENT IN MATHEMATICS
COURSE CODE: MATE 464
LECTURER: ISAAC OWUSU-DARKO,
MPHIL MATHEMATICS (APPLIED MATHEMATICS- STATISTICS); M.ED (MATHEMATICS
EDUCATION); BED(MATHEMATICS EDUCATION), DIP. (BASIC EDUCATION) ‘A’-3YR POSTSEC
Course Description
The course is designed to assess the behaviours of students in terms of performance in order to
identify the strengths and weaknesses that may help in the decision-making process. The assessment
is based on the profile dimensions – knowledge and understanding, which are acquired through the
receptive skills of listening and reading, and the use of knowledge, which is acquired through the
productive skills of writing and speaking. Both the formative and summative types of assessment
will be covered. The criterion-referenced testing procedure will be used in the area of class tests,
class assignments, homework and projects (practical and investigative study) more frequently than
the norm-referenced testing procedures. In the construction of tests, the test purpose, content
specification, test development, etc. will be covered.
Course Objective:
By the end of the course, students will be able to apply the basic concepts of assessment
techniques, which are essential for practical classroom assessment procedures and for further
studies in Mathematics, and to apply them in classroom test planning, formal and informal
assessment, continuous assessment, formative and summative assessment procedures, and the
analysis of test results using educational statistics and information technology.
Course Requirements
• You are to revise your course content on measurement and evaluation, as well as the
general educational assessment course already introduced.
• Any assignment not submitted by the date specified will not be accepted.
• Students should switch off their mobile phones or put them on silent during lectures.
• Every student should be present for every class lecture, test, etc.
Evaluation
• Assignments ……………………………..10%
• Quizzes………………………………………10%
• Mid- Semester Examination ……. 20%
• End of Semester Examination …… 60%
Total 100%
Grading System
Grades will be assigned as follows
A = 80 – 100     C+ = 56 – 60
A- = 75 – 79     C = 50 – 55
B+ = 70 – 74     C- = 45 – 49
B = 65 – 69      D = 40 – 44
B- = 61 – 64     F = 0 – 39
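As a minimal illustrative sketch (not part of the official syllabus), the evaluation weights and grade bands above can be combined programmatically. The component names and sample scores below are assumptions for illustration only.

```python
# Weighted total per the evaluation scheme above:
# assignments 10%, quizzes 10%, mid-semester 20%, end-of-semester 60%.
def final_score(assignments, quizzes, mid_sem, end_sem):
    """Each component is a score out of 100."""
    return 0.10 * assignments + 0.10 * quizzes + 0.20 * mid_sem + 0.60 * end_sem

# Grade bands from the grading system above (lower bound, letter).
BANDS = [(80, "A"), (75, "A-"), (70, "B+"), (65, "B"), (61, "B-"),
         (56, "C+"), (50, "C"), (45, "C-"), (40, "D"), (0, "F")]

def letter_grade(score):
    for lower, letter in BANDS:
        if score >= lower:
            return letter
    return "F"

total = final_score(assignments=85, quizzes=70, mid_sem=60, end_sem=72)
print(total, letter_grade(total))  # a total of about 70.7 maps to B+
```

Note how the end-of-semester examination dominates the final grade, which is why the weighting is worth showing explicitly.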
Calendar of Events
This will later be communicated to students
Class Contribution
Your contribution is an essential component in the overall teaching and learning process.
Contribution takes place in many forms: asking informed questions in class, making intelligent
comments, reading the case and being prepared to discuss the issues, actively listening to your
peers and working with others. Please remember that quantity is no substitute for quality.
There will be ample opportunity to contribute to the class. The format of the in-class discussions
of cases may take a variety of forms including: group analysis of single case issues during class,
presentation of issues and leading discussions of the case issues and participating in group
discussions.
It is your responsibility to ensure that you take an active role in class. If this is a problem for
you, I urge you to talk to me to discuss ways that you can make a contribution. The grading
for the class contribution in each class is as follows:
Grading Scale:
A 80 – 100     C+ 56 – 60
A- 75 – 79     C 50 – 55
B+ 70 – 74     C- 45 – 49
B 65 – 69      D 40 – 44
B- 61 – 64     F 0 – 39
Dress Code
All students are expected to dress formally for classes. For gentlemen, a shirt, trousers (if
possible with a tie) and shoes are acceptable. For ladies, a top, a skirt and shoes are required.
Jeans, "T"-shirts of any form, and slippers and sandals of any kind are not acceptable
dress for students undertaking this course. The dress code is intended to
inculcate in students the need to dress appropriately, as pertains in the business
environment.
• Study (not just Read) the Textbook
• Ask questions in class
• Chat with me after class or at appointed times
• Form study groups
TERM PAPER/PROJECT
Students in their groups should present a solution to one of the following questions in the
form of a project.
• Our lectures will concentrate on the following course outline defined for your
course in the University bulletin:
WEEK 1
Concept definitions in assessment
WEEK 2
Types of assessment
WEEK 3
Profile dimension (lesson objectives as a form of assessment)
WEEK 4
Planning classroom assessment.
WEEK 5
Validity of assessment
WEEK 6
Reliability
WEEK 7
Planning classroom test
WEEK 8
Types of Test [multiple choice, true/false, matching, fill-in, essay-type tests]
WEEK 9
Interpretation of test scores [measures of central tendency]
WEEK 10
Variability in test scores
WEEK 11
Relative position of students in test evaluation
WEEK 12
The standard normal distribution curve and performance interpretation
WEEK 13
Performance interpretation: skewness and kurtosis
WEEK 14
Marking scheme interpretation in mathematics assessment
WEEK 15
Revision and examinations
WEEK 16
Examinations
REFERENCES
1. Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA:
Brooks/Cole.
2. Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability.
Psychometrika, 2, 151-160.
3. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. New York: McGraw-Hill.
CHAPTER ONE
ASSESSMENTS
Assessment of student learning requires the use of a number of techniques for measuring
achievement. This is done through a systematic process that plays a significant role in effective
teaching. It begins with the identification of learning goals and ends with a judgment concerning
how well those goals have been attained.
For Savage & Armstrong (1987): “Assessment includes objective data from measurement …
(and) from other types of information, some of which are subjective (anecdotal records and
teacher observations and ratings of student performance). In addition, assessment also includes
arriving at value judgments made on the basis of subjective information.”
In each of the definitions above, a process is outlined. It is clear that some sort of instrument /
technique must be administered / used in order to obtain data /information. This data /information
can then be used to judge the level of understanding or standard of student performance in relation
to knowledge, skills, attitude and pattern of behaviour.
Assessment in Mathematics
Assessment is the systematic collection, review, and use of information about educational
programs undertaken for the purposes of improving learning and development. It is the process
of gathering and discussing information from multiple and diverse sources in order to develop a
deep understanding of what students know, understand, and can do with their knowledge as a
result of their educational experiences.
Assessment in education generally refers to the process of obtaining information that is used for
making decisions about students, curricula, programmes and educational policy.
Assessments are designed to help schools, parents, districts and students to determine the level of
proficiency of student understanding of mathematics learning standards.
Continuous Assessment
Continuous assessment can be defined as the daily process by which one gathers information
about students’ progress in achieving instructional objectives.
From the definition, continuous assessment implies that a student's final grade in an instructional
programme is a cumulative total of his performance on planned learning activities given during
the course. The main purpose of continuous assessment is to help every student become a
successful learner as well as to remedy the shortcomings of the traditional "one-shot" examination.
Characteristics of Continuous Assessment
(a) Systematic
It must have an operational plan; this must be done before the programme starts. The plan includes:
• What measurement is to be made, or what is to be measured.
• The tools or instruments to be used.
• When and how the assessment or measurement will be done (periods).
• The taking and filing of records of organized information.
(c) Cumulative
It takes account of learners' achievement or performance over a period of time.
Decisions made at any point in time consider previous decisions made on the learners
or pupils and represent many items put together.
(d) Guidance-Oriented
It points out or reveals areas of weakness and strength from time to time to allow redirection and
motivation of pupils and information obtained about pupils is used to guide pupils for further
growth and development.
(e) Formative
It uses measurement to diagnose pupils' problems and to help them overcome the problems or
master the task at hand. This results in pupils becoming adjusted to new forms and shapes.
8. It makes possible the measurement of all educational outcomes, especially those
cognitive, psychomotor and affective abilities that can only be measured over a reasonably
long period of time or are not measured at all under examination conditions.
3. The high population of pupils in a class and the high number of teaching periods per teacher
are likely to have an adverse influence on the teacher's attitude to work.
4. The candidate is anonymous under the system of external examinations and external examiners,
so that in theory the examiner has no way of favouring or victimizing candidates. In continuous
assessment, the teacher knows the student well, and there is the possibility of a student-tutor
relationship influencing the tutor's assessment. This possibility can put the reliability of
continuous assessment in doubt.
5. The same score awarded by teachers from different schools may not mean the
same level of performance. There is the possibility of schools and teachers trying to impress the
public by giving easy tests or inflating scores. This leads to a lowering of academic standards.
6. In continuous assessment the fate of the student is determined to some extent by individual
classroom teachers. Each teacher designs his or her own assessment, so standards will vary; this
is bound to generate fears about a lack of uniformity and fairness in assessing students.
7. Many tutors in Ghana lack the skill of constructing classroom tests; a poorly constructed
classroom test will yield biased information.
Nature of Assessment
Assessment of student learning requires that the classroom teacher review the nature of
assessment in order to effectively link teaching, learning and assessment.
Here are seven principles which emphasize the nature and importance of assessment:
1. How to assess:
Teachers must select from among all the techniques (methods) at their disposal.
Thus they must decide whether to use oral methods or written techniques in assessing students.
2. What to assess:
Teachers must be aware and decide what they are looking for in the individuals involved in the
learning process. Thus teachers must identify what exactly they want to assess in their students.
• Achievement (the extent to which students grasp content taught)
• Performance (how fast students can work out a given task)
3. When to assess:
Teachers must establish the purpose for which assessment is to be administered.
• Before instruction • During instruction • After Instruction
5. The developmental level of the students: Teachers must use their knowledge of learning
theories to plan appropriate assessment corresponding to students’ level of development, as well
as individual differences. Thus they must consider • Chronological, • Mental, • Physical, and
• Emotional, state of students before coming out with the assessment tasks.
6. How to interpret results: Teachers must consider the purpose and consequence of
assessment to facilitate the method of interpreting scores.
7. Provide feedback: Teachers must share strengths and weaknesses with the stakeholders of
education. Thus • Students, • Parents, • Administrators and • Policy makers, must be abreast
with the overall outcome of educational assessment in order to make an informed decision
affecting various stakeholders.
The myriad of educational outcomes has been classified to make mathematical assessment easier
by identifying the most important goals and objectives to consider when teaching specific subject
matter.
The Cognitive Domain:
Generally, the cognitive domain refers to educational outcomes that focus on knowledge and
abilities requiring memory, thinking and reasoning processes (Nitko, 2001). In other words, the
cognitive domain deals with all the mental processes, including perception, memory and information
processing, by which individuals acquire knowledge, solve problems and plan for the future.
Bloom's taxonomy
Bloom, Engelhart, Furst, Hill and Krathwohl developed this taxonomy in 1956. It is generally
known as Bloom's taxonomy. It is a comprehensive outline of a range of cognitive abilities that
might be taught in a course. The taxonomy can be described in terms of general instructional
outcomes and classifies cognitive performance into six major categories arranged from simple to
complex. Each major learning outcome of the classification is explained below, with examples to
illustrate it.
Knowledge
Knowledge refers to facts and to tested and accepted explanations (theories).
Knowledge in the cognitive domain involves the recall of facts, principles and procedures among
others. As Bloom and others define it, knowledge refers to recall of knowledge of specifics and
knowledge of ways and means of dealing with specifics. Knowledge of specifics includes
knowledge of terminology and knowledge of specific facts. For instance, we can talk about
knowledge of dates and events.
The knowledge of ways and means of dealing with specifics embraces knowledge of conventions,
classifications, criteria and methodology among others. Thus, we can talk of knowledge of the
criteria by which facts, principles and conduct are tested or judged. Thus, as a teacher if you ask
yourself whether your pupils can recall the main characters of the short story you told them or
whether they can recall the procedures in solving a problem, then you are within the realm of
knowledge in the Bloom’s taxonomy. However, for measurement purposes, the recall situation
involves little more than bringing to mind the appropriate material. Some of the action verbs that
can be used to state knowledge outcomes in specific terms include recall, identify and list.
Comprehension
Comprehension refers to a type of understanding that indicates that the individual knows what is
being communicated and can make use of the material or idea being communicated without
necessarily relating it to other materials or ideas. Comprehension is a bit more complex than
knowledge. One can recall a piece of information without necessarily understanding it. The
achievement of comprehension is evidenced by being able to carefully and accurately translate,
interpret and determine the implications of what is being communicated.
An implication of comprehension is that, one can say what is understood in a different way
accurately. Examples of action verbs that can be used to specifically indicate comprehension
include explain, give and find.
Application:
Application refers to the ability to use learned material to solve new or novel problems. Thus,
at this level of complexity, you do not only know and understand but are also able to apply the
knowledge and understanding to solve relevant problems.
It is necessary that, in educating our students, we emphasize this learning outcome of cognitive
domain. The emphasis should not be on memorizing facts and figures and recalling them but on
making use of the knowledge and understanding achieved to solve new mathematical problems.
Analysis
Example: Abi can do a piece of work in 6 days while Joe can do the same piece of work in 10
days. How many days will the two take to do the piece of work together? This question requires a
high level of thinking, so here students are expected to display analytical thinking ability.
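A sketch of the intended solution, added here for illustration (assuming the two work on the job together), works with rates:

```latex
\text{Combined rate} = \frac{1}{6} + \frac{1}{10} = \frac{5 + 3}{30} = \frac{4}{15}\ \text{of the work per day}
\qquad\Longrightarrow\qquad
\text{Time} = \frac{15}{4} = 3\tfrac{3}{4}\ \text{days}.
```

The analytical step is recognizing that rates of work, not days, are the quantities that add.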
Synthesis
Synthesis is concerned with putting together elements and parts so as to form a whole. It involves
the process of working with pieces, parts, elements, etc., arranging and combining them in
such a way as to constitute a pattern or structure not clearly there before.
Example: Asking students to show similarities between two mathematical phenomena, such as
comparing and contrasting a square and a rhombus. We note that analysis and synthesis both
concern the parts of a whole: while in analysis the whole is broken into its component parts, in
synthesis the elements or parts are put together.
Example 2: Suppose we are asked to obtain the equation whose roots are 𝑥 = 3 and 𝑥 = −2.
Here we need to put the various 'parts' (the roots) together to get the equation.
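As an illustrative sketch (assuming the intended roots are 3 and −2), the 'parts' combine into the whole equation by multiplying the corresponding factors:

```latex
(x - 3)(x + 2) = 0
\quad\Longrightarrow\quad
x^2 - x - 6 = 0,
\qquad \text{check: sum of roots} = 3 + (-2) = 1,\quad \text{product} = 3 \times (-2) = -6.
```

The check uses the fact that for \(x^2 + bx + c = 0\) the sum of the roots is \(-b\) and the product is \(c\).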
Evaluation:
By evaluation we refer to judgment about the value of materials, methods and things for their
effectiveness. Judgment may be about the extent to which materials satisfy specific criteria or
standards. It is the most complex cognitive area in Bloom's taxonomy of educational outcomes.
Quellmalz’s Taxonomy
Quellmalz, just like Bloom et al., classified the cognitive taxonomy into recall, analysis,
comparison, inference and evaluation.
Recall: This refers to recognizing or remembering key facts, definitions, concepts, rules and
principles. Bloom's taxonomy levels of knowledge and comprehension are subsumed in
Quellmalz's category of recall.
In deductive reasoning, we operate from a generalization to the specific. It is the method in which
a law is accepted and applied to a number of specific examples [deduction]. The student does not
discover the law but develops skill in applying it, proceeding from the general to the specific or
from the abstract to the concrete.
Example (a syllogism):
P1: All humans are mortal.
P2: Kofi is a human.
Therefore, Kofi is mortal.
Inductive reasoning is the opposite of deductive reasoning; it operates from the specific to the
general. Induction is that form of reasoning in which a general law is derived from a study of
particular objects or specific processes. Students use measurement, manipulatives or constructive
activities and patterns, etc. to discover a relation. They later formulate a law or rule about the
relationship based on their observations, experiences, inferences and conclusions.
Examples of this kind of reasoning are found in proof by mathematical induction and in
contrapositive proofs.
Example: 4² × 4³ = 4²⁺³ = 4⁵
3² × 3³ × 3² = 3²⁺³⁺² = 3⁷
Therefore 𝑎ᵐ × 𝑎ⁿ × 𝑎ᵖ = 𝑎ᵐ⁺ⁿ⁺ᵖ
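As a small illustrative sketch (not from the course text), the inductive route to the law of indices, 𝑎ᵐ × 𝑎ⁿ = 𝑎ᵐ⁺ⁿ, can be mimicked by checking the pattern on specific cases first, the way students would before conjecturing the general rule. The sample cases are arbitrary assumptions.

```python
# Specific instances first (inductive step: particular observations)...
cases = [(4, 2, 3), (3, 2, 4), (5, 1, 6), (2, 5, 5)]
for a, m, n in cases:
    # each case is one concrete observation of the pattern a^m * a^n = a^(m+n)
    assert a**m * a**n == a**(m + n)
# ...then the general law a^m * a^n = a^(m+n) is conjectured and later proved.
print("pattern confirmed on", len(cases), "cases")
```

Checking cases does not prove the law; it only motivates the conjecture, which a formal proof (e.g. by induction on the exponent) then establishes.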
Evaluation: this category of learning outcome is concerned with judging quality, credibility,
worth or practicality. It is related to Bloom's levels of synthesis and evaluation.
The Affective Domain:
The affective domain is concerned with educational outcomes that focus on feelings, attitudes,
dispositions and emotional states. In other words, the affective domain describes our feelings,
likes and dislikes and our experiences, as well as the resulting behaviours (reactions).
Krathwohl et al. identified five main categories of outcomes in the affective domain. These are
Receiving, Responding, Valuing, Organizing and Characterization
• Responding: this category refers to active participation on the part of the individual. At
this level, the individual does not only attend to a particular phenomenon or stimulus but also
reacts to it in some way. Learning outcomes involve obedience or compliance, willingness
to respond and satisfaction. E.g. when a student voluntarily reads beyond what is assigned
or solves more of the given exercises than instructed.
• Valuing: this concerns the worth or value an individual attaches to a particular object or
behaviour. It is based on the internalization of a set of specific values. It embraces acceptance
of values, commitment and appreciation.
The Psychomotor Domain:
The psychomotor domain refers to educational outcomes that focus on motor (movement) skills
and perceptual processes. Motor skills relate to movement, whilst perceptual processes are
concerned with the interpretation of stimuli from various modalities, providing data for the
learner to make adjustments to his environment.
Harrow’s taxonomy of psychomotor and perceptual objectives has six levels including: reflex
movement, basic-fundamental movements, perceptual abilities, physical abilities, skilled
movements and non-discursive communication.
Reflex movement: Reflex movements are movements elicited without conscious volition on the
part of the individual in response to some stimuli. Examples of such movements include extension,
stretching and postural adjustment. The sub-categories of reflex movement according to Harrow
(1972) are segmental reflexes, inter-segmental reflexes and supra-segmental reflexes.
Basic fundamental movements: This category is concerned with inherent movement patterns that
are formed from a combination of reflex movements and are the basis for complex skilled
movements. E.g. walking, running, jumping, bending, pulling. Sub-categories of this level include
locomotor movement, non-locomotor movement, and manipulative movement.
Physical abilities: physical abilities involve functional characteristics of organic vigor which are
essential to the development of highly skilled movement. The category entails endurance, strength,
flexibility and agility. Examples include distance running, weight lifting, wrestling, typing, etc.
Skilled movements: this refers to complex movement tasks performed with a degree of efficiency,
based on inherent movement patterns. It builds on locomotor and manipulative movements. Three
sub-categories include adaptive skills, compound adaptive skills and complex adaptive skills.
Non-discursive communication: this category covers communication through bodily movement and
has two levels - expressive movement and interpretative movement. Body posture, gestures,
facial expressions and skilled dance movements are included in this category.
Purposes of Assessment.
• Serving instruction
• Accountability
• Selection
• Licensure
Assessing students' performance in order to inform instruction is something that all teachers
do. It is often the case that an external agency of some sort gets involved in the assessment,
nominally to serve instruction. However, the time lapse between the administration of the tests
and the reporting of 'scores' to teachers who might be able to use the information is such that
there is little reason to assume that any such testing by an external agency has much to
contribute to assessment for instruction.
Assessment for the purpose of saying how well a student, or a class, or a school, or an
instructional program is doing is the primary purpose of assessment for accountability.
Traditionally, such information has been presented in one of two quite different forms: norm-
referenced and criterion-referenced. Norm-referenced accountability statements involve
comparing students' performance (or that of classes or schools) to one another and then
presenting the results of those comparisons in rank order. It should be noted that this can only
be done if the performance of the students can be encoded in a one-dimensional measure.
Criterion-referenced accountability statements involve comparing students' performance (or that
of classes or schools) to some predetermined set of performance criteria without regard to how
students compare to one another. It should be noted that this can only be done if one has a
clearly defined set of performance criteria that reflect one's theory of competence in the domain
being assessed.
Assessing for selection is normally done for the purpose of helping to ascertain whether a
student will have access to limited resources. Such assessment is often employed in order to
inform decisions about access to select universities, polytechnics, colleges of education,
programs for gifted music students, special education programs, etc.
Assessing for the purpose of licensure is normally done in order to ascertain whether the
people being assessed have exceeded some threshold of minimal competence and are thus
permitted to practice in an unsupervised fashion the skill that they have demonstrated. Such
skills include driving automobiles, swimming in the deep part of the pool, barbering,
butchering, working as an electrician or plumber, etc.
viii. Motivating students;
ix. Reporting to stakeholders;
x. Certifying examinees.
Mehrens & Lehmann (1984, 7–12) conclude that the main purpose of assessment, therefore, is
to make EDUCATIONAL DECISIONS.
These include the following:
Generally, we want to find out about our students in order to make decisions related to:
• Placement • Selection • Aptitude • Achievement • Classification • Guidance
• Promotion
CHAPTER TWO
MEASUREMENT
Measurement refers to the procedure for assigning numbers or scores to a specific attribute or
characteristics of a person in such a way that the numbers describe the degree to which the person
possesses the attributes. It is the process of assigning numbers or numerical index to an attribute
or a trait possessed by a person or a learner or an event or a set of objects or whatever quality that
is being assessed according to specific rules. The purpose is to indicate the differences among
those who are being assessed in the degree to which they possess the characteristics being
measured. Thus the essence of measurement is to determine the amount of an attribute possessed
by the people or objects.
Scale of measurement
Depending upon the traits / attributes, characteristics and the way they are measured, different
kinds of data result representing different scales of measurement.
Thus, measurement implies the use of scales. Four measurement scales exist as nominal, ordinal,
interval and ratio.
1. Nominal
Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of
measurement is simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even
though we are using the numbers 1 and 2, they do not denote quantity. The binary category of 0
and 1 used for computers is a nominal level of measurement. They are categories or
classifications. The categories are established by the researcher and an item is counted when it
falls into this category.
The most significant point about nominal scales is that they do not imply any ordering among
the responses.
For example, when classifying people according to their favorite color, there is no sense in
which green is placed “ahead of” blue. A nominal level of measurement is the least precise form
of measurement.
Examples:
1. Meal Preference: Breakfast, Lunch, Dinner
2. Religious Preference: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 = Other
3. Political Orientation: NDC, NPP, PNC, PPP, CPP, GFP.
4. Number of males or females.
5. Number of individuals who fall under the category of introvert or extrovert
6. Height – the number of tall, medium or short people in a group
7. Counting the number of participants who did or did not experience anxiety
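As a hypothetical sketch (the category codes and counts below are made up for illustration), nominal data support counting and the mode, but not arithmetic, because the numbers are labels rather than quantities:

```python
from collections import Counter

# Nominal codes, as in example 2 above: 1=Buddhist, 2=Muslim, 3=Christian,
# 4=Jewish, 5=Other. The numbers name categories; they do not measure anything.
religions = [1, 2, 3, 2, 2, 5, 3]

counts = Counter(religions)              # counting per category is valid
mode_code = counts.most_common(1)[0][0]  # the mode (most frequent category)
print(mode_code)
# A "mean religion" such as sum(religions) / len(religions) would be
# meaningless, because averaging category labels has no interpretation.
```

This is the practical content of the point above: only operations that treat the codes as names (counting, mode, classification) are legitimate at the nominal level.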
2. Ordinal refers to “order” in measurement. An ordinal scale indicates direction, in addition to
providing nominal information. Low / Medium / High; or Faster/Slower are examples of ordinal
levels of measurement. Ranking an experience as a "nine" on a scale of 1 to 10 tells us that it
was higher than an experience ranked as a "six." Many psychological scales or inventories are at
the ordinal level of measurement.
Unlike nominal levels of measurement, ordinal measurement allows comparisons of the degree
to which two subjects possess the dependent variable.
For example, placing feelings as "very unsatisfied", "satisfied", or "very satisfied"
makes it meaningful to assert that one person is more satisfied than another with the way
the country Ghana is managed. Such an assertion reflects the first person's use of a verbal label
that comes later in the list than the label chosen by the second person. However, ordinal data fail
to capture the precise differences between the data. In particular, it cannot be assumed that the
differences between two levels of ordinal data are the same as the differences between two other
levels. For instance, it cannot be assumed that the difference between "very unsatisfied" and
"satisfied" is the same as the difference between "satisfied" and "very satisfied".
In the same way, it cannot be assumed that if we rank a group of people from tallest to shortest,
the difference between the tallest person in the group and the second tallest person is the
same as the difference between the 4th and 5th tallest people in the group. In other words,
ordinal-level data lack a degree of specific information.
Examples:
a. Rank: 1st place, 2nd place, ... last place
b. Level of Agreement: No, Maybe, Yes
d. Rating of Attractiveness on a scale of 1 to 10
e. Race Results – which racers came in 1st, 2nd, 3rd, etc. (actual times or intervals may be
widely different)
f. Height: Group of people in order from Shortest to Tallest
g. Time of Day: Dawn, Morning, Noon, Afternoon, Evening
3. Interval scales provide information about order, and also possess equal intervals. From the
previous example, if we knew that the distance between 1 and 2 was the same as that between 7
and 8 on our 10-point rating scale, then we would have an interval scale.
Equal-interval scales of measurement can be devised for opinions and attitudes. However,
constructing them involves an understanding of mathematical and statistical principles beyond
those covered in this course. But it is important to understand the different levels of
measurement when using and interpreting scales.
Examples:
a. Time of Day on a 12-hour clock
b. Political Orientation: Score on standardized scale of political orientation. Thus the vote of
the poor and rich has the same magnitude.
c. Other scales constructed so as to possess equal intervals
d. Height of a person(s) in centimeters or Inches
Interval – example is time of day - equal intervals; analog (12-hr.) clock, difference between 1
and 2 pm is same as difference between 11 and 12 am
4. Ratio
The ratio scale of measurement is the most informative level of measurement. It is really just an
“interval” measurement with the additional property that its zero position indicates the absence
of the quantity being measured. You can think of a ratio scale as the three earlier scales rolled up
in one. Like a nominal scale, it provides a name or category for each object (the numbers serve
as labels). Like an ordinal scale, the objects are ordered (in terms of the ordering of the numbers).
Like an interval scale, the same difference at two places on the scale has the same meaning. And
in addition, the same ratio at two places on the scale also carries the same meaning. In other
words, In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale
has an absolute zero (a point where none of the quality being measured exists).
Using a ratio scale permits comparisons such as being twice as high, or one-half as much.
Reaction time (how long it takes to respond to a signal of some sort) uses a
ratio scale of measurement -- time.
Although an individual's reaction time is always greater than zero, we conceptualize a zero point
in time, and can state that a response of 24 milliseconds is twice as fast as a response time of 48
milliseconds.
In memory experiments, the dependent variable is often the number of items correctly recalled.
What scale of measurement is this? You could reasonably argue that it is a ratio scale. First, there
is a true zero point: some subjects may get no items correct at all. Moreover, a difference of one
represents a difference of one item recalled across the entire scale. It is certainly valid to say that
someone who recalled 12 items recalled twice as many items as someone who recalled only 6
items. However, this holds only if the words are of roughly the same level of difficulty.
Other Examples
a. Ruler: inches or centimeters
b. Years of work experience
c. Income: Money earned last year
d. Memory – number of correctly remembered items from a list of words (if equal difficulty)
e. GPA: Grade point average
f. Number of children a couple has
Ratio – A 24-hr. time format has an absolute 0 (midnight); 14 o'clock is twice as long from
midnight as 7 o'clock
ADDITIONAL NOTES
• The level of measurement for a particular variable is defined by the highest category that it
achieves.
If we categorize people 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an
ordinal level of measurement. If we use a standardized measure of shyness (and there are such
inventories), we would probably assume the shyness variable meets the standards of an interval
level of measurement.
• As to whether or not we might have a ratio scale of shyness, although we might be able to
measure zero shyness, it would be difficult to devise a scale where we would be comfortable
talking about someone's being 3 times as shy as someone else.
• Measurement at the interval or ratio level is desirable because we can use the more powerful
statistical procedures available for Means and Standard Deviations. To have this advantage, often
ordinal data are treated as though they were interval; for example, subjective ratings scales
(1 = poor, 2 = fair, 3 = good, 4 = excellent). The scale probably does not meet the requirement of
equal intervals -- we don't know that the difference between 1 (poor) and 2 (fair) is the same as
the difference between 3(good) and 4 (excellent).
• In order to take advantage of more powerful statistical techniques, researchers often assume
that the intervals are equal.
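The point above can be illustrated with a short Python sketch. The ratings below are made up, and the ordinal 1–4 scale is treated as if it were interval so that a mean and standard deviation can be computed:

```python
from statistics import mean, stdev

# Hypothetical ratings on a 4-point scale (1 = poor ... 4 = excellent),
# treated as interval data so that means and standard deviations apply.
ratings = [2, 3, 3, 4, 1, 3, 2, 4, 3, 3]

print(mean(ratings))   # average rating: 2.8
print(stdev(ratings))  # sample standard deviation
```

Whether such summaries are meaningful depends on the (often untested) equal-interval assumption discussed above.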
CHAPTER THREE
TEST
Formal and Informal Assessment
There are two major approaches to assessment. These are the Formal and Informal
assessment.
Formal Assessment: This is also known as the test technique (pencil-and-paper test). As described
by Passing, it is the type of test done at the end of a lesson, topic, unit, school term or year, etc.
It is planned, structured and well designed. In formal assessment, the design may be an objective
test or an essay-type test. It is quantitative in nature.
A test is a task or series of tasks, which are used to measure specific attributes or traits of people
in educational setting. Tests are classified in various ways using criteria like purposes, uses and
nature. Some of the common ones are diagnostic, aptitude, intelligence and achievement tests.
Achievement tests essentially measure knowledge obtained from formal learning situations. They
measure the degree of a student's learning in specific curricular areas in which he has received
instruction. They focus on more concrete objectives in the measuring of ability. Achievement
tests therefore measure previously acquired knowledge.
Achievement tests can be classified into two as:
• Teacher-made achievement tests e.g. Objective and essay tests.
• Standardized achievement test (SAT)
The major difference between standardized and teacher-made achievement tests is that
standardized tests are carefully constructed by test experts, and administered and scored under
specific uniform conditions. In addition, the scores are interpreted in terms of established norms
stated in the test manual, while teacher-made tests are not necessarily so. Teacher-made tests may
be confined to some specific content covered.
Standardized Tests
• These are tests carefully constructed by test experts, administered and scored under
specified uniform conditions. In addition, the scores are interpreted in terms of
established norms stated in the test manual. In administering the test, the test
administrator must adhere strictly to the instructions; any deviations or violations
render the results useless. The test usually has its validity clearly stated, and the
norms and their interpretations are very often indicated. The results are usually
expressed in grade equivalents, percentile ranks and standard scores.
Examples of standardized achievement tests are the Stanford Achievement Test (SAT), the California
Achievement Test (CAT) and the Comprehensive Test of Basic Skills (CTBS). They tend to be
more or less commercial.
Strengths of standardized Tests
§ The inherent validity and reliability make the results genuine.
§ The information they give truly represent the trait measured hence permitting decisions
that are well informed.
§ Professionals are enabled to make decisions related to eligibility and placement.
§ The tests have no room for subjective tendencies. This means that the assessor or
evaluator cannot depend on his / her wits to interpret results.
§ The test protocol specifically describes how the tests should be administered, scored and
interpreted.
§ The outcomes are objective. Professionals can easily compare an individual to a normative
group. The extent to which a child deviates can be known.
§ Standardized tests are very often used as screening devices to sort out individuals who
deviate from the norm group. This helps professionals to categorize pupils.
Weaknesses of Standardized Tests
• The tests do not favour children in deprived localities; most such children may find it
difficult to cope with the instructional requirements. In certain instances, since the child
is aware that he/she is being assessed, he/she could put up behaviour which may not be
natural. This may produce misleading results.
• The tests do not provide sufficient information on why the test taker fails to achieve.
• Some of the tests are not culturally fair; they may be full of biases. Test takers may not
understand the language. Most seriously, since the experiences one has have effects on
one's performance, we can imagine the mess that will arise when a test is applied to
children in an environment different from the one in which the test was normed. This is
why it is important for test administrators to be careful when selecting tests. Leaving tests
in the hands of inexperienced individuals will wreak enormous damage on the lives of
innocent children.
Teacher-made Tests
Unlike standardized test, these one are structured by teachers in the classroom. For instance after
teaching, a teacher can construct test made up of a few items to test the degree of students’
learning in that specific unit. By doing this, the teacher does not go through any elaborate process
as in the construction of standardized tests. More so, the test may be confined to the specific
content covered within a given period. The test may be either objective or essay tests. Teacher-
made test are means to an end. They aid in decision-making. As a recapitulation, teacher-
constructed classroom achievement tests are used to
(a) Determine what students know.
(b) Identify student’s learning problems and areas that should receive remedial teaching.
(c) Determine the effectiveness of pedagogical strategies.
(d) Find out to what extent students are meeting set out instructional objectives.
(e) Give guidance and counseling to students on how and what to study as well as choice of
content.
(f) Encourage and motivate students to learn.
(g) Give students feedback on their performance to enable them to improve in areas in which
they are weak.
(h) Select and promote students from one grade to another.
(i) Group and select students for instructional purposes.
(j) Predict students’ future performance.
(k) Provide parents or guardians with information on the performance of their children
or wards.
Students may be at ease with the examination, since the question items are constructed by their
own teacher. Cultural biases may be reduced, and teachers may be comfortable administering the
test items.
• There is, however, the tendency for teachers to be partial in administering and scoring test
items.
(b) Listing the main topics covered or to be covered;
(c) Marrying the objectives and the list of topics to build the table of specifications for
the test; and
(d) Determining the appropriate test items and types.
It is worth noting that in the planning, one does not only list objectives but also tries to classify
them. For instance, those dealing with recall, comprehension, application, interpretation, etc.
should be clearly delineated. This information is used to build the table of specifications.
A convenient way to set up the table of specification is to have the objectives or the abilities to
be demonstrated across the top of the page and the subject matter contents or topics in a column
on the left hand side of the page. An example for Mathematics based on Bloom’s Taxonomy is
given in table below.
In the table of specifications, the number of items is indicated in the cell where the two meet. In
Table 1 the writer has indicated that, for that content, 4 items will be constructed to test for
knowledge of usage and 3 items to test for application. Not all the cells in the table of
specifications need to have items, since certain processes will be unsuitable or irrelevant for
certain topics. The number of items devoted to each topic and objective, as well as the importance
attached to them, indicates the relative weight given to each area of content and behaviour. A
table of specifications is usually used more with objective test items than with essay items,
because objective items seem to measure single units of behaviour in content areas.
However, the table of specifications is still applicable in the construction of essay tests.
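As an illustrative sketch, a table of specifications can be represented in Python and its row and column totals used to check the weighting. The topic names and item counts below are invented for demonstration, not taken from Table 1:

```python
# A table of specifications as a nested dictionary: rows are content topics,
# columns are Bloom's-taxonomy objectives, and each cell holds the number of
# items planned. All names and counts here are illustrative.
spec_table = {
    "Sets":     {"Knowledge": 4, "Comprehension": 2, "Application": 3},
    "Algebra":  {"Knowledge": 3, "Comprehension": 4, "Application": 5},
    "Geometry": {"Knowledge": 2, "Comprehension": 3, "Application": 2},
}

objectives = ["Knowledge", "Comprehension", "Application"]

# Column totals give the weight placed on each objective; row totals give
# the weight placed on each topic.
column_totals = {obj: sum(row[obj] for row in spec_table.values()) for obj in objectives}
row_totals = {topic: sum(row.values()) for topic, row in spec_table.items()}
total_items = sum(row_totals.values())

print(column_totals)
print(row_totals, total_items)
```

Checking the totals in this way helps confirm that the relative weights match the emphasis given to each topic during instruction.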
Importance of the Table of Specification
1. It helps the teacher to cover adequately the topics treated during the term. The
behaviours that students were expected to acquire are also all catered for when the table of
specifications is used.
2. It helps the teacher to determine the content validity of the test, in that the teacher is
able to sample items covering all that has been taught during the term.
3. It helps the teacher to do a meaningful weighting of the test items in each cell of the
table of specifications accordingly.
4. It avoids overlapping in the construction of the test items.
5. It helps teachers to determine content areas where students / pupils have difficulty.
2. Determine the item Format to Use
The teacher has several options as far as the format of the test items is concerned. That is,
he may either use the essay-type test, where the student produces his own answer in an extended
form, or the objective-type test, where the student is required to select an answer from some
alternatives supplied by the teacher. The format must be appropriate for testing the topic and
objectives concerned. It may be necessary to use more than one item format.
The format depends on:
• The purpose of the test
• Time available for writing items
• Number of students to be tested
• Physical facilities available
• Academic standard of testees
• Writer’s skill in item writing
• Ages of the students.
3. Determine what is to be tested [define the task and shape the problem situation]
The teacher must know what the test intends to measure and the content area he seeks to
cover, so that the expected knowledge, skills and attitudes of students can be measured. Test
items should, as much as possible, reflect the content and instructional objectives. They
should also match the maturity level of the testees. Ideally, a test plan made up of a table of
specifications or a blueprint must be prepared.
The following are some important clues that will guide the teacher in determining what to be
tested.
• Define instructional objectives
• Know the chapter or units that the test is to cover
• Make sure test items match with course objective
• Prepare a table of specifications
i. Include questions of varying difficulty.
j. Write the items and the scoring scheme as soon as possible after the material has
been taught.
k. Avoid lifting questions directly from textbooks and past questions.
l. Write test items in advance of the test date to permit review and editing.
6. Writing Directions
Every test must be provided with some directions. The directions provided will help the student
respond to the questions appropriately. The directions should include
• Number of questions to answer.
• The time limit for the questions.
• The various sections in the examination and how they are to select questions from the
sections.
• Penalties for offences committed.
• How and where the answers are to be written.
• Clarity of expression etc.
• Marks allocated to the various items.
7. Preparing the Scoring Scheme
Objective-type test: Here the best way is to prepare a key, which contains the correct or
best answer to each question, and compare it with the answer a student gives.
Essay Test
This type of assessment usually requires students to solve a mathematical problem and present their
own solution. Depending on the amount of freedom given to the testee, essay-type tests can be divided
into two types: the restricted response type and the extended response type.
Restricted response type: it limits the content and the form the testee's answer should take.
Advantages of essay tests
• They are easy to prepare.
• Little time is required to prepare the items
• It encourages global learning
• Skills such as ability to organize materials and to write and to arrive at
conclusions are improved
• They are best suited for measuring higher order behaviours and mental acuity
Disadvantages of essay test
§ Scoring objectively is difficult
§ It can be time consuming for the taker and the marker
§ It is prone to halo effect where scoring can be influenced by extraneous factors like
relationship and handwriting
§ Content validity can be reduced, since essay tests necessitate testing a limited sample of
the subject matter.
§ Bluffing by testees may arise, where students provide unnecessary material
§ Students who write faster may score higher marks, since a premium is placed on the amount written
Marking Scheme
A marking scheme is a step-by-step procedure outlining (detailing) how a given (mathematical)
question is to be solved. It indicates the marks that should be awarded at each step of interest. Thus, in
writing the scheme, marks must be allocated to the various expected qualities or behaviours you
want your students to exhibit.
In Mathematics, the letters M for method, A for accuracy and B for marks awarded independently
of method are used in the award of marks.
Example
1. (a) Show that log_a x = (log_b x) / (log_b a). 4 marks
   (b) Hence, solve for x if log_3 x = 4 log_x 3. 6 marks

Solution:
(a) Let log_a x = p, so that x = a^p.  M1
    Taking log_b of both sides gives log_b x = log_b a^p.  M1
    p log_b a = log_b x  B1
    p = log_a x = (log_b x) / (log_b a)  M1

(b) log_3 x = 4 log_x 3 = (4 log_3 3) / (log_3 x)  M1
    log_3 x = 4 / (log_3 x)  M1
    (log_3 x)² = 4  B1
    log_3 x = ±2
    x = 3² or 3⁻², i.e. x = 9 or x = 1/9  A2
Example
Given that y = (x² − 2x + 3) / 3, find the value of y when x = −2.

Student A:
y = ((−2)² − 2(−2) + 3) / 3  M2
  = (4 + 4 + 3) / 3  B1
  = 11/3 or 3.666̇  A1

Student B:
y = ((−5)² − 2(−5) + 3) / 3  M2
  = (25 + 10 + 3) / 3  B1
  = 38/3 or 12.333̇  A0
In the two solutions presented by the two students, Student A scored all 4 marks. Student B
scored 3 marks: he earned the M and B marks because his method of substitution and his
intermediate results (25 and 10) were correct. His final answer follows correctly from his
working, but because he substituted x = −5 instead of x = −2, it is not the answer to the question,
so he lost the A mark.
Try: Set a mathematics question that would attract 10 marks. Prepare a marking scheme
indicating clearly the marks at the steps of interest.
The multiple-choice test is the most frequently used and most highly regarded objective test.
A multiple-choice test is a type of objective test in which the respondent is given a stem and is
to select, from among three or more alternatives, options or responses, the one that best completes
the stem. The incorrect options are called foils or distractors.
The multiple-choice item consists of two parts.
The stem contains the problem or the incomplete statement introducing the test item.
A list of suggested answers, known as responses, options, alternatives or choices, follows.
There are two types of multiple-choice test. These are:
The single best response and the multiple best responses
The single best response type consists of a stem followed by three or more responses and the
respondent is to select one option to complete the stem.
The multiple best response type consists of a stem followed by several true or false statements or
words. The respondent is to select which statements could complete the stem.
2. Specific determiners should be avoided, as they invite guesswork, e.g. 'an', 'a', 'some',
'most', 'often', 'all', 'always', 'never', 'none'.
3. Items should be stated in positive terms rather than in negative terms.
4. Test items should not be copied directly from textbooks or from other people's past test items.
Original items should always be constructed.
5. Create independent items. The answer to one item should not depend on the knowledge of
the answers to previous items.
7. Items that measure opinions should be avoided. One option should clearly be the best
answer.
3. The test occupies much space
4. It cannot be used to measure certain problem-solving skills.
Example of multiple – choice question
2. The length of a rectangle is twice its width. If the perimeter of the rectangle is 42 meters,
find its width.
A. 9 m
B. 8 m
C. 7 m
D. 5 m
4. A man bought a car for Gh¢15,000.00. He later sold it at a profit of 20%. What was the
selling price?
A. Gh¢3,000.00
B. Gh¢18,000.00
C. Gh¢12,500.00
D. Gh¢30,000.00
The Venn diagram below shows a class of 35 students studying one or more of three subjects,
Mathematics (M), Economics (E) and Geography (G).
[Venn diagram: U = 35; M only = 5, E only = k, M ∩ E only = 7, M ∩ G only = 3,
E ∩ G only = 2, M ∩ E ∩ G = 4, G only = 10]
Use it to answer Questions 5 and 6.
5. Find the value of k.
A. 6
B. 5
C. 4
D. 3
CHAPTER FOUR
VALIDITY
In order to ensure a high degree of reliability, suitability, objectivity and validity, there are several
approaches the teacher can use to evaluate an assessment.
For educational and social-science researchers analysing students' assessment scores, estimating
reliability and validity is a frequently encountered task. Measurement issues differ in the social
sciences in that they relate to the quantification of abstract, intangible and unobservable
constructs; in many instances, then, the meaning of quantities is only inferred.
It is important to bear in mind that validity and reliability are very important in analyses of test
results.
Validity is the extent to which a test measures what it is supposed to measure, and the accuracy of
inferences and decisions made on the basis of the assessment results. It refers to the degree to
which evidence and theory support the interpretations of test scores entailed by the proposed uses
of tests. In other words, validity refers to the soundness or appropriateness of interpretations and
uses of students' assessment results.
For example, if a timed test of one-digit multiplication is used to determine how quickly students
can recall their multiplication facts, the test is measuring what it was designed to measure. If the
same test were employed to assess students’ capacity to determine whether to use addition,
subtraction, multiplication, or division to solve a variety of problems, the test would not meet the
criterion. To the extent that a standards-based mathematics test is valid, we should be confident
that a student who does well on it is in fact competent in the mathematics skills and processes
specified in the standards. To be valid, an assessment should also be fair or equitable; that is, it
should enable students to demonstrate their mathematical competence, regardless of their
language or cultural background, or physical disabilities.
The question of validity is raised in the context of three points: the form of the test,
the purpose of the test and the population for whom it is intended. Therefore, we cannot ask the
general question “Is this a valid test?” The question to ask is “how valid is this test for the decision
that I need to make?” or “how valid is the interpretation I propose for the test?” We can divide
the types of validity into logical and empirical.
• The concept of validity refers to the ways in which we interpret and use the
assessment results, and not to the assessment itself.
• The assessment results have different degrees of validity for different purposes
and for different situations.
• Judgments about the validity of interpretations or uses of assessment results
should be made only after studying and combining several types of validity
evidence.
Validation (of assessment) refers to ascertaining the appropriateness or soundness of the uses
and interpretations of assessment results based on available evidence. Nitko (2001) noted that
validity judgments must be based on four principles:
1. The interpretations or meanings you give to your students' assessment results are valid
only to the degree that you can point to evidence that supports their appropriateness and
correctness. E.g., consider a situation where Ansah has taken the mathematics
achievement test each year but his scores suddenly rose this year. Ansah's score has
several interpretations: his mathematical conceptual skills have improved, he is highly
motivated, or his solutions to mathematical problems have improved, etc.
2. The uses you make of your assessment results are valid only to the degree to which you
can point to evidence that supports their correctness and appropriateness. E.g., Ansah's
teacher can use his score in a number of ways: diagnosis, placement, certification, etc.
3. The interpretation and uses you make of your assessment results are valid only when
the values implied by them are appropriate.
4. The interpretation and uses you make of your assessment results are valid only when
the consequences for the interpretations and uses are consistent with appropriate values.
Content Validity:
When we want to find out if the entire content of the behavior / construct / area is represented in
the test, we compare the test task with the content of the behavior. This is a logical method, not
an empirical one. For example, if we want to test knowledge of the area of plane figures, we must
not limit the questions to, say, rectangles; they should cover as many plane figures as possible.
Also, if we want to test knowledge on Ghanaian Geography, it is not fair to have most questions
limited to the geography of Brong Ahafo Region but questions must cover the whole of Ghana.
Face Validity:
Basically face validity refers to the degree to which a test appears to measure what it purports to
measure.
For example, suppose we want to test students' understanding of the term 'area'. We can present
various plane figures to the students and ask them to find their areas. Here we are interested only
in 'area' and nothing else.
• Predictive validity evidence: the criterion data are gathered at a later date,
e.g. when a student's JHS mathematics results are used to predict performance in SHS.
• Concurrent validity evidence: when the scores, both test scores and criterion scores, are
collected at the same time, we have concurrent validity evidence.
Concurrent Validity: is the degree to which the scores on a test are related to the
scores on another, already established test administered at the same time, or to some
other valid criterion available at the same time. Example, a new simple test is to be
used in place of an old cumbersome one, which is considered useful; measurements are
obtained on both at the same time. Logically, predictive and concurrent validation are
the same; the term concurrent validation is used to indicate that no time elapsed
between the measures.
When you are predicting a future performance based on the scores currently obtained by the
measure, correlate the scores obtained with the later performance. The later performance is called
the criterion and the current score is the predictor. This is an empirical check on the value of
the test: a criterion-oriented or predictive validation.
The method used in determining and expressing validity is the same for concurrent and
predictive validity. We use correlation analysis (the correlation coefficient) to measure and
quantify the strength of the relationship between the scores. The appropriate correlation to
compute is the Pearson product-moment correlation coefficient, given as
r = S_xy / √(S_xx S_yy)    [the covariance method]

which can be rewritten as

r_xy = Σⁿᵢ₌₁(xᵢ − x̄)(yᵢ − ȳ) / √[Σⁿᵢ₌₁(xᵢ − x̄)² · Σⁿᵢ₌₁(yᵢ − ȳ)²]    … (1)

That is, if r_xy = ρ, then

ρ = Σⁿᵢ₌₁(xᵢ − x̄)(yᵢ − ȳ) / √[Σⁿᵢ₌₁(xᵢ − x̄)² · Σⁿᵢ₌₁(yᵢ − ȳ)²]
Example
Suppose the totals from the computation table are: Σx = 60, Σy = 70, Σ(x − x̄)² = 44,
Σ(y − ȳ)² = 50 and Σ(x − x̄)(y − ȳ) = 33. Then

r = 33 / √(44 × 50) = 33 / (10√22) = 0.7035
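The computation above can be carried out in Python. The helper below implements equation (1), the covariance (deviation-score) method; the quiz scores are made-up illustrative data:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation via the covariance (deviation-score) method,
    i.e. equation (1): r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_yy = sum((yi - ybar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

# Illustrative (made-up) scores for five students on two quizzes.
quiz1 = [1, 2, 3, 4, 5]
quiz2 = [2, 4, 5, 4, 5]
print(round(pearson_r(quiz1, quiz2), 4))  # 0.7746
```

The same function can be applied to any pair of equal-length score lists, such as the practice data given later in this section.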
The concept of correlation provides information about the extent of the relationship between two
variables. Two variables are correlated if they tend to ‘go together’. For example, if high scores
on one variable tend to be associated with high scores on a second variable, then both variables
are correlated. Correlations aim at identifying relationships between variables and also to be able
to predict performances based on known results.
The statistical summary of the degree and direction of the linear relationship or association
between any two variables is given by the coefficient of correlation. Correlation coefficients range
between – 1.0 𝑎𝑛𝑑 + 1.0. Correlation coefficients are normally represented by the
symbols, 𝑟 (for sample) and 𝜌 (rho) (for populations). When two sets of data are strongly linked
together we say they have a High Correlation. The word Correlation is made of Co- (meaning
"together") and Relation (related or have something in common)
Correlation is perfect when the value of r is 1 or −1.
1. When the relationship is strong (high), the value of the correlation coefficient, r, is greater
than 0.60 or less than −0.60, i.e. r > 0.60 or r < −0.60.
2. When the relationship is moderate (mild), the value of the correlation coefficient, r, lies
between 0.40 and 0.60 or between −0.60 and −0.40.
3. When the relationship is weak (low), the value of the correlation coefficient, r, is less than
0.40 and greater than −0.40, i.e. −0.40 < r < 0.40.
4. When the relationship is perfect, the value of the correlation coefficient, r, is 1.0
or −1.0, i.e. r = 1.0 for perfect positive and r = −1.0 for perfect negative.
5. When there is no linear relationship, the value of the correlation coefficient, r, is 0.0.
• Correlation is positive when there is direct relation. I.e. One variable increases as the
other also increases and
• Correlation is Negative when one value decreases as the other increases
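The cut-offs listed above can be collected into a small helper function. This is an illustrative sketch, not part of the notes' formal content; the labels follow the guideline values given:

```python
def describe_r(r):
    """Classify a correlation coefficient using the cut-offs in the notes:
    perfect (|r| = 1), strong (|r| > 0.60), moderate (0.40 <= |r| <= 0.60),
    weak (|r| < 0.40), and no linear relationship (r = 0)."""
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 0:
        return "no linear relationship"
    if size == 1:
        strength = "perfect"
    elif size > 0.60:
        strength = "strong"
    elif size >= 0.40:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"

print(describe_r(0.74))   # strong positive
print(describe_r(-0.25))  # weak negative
```

Such a helper is convenient when interpreting many computed coefficients at once, e.g. across several class tests.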
Scatter Plots
A scatter plot or scatter diagram gives a pictorial representation of the two variables and shows
the nature of the relationship between the two variables. It is important that scatter plots are drawn
before any analysis is done on the variables. This is because scatter plots could either be linear or
curvilinear.
The following diagrams show the trend or nature of correlation between dependent and
independent variables.
How can we determine the strength of association based on the Pearson correlation coefficient?
The stronger the association of the two variables, the closer the Pearson correlation coefficient, r,
will be to either +1 or -1 depending on whether the relationship is positive or negative,
respectively. Achieving a value of +1 or -1 means that all your data points are included on the
line of best fit - there are no data points that show any variation away from this line. Values for r
between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation around the line of
best fit. The closer the value of r to 0 the greater the variation around the line of best fit. Different
relationships and their correlation coefficients are shown in the diagram
Strength of Association    Positive r      Negative r
Small                      0.1 to 0.3      −0.1 to −0.3
Medium                     0.3 to 0.5      −0.3 to −0.5
Large                      0.5 to 1.0      −0.5 to −1.0
Remember that these values are guidelines and whether an association is strong or not will also
depend on what you are measuring.
Does the Pearson correlation coefficient indicate the slope of the line?
It is important to realize that the Pearson correlation coefficient, r, does not represent the slope of
the line of best fit. Therefore, if you get a Pearson correlation coefficient of +1 this does not mean
that for every unit increase in one variable there is a unit increase in another. It simply means that
there is no variation between the data points and the line of best fit. This is illustrated below:
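This can be checked numerically. The sketch below uses made-up data and a small helper implementing the deviation-score formula: two exactly linear data sets with different slopes both give r = 1.

```python
from math import sqrt

def pearson_r(x, y):
    # Pearson r via the deviation-score (covariance) formula.
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    s_xx = sum((a - xbar) ** 2 for a in x)
    s_yy = sum((b - ybar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)

x = [1, 2, 3, 4, 5]
steep  = [2 * xi for xi in x]    # line with slope 2
gentle = [0.5 * xi for xi in x]  # line with slope 0.5

# Both data sets lie exactly on a straight line, so r = 1 for each,
# even though the slopes differ.
print(pearson_r(x, steep), pearson_r(x, gentle))  # 1.0 1.0
```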
There are four assumptions that are made with respect to Pearson's correlation:
5. Obtain the value of n(n² − 1).
6. Divide the result in Step 4 by the result in Step 5.
7. Subtract the result from 1 to obtain ρ (rho).
The result, ρ = 0.74 shows that there is a strong positive relationship between Quiz 1 and Quiz 2.
Follow the steps and calculate the rank-order correlation coefficient for the data below.
Student 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Quiz 1 7 6 14 14 18 3.5 3.5 19 20 8.5 17 5 16 12 8.5 1.5 11 10
Quiz 2 6 18 12 14 7 23 20 4 9 19 18 5 17 8 13 2 10.5 10.5
If your answer is 0.87 or close, then congratulations, you have done well.
If your answer is very different, then check your steps and your calculations again.
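The steps above can be sketched in Python using the rank-order formula ρ = 1 − 6Σd² / n(n² − 1). The two small rank lists below are illustrative (and assume no tied ranks, for simplicity):

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rank-order correlation from two lists of ranks,
    using rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Identical rankings give rho = 1; completely reversed rankings give rho = -1.
print(spearman_rho([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
print(spearman_rho([1, 2, 3], [3, 2, 1]))              # -1.0
```

For real quiz data with tied ranks (such as the 3.5 and 10.5 ranks above), tied scores are first assigned the average of the rank positions they occupy before the formula is applied.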
Using the Product Method
This method uses the product of the two variable and squares of each variable for the
computations. The formula is given below.
Using the product formula:

r_xy = 330 / √((440)(500)) = 330 / 469.04 = 0.70
You will notice that the answer we obtained here is the same as the answer we got with the
covariance method. It does not matter, therefore, which method is used; the answers will always
be exactly equal or very close.
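A sketch of the product (raw-score) method in Python, using made-up quiz scores; for any data set it gives the same value as the covariance method, up to rounding:

```python
from math import sqrt

def pearson_product(x, y):
    """Pearson r via the raw-score (product) formula:
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Illustrative (made-up) quiz scores for five students.
quiz1 = [1, 2, 3, 4, 5]
quiz2 = [2, 4, 5, 4, 5]
print(round(pearson_product(quiz1, quiz2), 4))  # 0.7746
```

The raw-score formula is often preferred for hand computation because it avoids working with deviations from the means.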
Now, follow the steps and calculate the correlation coefficient for the data below, treating
students' scores in Quiz 1 as predictors of their scores in Quiz 2, using equation (1).
Student 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Quiz 1 14 16 10 10 8 18 18 8 8 13 10 16 10 12 13 20 13 12 20
Quiz 2 13 14 13 11 12 15 15 10 11 14 14 14 11 12 13 15 12 12 16
Exercise
Calculate the correlation coefficient for the data below.
Student (xi) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Mid-sem 11 17 12 15 8 15 16 10 17 12 17 15 12 14 13 15 20 20 12 9
End-sem 10 14 15 16 12 16 15 15 18 16 18 18 15 16 10 12 20 19 14 11
Given the data below, compute the Pearson product moment correlation coefficient using:
1. The covariance method
2. The product method
Interpret your result in relation to the strength of relationship between students’ performance in
Mid-semester and end of semester examinations.
Student  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
Physics  70 75 88 56 60 80 45 50 68 90 40 55 74 64 58 64 80 65 90 88
History  50 60 48 85 60 70 92 86 72 58 85 45 66 75 80 70 46 60 42 50
Construct Validity:
Construct validity is the degree to which a test measures an intended hypothetical construct.
Many times psychologists assess / measure abstract attributes or constructs. The process of
validating the interpretations about that construct as indicated by the test score is construct
validation. A construct validation may be defined as the process of determining the extent to
which performance on assessment can be interpreted in terms of one or more constructs. A
construct is an individual characteristic that we assume exists in order to explain some aspect of
behaviour (e.g. mathematical reasoning, mathematical conceptualization, abstraction and
generalization, perception and anxiety-related behaviours). This can be done experimentally,
e.g., if we want to validate a measure of mathematical reasoning or anxiety. If we hypothesize
that anxiety increases when subjects are under the threat of an electric shock, then the threat of
an electric shock should increase anxiety scores (Note: not all construct validation is this
dramatic!)
A correlation coefficient is a statistical summary of the relation between two variables. It is the
most common way of reporting the answer to such questions as the following: Does this test
predict performance on the job? Do these two tests measure the same thing? Do the ranks of
these people today agree with their ranks a year ago?
According to Cronbach, to the question “what is a good validity coefficient?” the only sensible
answer is “the best you can get”, and it is unusual for a validity coefficient to rise above 0.60,
though that is far from perfect prediction.
All in all, we need to always keep in mind the contextual questions: What is the test going to be
used for? How expensive is it in terms of time, energy and money? What implications are we
intending to draw from the test scores? Several methods can be used to establish the construct
validity of your test results.
Furthermore, the teacher must be aware of the many factors which may influence the
validity of tests, measurement, or evaluation results at any given time in the assessment
process. Therefore, the teacher must pay attention to:
(1) the test;
(2) administration and scoring;
(3) pupil’s responses;
(4) the group and the criterion.
Factors which may influence Validity:
1. Factors in the test:
a. Unclear directions
b. Poor sentence structure
c. Inappropriate level of difficulty of items
d. Poorly constructed test items
e. Ambiguity
f. Test items inappropriate for items being measured
g. Test too short
h. Improper arrangement of items
i. Identifiable patterns of items
CHAPTER FIVE
RELIABILITY
Reliability refers to how consistently an assessment measures students' knowledge, skills and
understanding.
Reliability also refers to the consistency of assessment scores over time for a population of
individuals or a group.
In general, reliability refers to the degree to which students' assessment results remain the same
across repeated administrations.
Test reliability
Applied to tests, test reliability refers to the consistency of the scores obtained by the same
individuals when examined with the same test (or with alternate forms) on different occasions.
Test-retest, equivalent forms and split-half reliability are all determined through correlation.
SCORE
Obtained scores:
When you conduct any test, the scores or marks your students obtain when you assess and
mark them are called obtained scores. These obtained scores can contain errors.
True scores:
The true score is the portion of the observed score that contains no measurement errors.
Error Scores:
The error score is the remaining portion of the obtained score when the hypothetical true score is
taken away from it. It is referred to as error of measurement. For example, if a student is assessed
ten times and obtained scores recorded as follows:
Test  Obtained scores  Error scores (e)  e²
1 68 3 9
2 68 3 9
3 57 −8 64
4 70 5 25
5 70 5 25
6 69 4 16
7 72 7 49
8 65 0 0
9 55 −10 100
10 56 −9 81
If the mean score is 65 and the error variance is 37.8, then the standard deviation is 6.15. The
concept of reliability focuses on the consistency of assessment results whilst the measurement of
error focuses on inconsistency of assessment results.
The standard error of measurement (SEM) is the standard deviation of the errors of measurement
associated with the test scores for a specified group of test takers. It is the measure of the
variability of the errors of measurement.
SEM = σx √(1 − r)
where SDx = σx = the standard deviation of the obtained scores of the group and
r = the reliability coefficient.
If a test has a standard deviation of 10 and a reliability coefficient of 0.89, then the standard error
of measurement will be SEM = 10√(1 − 0.89) = 3.32.
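The calculation can be checked with a short Python sketch of the formula:

```python
# SEM = sd * sqrt(1 - r): SD of obtained scores times sqrt of (1 - reliability).
from math import sqrt

def sem(sd_obtained, reliability):
    return sd_obtained * sqrt(1 - reliability)

print(round(sem(10, 0.89), 2))  # 3.32, matching the worked example
```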
Test-retest Reliability:
Test-retest reliability is the degree to which scores are consistent over time. It indicates score
variation that occurs from testing session to testing session as a result of errors of measurement.
Problems: Memory, Maturation and Learning can contribute to score variation.
Consistency or stability over time is measured by test-retest reliability. This type of reliability
is in line with the traditional view of reliability, and is usually measured by correlating tests
given to a group of subjects twice over a suitable period, during which nothing has happened to the
participants to affect their results. Therein lies the major disadvantage of this method of
reliability. Other problems concern the influence of the first test on the retesting; perhaps
there is some type of learning effect whereby taking the first test teaches one how to take the second
test. The problems of history and maturation are additional limitations of this type of reliability.
Parallel-form reliability:
The other form of stability over time is parallel-form reliability. This reliability is determined by
correlating two forms of a test that measure the same concept. This form of reliability assumes
that the two test versions are equivalently worded and that word and reading difficulty are as similar as
possible, a hard criterion to satisfy (a limitation of the method).
Two tests that are identical in every way except for the actual items included are used when it is
likely that test takers will recall responses made during the first session and when alternate forms
are available. Correlate the two sets of scores. The obtained coefficient is called the coefficient of
stability or the coefficient of equivalence.
Problem: Difficulty of constructing two forms that are essentially equivalent. This is a method
used to provide a measure of the degree to which generalizations about students’ performance
from one assessment to another are justified. Both of the above require two administrations.
Split-Half Reliability:
Split-half reliability requires only one administration and is especially appropriate when the test is very long. The most
commonly used method of splitting the test into two is the odd-even strategy. Since longer tests
tend to be more reliable, and since split-half reliability represents the reliability of a test only half
as long as the actual test, a correction formula must be applied to the coefficient. Split-half
reliability is a form of internal consistency reliability and measures the internal consistency of a
test.
If you conceptualize reliability in terms of stability of the internal structure of a test then the split
half or internal consistency reliabilities are the preferred procedures. Split half reliability is
determined by correlating a sub-score obtained by adding up the first half of the test items with a
sub-score obtained by adding up the remaining items. For example, a teacher can give 50 short
questions, mark them, and take the scores of the even-numbered items as one half for assessment.
Since exposure to early test items may influence your score on later items (a limitation of this
form of reliability), usually the sum of the odd items is correlated with the sum of the even items.
Another limitation of the split-half method is that the reliability is based on just half the test
items, not the items of the total test. This restriction of the number of items will lead to an
underestimation of the reliability. The split-half correlation therefore needs to be adjusted for
test length; the formula for accomplishing this adjustment is the Spearman-Brown formula.
Internal consistency refers to the consistency of the items comprising your instrument. You
might view this method of reliability as a logical extension of the split half method, where the test
items are viewed as individual sub-tests. Typically these protocols are made of a variety of
questions (items) that are responded to on a dichotomous format. For example, a yes (1) or no
(0) format or a true (1)-false (0) format. The task now is determining the mean correlation among
the various items comprising the test. These correlations are typically calculated by means of a
phi correlation coefficient. This average correlation is then combined across items to give
Cronbach's alpha (Cronbach, 1947).
The formula is as follows:
α = N·r̄ / (1 + (N − 1)·r̄)
where N is the number of items in the test and r̄ is the average correlation among the test items.
If the average ‘r’ is small, then the alpha approaches zero.
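The behaviour of alpha under this formula can be sketched in Python; the 20-item test with an average inter-item correlation of 0.25 is a purely hypothetical example:

```python
# Alpha from the average inter-item correlation: alpha = N*r / (1 + (N - 1)*r)
def alpha_from_mean_r(n_items, mean_r):
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)

# Hypothetical example: 20 dichotomous items with an average inter-item r of 0.25
print(round(alpha_from_mean_r(20, 0.25), 2))  # 0.87
print(alpha_from_mean_r(20, 0.0))             # 0.0: a small average r drives alpha down
```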
Rational equivalence reliability is not established through correlation; rather, it estimates
internal consistency by determining how all items on a test relate to all other items and to the total
test. It is an estimate of reliability that is essentially equivalent to the average of the split-half
reliabilities computed for all possible halves.
There are several statistical indexes that may be used to measure the amount of internal
consistency for an exam. The most popular index (and the one reported in Testing & Evaluation’s
item analysis) is referred to as Cronbach’s alpha. Cronbach’s alpha provides a measure of the
extent to which the items on a test, each of which could be thought of as a mini-test, provide
consistent information with regard to students’ mastery of the domain. In this way, Cronbach’s
alpha is often considered a measure of item homogeneity; i.e., large alpha values indicate that the
items are tapping a common domain.
The formula for Cronbach's alpha is as follows:
α = (k/(k − 1)) × (1 − Σᵢ pᵢ(1 − pᵢ)/σ²ₓ)
where k is the number of items on the exam; pᵢ, referred to as the item difficulty, is the proportion of
examinees who answered item i correctly; and σ²ₓ is the sample variance for the total score.
To illustrate, suppose that a five-item multiple-choice test was administered with the following
proportions of correct response:
p₁ = 0.4, p₂ = 0.5, p₃ = 0.6, p₄ = 0.75, p₅ = 0.85, and σ²ₓ = 1.84.
Substituting these values gives α = (5/4)(1 − 1.045/1.84) ≈ 0.54.
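The substitution can be verified with a short Python sketch of the formula, using the item difficulties and total-score variance given above:

```python
# Cronbach's alpha for dichotomous items (the KR-20 form):
# alpha = (k / (k - 1)) * (1 - sum(p_i * (1 - p_i)) / total_score_variance)
def cronbach_alpha(p, total_var):
    k = len(p)
    item_var = sum(pi * (1 - pi) for pi in p)   # sum of item variances, p(1 - p)
    return (k / (k - 1)) * (1 - item_var / total_var)

p = [0.4, 0.5, 0.6, 0.75, 0.85]   # item difficulties from the example
alpha = cronbach_alpha(p, 1.84)   # 1.84 is the total-score sample variance
print(round(alpha, 2))  # 0.54
```

This value, about 0.54, is the one reused in the Spearman-Brown example later in the chapter.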
Cronbach’s alpha ranges from 0 to 1.00, with values close to 1.00 indicating high consistency.
Professionally developed high-stakes standardized tests should have internal consistency
coefficients of at least .90. Lower-stakes standardized tests should have internal consistencies of
at least .80 or .85. For a classroom exam, it is desirable to have a reliability coefficient of .70 or
higher.
High reliability coefficients are required for standardized tests because they are administered only
once and the score on that one test is used to draw conclusions about each student’s level on the
trait of interest. It is acceptable for classroom exams to have lower reliabilities because a student’s
score on any one exam does not constitute that student’s entire grade in the course. Usually grades
are based on several measures, including multiple tests, homework, papers and projects, labs,
presentations, and/or participation.
Suggestions for Improving Reliability
There are primarily two factors at an instructor’s disposal for improving reliability: increasing
test length and improving item quality.
1. Test Length. In general, longer tests produce higher reliabilities. This may be seen in the old
carpenter's adage, "measure twice, cut once". Intuitively, this also makes a great deal of sense.
Most instructors would feel uncomfortable basing midterm grades on students' responses to a
single multiple-choice item, but are perfectly comfortable basing midterm grades on a test of 50
multiple-choice items. This is because, for any given item, measurement error represents a large
percentage of students’ scores. The percentage of measurement error decreases as test length
increases. It is evident that even very low achieving students can answer a single item correctly,
especially through guessing. However it is much less likely that low achieving students can
correctly answer all items on a 20-item test.
Although reliability does increase with test length, the reward is more evident with short tests
than with long ones. Increasing test length by 5 items may improve the reliability substantially if
the original test was 5 items, but might have only a minimal impact if the original test was 50
items. The Spearman-Brown prophecy formula (shown below) can be used to predict the
anticipated reliability of a longer (or shorter) test given a value of Cronbach’s alpha for an existing
test.
α_new = m·α_old / (1 + (m − 1)·α_old)
where α_new is the new reliability estimate after lengthening (or shortening) the test; α_old is the
reliability estimate of the current test; and m equals the new test length divided by the old test
length. For example, if the test is increased from 5 to 10 items, m is 10/5 = 2.
Consider the reliability estimate for the five-item test used previously (α = 0.54). If the test is
doubled to include 10 items, the new reliability estimate would be
α_new = 2(0.54)/(1 + (2 − 1)(0.54)) = 0.70,
a substantial increase. Note, however, that increasing a 50-item test (with the same reliability) by
5 items will result in a new test with a reliability of just 0.56. It is important to note that in order
for the Spearman-Brown formula to be used appropriately, the items being added to lengthen a
test must be of a similar quality as the items that already make-up the test. In addition, before
lengthening a test, it is important to consider practical constraints such as time limit and examinee
fatigue. As a general guideline, it is wise to use as many items as possible while still allowing
most students to finish the exam within a specified time limit.
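Both calculations in this section can be reproduced with a small Python sketch of the prophecy formula:

```python
# Spearman-Brown prophecy: alpha_new = m*alpha_old / (1 + (m - 1)*alpha_old),
# where m = new test length / old test length.
def spearman_brown(alpha_old, old_len, new_len):
    m = new_len / old_len
    return m * alpha_old / (1 + (m - 1) * alpha_old)

print(round(spearman_brown(0.54, 5, 10), 2))   # 0.7: doubling the 5-item test
print(round(spearman_brown(0.54, 50, 55), 2))  # 0.56: 5 more items on a 50-item test
```

Comparing the two calls makes the diminishing-returns point concrete: the same 5 added items raise reliability substantially for a short test but barely at all for a long one.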
Item Quality
Item quality has a large impact on reliability in that poor items tend to reduce reliability while
good items tend to increase reliability. How does one know if an item is of low or high quality?
The answer lies primarily in the item’s discrimination. Items that discriminate between students
with different degrees of mastery based on the course content are desirable and will improve
reliability. An item is considered to be discriminating if the “better” students tend to answer the
item correctly while the “poorer” students tend to respond incorrectly.
Item discrimination can be measured with a correlation coefficient known as the point-biserial
correlation (rpbi). rpbi is the correlation between students’ scores on a particular item (1 if the
student gets the item correct and 0 if the student answers incorrectly) and students’ overall total
score on the test. A large, positive rpbi indicates that students with a higher test score tended to
answer the item correctly while students with a lower test score tended to respond incorrectly.
Items with small, positive rpbi’s will not improve reliability much and may even reduce reliability
in some cases. Items with negative rpbi’s will reduce reliability. For a classroom exam, it is
preferable that an item’s rpbi be 0.20 or higher for all items. Note that the item analysis provided
by Testing and Evaluation Services reports the rpbi for each item.
Regarding item difficulty, it is best to avoid using too many items that nearly all of the students
answer correctly or incorrectly. Such items do not discriminate well and tend to have very low
rpbi’s. In general, 3-, 4-, and 5- alternative multiple-choice items that are answered correctly by
about 60% of the students tend to produce the best rpbi’s. For 2-alternative items, the target item
difficulty is 75% correct.
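Since rpbi is simply a Pearson correlation between a 0/1 item score and the total test score, it can be sketched in Python. The ten response patterns below are hypothetical, chosen so that higher scorers tend to answer the item correctly:

```python
# Point-biserial correlation: the Pearson r between a 0/1 item score and total score.
from math import sqrt

def point_biserial(item, total):
    n = len(item)
    mi, mt = sum(item) / n, sum(total) / n
    cov = sum((a - mi) * (b - mt) for a, b in zip(item, total))
    var_i = sum((a - mi) ** 2 for a in item)
    var_t = sum((b - mt) ** 2 for b in total)
    return cov / sqrt(var_i * var_t)

# Hypothetical data for 10 examinees: higher scorers tend to answer the item correctly
item  = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
total = [48, 45, 44, 40, 39, 35, 33, 42, 30, 28]
r_pbi = point_biserial(item, total)
print(round(r_pbi, 2))  # a large positive value: the item discriminates well
```

An item with rpbi well above the 0.20 guideline, as here, would be kept; a near-zero or negative value would flag the item for revision.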
Reliability can also be expressed in terms of the standard error of measurement. It is an estimate
of how often you can expect errors of a given size.
Some of the factors that might affect the degree of reliability of test results include:
CHAPTER SIX
MEASURES OF CENTRAL TENDENCY
Purpose One
Suppose you have 200 students in your class. If you give them a quiz and you mark it, you will
have 200 scores before you. What meaning can you give to the performance of the class? Has
the class performed well? Has the performance of the class been poor? To answer these
questions, it will not be wise to start calling out the names of the individual students and their
scores. That would be a tiring and fruitless exercise. The best thing to do is to compute a typical
score. This typical score would be either the mean, the median or the mode.
Purpose Two
They help to know the level of performance by comparing with a given standard performance.
Very often teachers are asked about the general performance of their students. The answers are
often like "The performance this year is very poor", "This year the students did not do well at
all", or "Oh, I tell you, my students did extremely well this year." These responses are based on
subjective comparisons. Phrases such as very poor, did not do well, and extremely well do not have
any scientific basis. One teacher's perception of "poorness" may differ from another's.
To solve the problem of subjectivity and ambiguity, measures of central tendency are obtained
for a group and these measures are compared with a known standard. Therefore, instead of saying
the performance is poor, a teacher can say, the performance is above average, or average, or
below average, where average would be the known standard or criterion.
In a school where the grading system is A, B, C, D and E, the average performance could be C,
and the midpoint of the C range can be taken to be the standard or criterion. For example, if the
C range is from 60 – 70, then a possible standard would be 65 (i.e. (60 + 70)/2). In some situations,
there is a pass or fail category, based on a pass mark. Suppose the pass mark is 55, those who
score 55 and above have passed and those below 55, have failed. The standard or criterion is
therefore 55.
For individual cases, a measure of central tendency or location can be taken as the standard or
criterion for comparison. Instead of an individual responding to a question about his/her
performance as poor, very good, excellent, it is better to say performance is above average, far
above average, below average or far below average. In this instance, there is no subjectivity.
Performances are being compared to actual values that are taken as an average.
Consider the following set of scores for 40 pupils in a Social Studies class.
68 42 58 45 60 72 80 50 70 90
75 80 45 60 72 85 60 75 58 62
48 65 60 65 55 48 74 68 66 59
36 90 54 58 62 68 44 90 65 78
The mean for these scores is 64.0 and the median is 63.50. If it is assumed that the average or
standard or criterion performance is 60, then one can say that the performance of this class is
above average since 64 and 63.5 are above the standard of 60.
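The claim can be checked with a short Python sketch using the 40 scores above:

```python
# Mean and median of the 40 Social Studies scores, compared with the criterion of 60.
from statistics import median

scores = [68, 42, 58, 45, 60, 72, 80, 50, 70, 90,
          75, 80, 45, 60, 72, 85, 60, 75, 58, 62,
          48, 65, 60, 65, 55, 48, 74, 68, 66, 59,
          36, 90, 54, 58, 62, 68, 44, 90, 65, 78]

m = sum(scores) / len(scores)   # 64.0
md = median(scores)             # 63.5
print(m, md)
print(m > 60 and md > 60)       # both exceed the assumed criterion of 60
```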
Purpose Three
They give the direction of student performance. One can compute the values of the mean,
median and mode and make comparisons.
1. Where Mean > Median, or Mean > Mode, or Median > Mode, the distribution is skewed
to the right (positive skewness), showing that performance tends to be low.
2. Where Mean < Median, or Mean < Mode, or Median < Mode, the distribution is skewed
to the left (negative skewness), showing that performance tends to be high.
For example, in a class test, the following values may be obtained for the measures of central
tendency: Mean = 55, Median = 60, Mode = 75.
You can observe that the Mean is less than the Median and is also less than the Mode. Also the
Median is less than the Mode. This information implies that the performance tended to be high
in this class test. The frequency polygon below illustrates the point.
[Frequency polygon: a negatively skewed curve with the Mean below the Median, and the Median below the Mode]
However, if the mean, median and mode have the same value, then the distribution of the values
is normal. This is illustrated below. Mean = Median = Mode
Exercise
1. In a class quiz, a mean of 48 was obtained with a median of 62. How would the
performance of the class be described?
(a) Average
(b) Below average
(c) High
(d) Low
2. Measures of location can be used to determine the direction of student performance.
(a) True
(b) False
3. In a class test, the mean was 55 and the mode was 68. Performance is therefore high.
(a) True
(b) False
4. The median in an entrance examination was 62 with a mode 54. Performance of the
group was low.
(a) True
(b) False
5. When the mean is equal to the median, performance is skewed to the right
(a) True
(b) False
6. When a distribution is negatively skewed, the mode is less than the mean.
(a) True
(b) False
The arithmetic mean is often represented by the symbol x̄, pronounced "X bar". The mean
can be computed from raw (ungrouped) data and from grouped data. It can also be easily
obtained from Microsoft Excel, SPSS and other statistical software.
Given the following scores, 15, 12, 10, 10, 9, 20, 14, 11, 13, 16, to obtain the mean, all the
scores are added and divided by the total number of observations.
The mean for the scores above is:
x̄ = (15 + 12 + 10 + 10 + 9 + 20 + 14 + 11 + 13 + 16)/10 = 130/10 = 13
The above expression can be written in algebraic form, as learnt in Session 2, as
x̄ = Σx/N = 130/10 = 13
The general equation is written as x̄ = Σx/N
Where N is the total number of observations.
The following steps are used, when given a frequency distribution table.
Step 1. Obtain the class marks or class midpoints.
Step 2. Multiply the class marks or class midpoints by their corresponding frequencies to form an 𝑓𝑥 column.
Step 3. Add the values in the 𝑓𝑥 column.
Step 4. Divide the result in Step 3 by the total frequency to obtain the mean.
Now follow the example in Table
Step 2 Create a new column after the frequency column and give a heading, d.
Step 3. Choose the class that is in the middle of the distribution; if there is no exact
middle class, choose one of the two middle classes (preferably the class with the
higher frequency). Under the column, d, code this class with '0' (zero).
Step 4. Give a code of 1 to the class immediately above the class coded 0. The next higher
class is given a code of 2, the next higher one, a code of 3. Continue till you
reach the topmost class.
Step 5. Give a code of -1 to the class immediately below the class coded 0. The next lower
class is given a code of -2, the next lower one, a code of -3. Continue until you reach the
bottom class.
Step 6. Create another column 𝑓𝑑, where you put in the values of the product of
frequencies and the codes.
Step 7. Add the values in the 𝑓𝑑 column.
Step 8. Divide the result in Step 7 by the total frequency and multiply the result by
the class size, i.
Step 9. Add the result in Step 8 to the midpoint of the class coded 0 and obtain the final
answer. This midpoint is called the assumed mean (AM).
The nine steps above are summarized in the formula for the coding method as
mean (x̄) = AM + (Σfd/Σf)·i,
where AM is the assumed mean, f is the frequency, d is the code for each class, Σf is the total
frequency (N) and i is the class size.
Now follow the example in Table below
Applying the formula gives us x̄ = AM + (Σfd/Σf)·i = 33 + (3/50)(5) = 33 + 15/50 = 33.3
You will notice that both methods give the same result. The coding method is more appropriate
where the frequencies are large in value. It is also easier to use when the midpoints have
fractions such that multiplying them with the frequencies produces large values.
TRY
Use both methods to obtain the mean for the frequency distribution below.
Classes Frequency
61-70 15
51-60 20
41-50 25
31-40 17
21-30 12
11-20 11
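As a check on the TRY exercise, both methods can be sketched in Python; the choice of 45.5 as the assumed mean follows the rule above of taking the middle class with the higher frequency, and the answer produced is my own computation, so verify it by hand.

```python
# Mean for the TRY table by the midpoint (f*x) method and by the coding method.
classes = [(61, 70, 15), (51, 60, 20), (41, 50, 25),
           (31, 40, 17), (21, 30, 12), (11, 20, 11)]

n = sum(f for _, _, f in classes)                # total frequency
mids = [(lo + hi) / 2 for lo, hi, _ in classes]  # class midpoints

# Method 1: sum of f*x divided by the total frequency
mean_fx = sum(f * x for (_, _, f), x in zip(classes, mids)) / n

# Method 2: coding method, AM = 45.5 (midpoint of 41-50), class size i = 10,
# codes listed for the classes in the same top-to-bottom order as above
am, i = 45.5, 10
codes = [2, 1, 0, -1, -2, -3]
mean_coded = am + (sum(f * d for (_, _, f), d in zip(classes, codes)) / n) * i

print(mean_fx, mean_coded)  # the two methods agree
```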
This property also makes it possible to calculate the mean for a combined group if only the
means and the numbers of scores (N) are available, since N·x̄ = Σx.
For example, Sir Lovely’s class has a mean of 5 with a class of 20 while Mr IOD’s class has a
mean of 6 with a class size of 30. The mean for the combined class can be obtained by finding
the sum for Sir Lovely’s class and the sum for Mr IOD’s class. The results are added and
divided by the total number of students. The calculation is shown below.
Mean for the total group: x̄ = ((5 × 20) + (6 × 30))/50 = 280/50 = 5.6
4. If the mean is subtracted from each individual score and the differences are summed,
the result is 0. Given the scores 4, 2, 3, 6, 5 with a mean of 4, if we subtract the
mean from each individual score and sum up the results, we will get 0.
This is illustrated below.
4–4= 0
2 – 4 = -2
3 – 4 = -1
6–4= 2
5–4=1
The distance of the score from the mean is known as the deviation or the spread about the mean.
The values of 0, -2, -1, 2, and 1 are called deviations and the sum of the deviations is 0.
5. If the same value is added to or subtracted from every number in a set of scores, the
mean goes up or goes down by the value of the number. For example, given the scores,
8, 2, 10, 4, the mean, 𝑥̅ = 6. If we add 2 to each score we obtain, 10, 4, 12, 6, which
gives a mean of 𝑥 = 8 which is the original mean plus the value added to each score
i.e. 6 + 2.
6. If each score is multiplied or divided by the same value, the mean is multiplied or
divided by that value. For example, given the scores 8, 2, 10, 4, the mean x̄ = 6.
If we multiply each score by 3 we obtain 24, 6, 30, 12, which gives a mean of
18, which is the original mean times 3, i.e. 6 × 3.
1. It uses every score in the data set. Thus every score contributes to obtaining the mean.
2. It is the best summary score for a set of observations which is normal and there are no
extreme scores.
3. It is used a lot for further statistical analysis. As we shall see later, the two other
measures, median and mode, have limited statistical use.
1. It is useful when the actual magnitude of the scores is needed to get an average. For
example, to select a student to represent a whole class in a statistics competition, the
student’s total performance in statistics is used for selection.
2. Several descriptive statistics are based on the mean. These descriptive statistics such as
the standard deviation, variance, correlation coefficients, z-scores and T-scores are very
useful in teaching and learning. Without the mean, they cannot be computed.
3. It is the most appropriate measure of central tendency when the scores are
Symmetrically distributed (i.e. normal). A symmetrical or normal distribution does not
have extreme scores to influence the mean
5. It serves as a standard of performance with which individual scores are compared. For
example, for normally distributed scores where the mean is 56, an individual score of
80 can be said to be far above average. Performance can also be described as just
above average, far above average, just below average or far below average,
considering the individual scores.
Exercise
1. The mean score obtained by 10 students in a statistics quiz was 20 out of a total of 25.
It was found later that a student who obtained 5 actually had 20. How would the
discovery affect the mean?
(a) More information is needed
(b) New mean is greater than old mean.
(c) Old mean is greater than the new mean.
(d) There is no change in the old mean.
The table below gives the distribution of the ages of teachers in a school.
THE MEDIAN
The advantage of this procedure is that you do not need to rearrange the entire set of
scores. When you locate the score at the required position, you stop.
Class    X    f   cf
46-50   48    4   50
41-45   43    6   46
36-40   38   10   40
31-35   33   12   30
26-30   28    8   18
21-25   23    7   10
16-20   18    3    3
Total        50
The total frequency is 50, therefore 50/2 = 25. Now there is no 25 in the cumulative frequency
column, so we select the smallest value that is greater than 25. This value is 30, which belongs to
the 31 – 35 class. The median class is therefore 31 – 35. The lower class boundary is 30.5 and the
class size is 5. Substituting the values in the table into the formula above, we have
Median = 30.5 + ((25 − 18)/12) × 5 = 30.5 + (7/12) × 5 = 30.5 + (0.58)(5) = 30.5 + 2.9 = 33.4
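The interpolation formula can be sketched in Python (the argument names are my own):

```python
# Grouped-data median: L + ((N/2 - cf) / f) * i, with the values from the worked table.
def grouped_median(lower_boundary, n_total, cf_below, f_median, class_size):
    return lower_boundary + ((n_total / 2 - cf_below) / f_median) * class_size

# Median class 31-35: boundary 30.5, N = 50, cf below = 18, f = 12, class size 5
print(round(grouped_median(30.5, 50, 18, 12, 5), 1))  # 33.4
```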
The median has a number of properties that distinguishes it from the other measures of
central tendency. These properties are listed below.
1. It is often not influenced by extreme scores as the mean is. For example, the median
for the following numbers, 2, 3, 4, 5, 6, is 4. If 6 changes to 23 as an extreme score,
the median remains 4.
2. It does not use all the scores in a distribution but uses only one value.
3. It has limited use for further statistical work.
4. It can be used when there is incomplete data at the beginning or the end of the
distribution.
5. It is most appropriate for data from interval and ratio scales.
6. Where there are very few observations, the median is not representative of the data.
7. Where the data set is large, it is tedious to arrange the data in an array for ungrouped data
computation of the median.
1. It has limited use in further statistical work. Most statistical distributions are assumed normal,
so the median does not come into focus much.
2. Where there are very few scores or an odd pattern of scores, the median may not be accurate.
For example, in a class of 20 where 15 students had 10, 4 students had 18 and 1 student had
20, the distribution of scores looks like this: 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 18, 18, 18, 18, 20.
What would be the middle score? In a situation like this, an estimate of the median may not be
accurate.
3. It uses very little of the information available in the set of scores. It depends on only one score
and ignores information at the ends of the distribution. It does not use all the scores in the
distribution.
4. It cannot be used where the variables are from the nominal scale of measurement.
5. It is not sensitive to changes in the distribution, except where the changes occur in the middle
of the distribution.
Uses of the Median
As a measure of central tendency, the classroom teacher and other educational practitioners will
find the median as a useful measure of central tendency or location when there is reason to believe
that the distribution is skewed. For skewed distributions, the best measure of central tendency,
which provides a summary score or the typical score, is the median.
2. It is used as the most appropriate measure of location when there are extreme scores to affect
the mean. For example in an establishment of senior and junior staff, the best measure of the
‘average’ or ‘typical’ salary is the median because the senior staff salaries will inflate the mean.
4. It provides a standard of performance for comparison with individual scores when the scores
distribution is skewed. For example, if the median score is 60 and an individual student obtains
55, performance can be said to be below average/median. Also performance can be described as
just above average or far below average or just below average.
5. It can be compared with the mean to determine the direction of student performance.
Where Median < Mean, the distribution is skewed to the right (positive skewness), showing that
performance tends to be low, and where Median > Mean, the distribution is skewed to the left
(negative skewness), showing that performance tends to be high.
3. The median score for a group of 19 students was 58. A 20th student who had a score of 45
joined the group. What is the new median score?
A. 10.5
B. 45.0
C. 58.0
D. It cannot be determined
4. The median score for 15 students in a test was 67. Fourteen of the students had a median
score of 66. What was the score for the 15th student?
A. 66
B. 67
C 68
D. More information is required.
5. One limitation of the median as a measure of location is that it
A. can be used when data is incomplete.
B. depends largely on extreme scores.
C. is inappropriate for skewed distributions.
D. uses few values in a distribution.
THE MODE
In Set 1, 18 occurred most frequently. It occurred 3 times. Therefore there is only one mode.
In Set 3, the values 42, 50, 62 and 68 occurred the same number of times, i.e. 2 times. There are
therefore 4 modes.
To avoid errors, a tally method is recommended. Here you list the numbers and, as each appears, you
represent it with a slash. At the end, find the value that has the greatest number of slashes.
From the distribution of raw data, the mode is 45. It appeared 5 times, which is more than the
others.
If you have not understood it, go over it again and find the mode for the following distribution.
MEASURES OF CENTRAL TENDENCY
Classes Frequency
61 -70 15
51-60 20
41-50 25
31-40 17
21-30 12
11-20 11
5. It can be compared with the median to determine the direction of students' performance.
Where Mode < Median, the distribution is skewed to the right (positive skewness), showing that
performance tends to be low, and where Mode > Median, the distribution is skewed to the left
(negative skewness), showing that performance tends to be high.
Exercise
1. One strength of the mode as a measure of location is that it is
2. One weakness of the mode as a measure of central tendency for a distribution is that, it
A. is appropriate for nominal scale data.
B. is used if there is incomplete data.
C. provides more than one modal score.
D. uses every score in the distribution.
3. The mode for a group of 19 students was 58. A 20th student had a score of 57.
What is the new mode?
A. 20
B. 57
C. 58
D. It cannot be determined.
4. The mode for a group of 30 students in a test was 55. For twenty-nine of the students, the
mode was 54. What was the score for the 30th student?
A. 1
B. 54
C. 55
D. More information is required.
5. Compute the mode, mean and median age in the following distribution and deduce
whether the distribution is positively skewed, negatively skewed or normal. Justify your
answer.
QUARTILES
Nature of Quartiles
Quartiles are individual scores of location that divide a distribution into 4 equal parts such that
each part contains 25% of the data. Practically, there are 3 quartiles: the first (lower) quartile, the
second (middle) quartile and the third (upper) quartile. The second (middle) quartile is the
median, which you studied in Session 4. The symbols used to represent the quartiles are:
Q1 – First (lower) quartile; Q2 – Second (middle) quartile; Q3 – Third (upper) quartile
Q1 Q2 Q3
Median
Quartiles can be computed from both ungrouped and grouped data. Our focus is on the lower
quartile and the upper quartile since we have studied the middle quartile (median) already.
Example.
Suppose you are given the following scores: 8, 10, 12, 7, 6, 13, 18, 25, 4, 22, 9.
1. Arrange the scores in ascending order as, 4, 6, 7, 8, 9, 10, 12, 13, 18, 22, 25
2. Median: The score at the (n + 1)/2 = (11 + 1)/2 = 12/2 = 6th position, which is 10. This gives
Q2 as 10.
3. Find the median for the first part: 4, 6, 7, 8, and 9. This gives Q1 as 7.
4. Find the median for the second part: 12, 13, 18, 22, and 25. This gives Q3 as 18.
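The median method above can be checked with a short Python sketch built on the standard library's `statistics.median`. The function name is ours; the data are the worked example's:

```python
from statistics import median

def quartiles_median_method(scores):
    """Median method: Q2 is the median of all scores; Q1 and Q3 are the
    medians of the lower and upper halves (the middle score is excluded
    from both halves when the number of scores is odd)."""
    xs = sorted(scores)
    n = len(xs)
    q2 = median(xs)
    if n % 2:                                   # odd: drop the middle score
        lower, upper = xs[:n // 2], xs[n // 2 + 1:]
    else:
        lower, upper = xs[:n // 2], xs[n // 2:]
    return median(lower), q2, median(upper)

print(quartiles_median_method([8, 10, 12, 7, 6, 13, 18, 25, 4, 22, 9]))
# (7, 10, 18) for the 11-score example above
```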
Formula method
Suppose you are given the following scores: 8, 10, 12, 7, 6, 13, 18, 25, 4, 22, 9.
1. Arrange the scores in ascending order as, 4, 6, 7, 8, 9, 10, 12, 13, 18, 22, 25
2. Find the ¼(11 + 1)th position. This gives us ¼ × 12 = 3rd position. The score at the
3rd position is 7, which is Q1.
3. Find the ¾(11 + 1)th position. This gives us ¾ × 12 = 9th position. The score at the 9th
position is 18, which is Q3.
For an even set of numbers, the positions may end up with fractions. Let us look at an
example. Suppose you are given a set of observations as: 8, 11, 26, 7, 12, 9, 6, 20, 14,
18, 10, 22.
There are 12 observations so this is an even number of scores.
To find the quartiles:
1. Arrange the scores in ascending order as, 6, 7, 8, 9, 10, 11, 12, 14, 18, 20, 22, 26
2. To obtain Q1, find the ¼(12 + 1)th position. This gives us the 3¼th position. This means
that Q1 lies between the 3rd and 4th positions. Now, at the 3rd position is 8 and at the 4th
position is 9. Multiply the difference between 8 and 9 by ¼. This gives us ¼ × 1 = ¼.
Add the answer ¼ to 8 to obtain Q1 as 8¼ or 8.25.
3. To obtain Q3, find the ¾(12 + 1)th position. This gives us the 9¾th position. This means
that Q3 lies between the 9th and 10th positions. Now, at the 9th position is 18 and at the
10th position is 20. Multiply the difference between 18 and 20 by ¾. This gives you
¾ × 2 = 1½. Add the answer 1½ to 18 to obtain Q3 as 19½ or 19.5.
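The position method, including the fractional-position interpolation just illustrated, can be sketched in Python. The function name is ours; the data and expected answers come from the even-set example above:

```python
def quartile(scores, q):
    """Position method: quartile q sits at the q(n + 1)/4-th position of
    the sorted scores (1-indexed). A fractional position is resolved by
    interpolating between the two neighbouring scores."""
    xs = sorted(scores)
    pos = q * (len(xs) + 1) / 4       # e.g. Q1 -> (n + 1)/4
    whole = int(pos)                  # whole part of the position
    frac = pos - whole                # fractional part of the position
    if frac == 0:
        return xs[whole - 1]
    # multiply the gap between the neighbouring scores by the fraction
    return xs[whole - 1] + frac * (xs[whole] - xs[whole - 1])

data = [8, 11, 26, 7, 12, 9, 6, 20, 14, 18, 10, 22]
print(quartile(data, 1))   # 8.25, as in the worked example
print(quartile(data, 3))   # 19.5, as in the worked example
```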
Now obtain the lower quartile and the upper quartile for the following set of numbers using both
the median method and the formula method.
45, 82, 75, 87, 60, 48, 92, 72, 65, 80, 65, 49, 52, 56, 68, 72, 64, 80, 70, 58
Computing the quartiles involves 4 simple steps. These steps are described below
that is greater than the position for Q3. It is the class that will contain the upper quartile. Find the
value ¾Σf, or ¾N, where Σf (or N) is the total frequency.
This is the position of the upper quartile. Checking from the cumulative frequency column, find
the value that is equal to the position or the smallest value that is greater than the position.
Step 3. Identify the lower class boundaries of the lower quartile and the upper quartile classes and
the class size.
Step 4. Apply the formula below by substituting the respective values into the formula:

Qk = Lk + ((kN/4 − CF)/fq) × i

where Lk is the lower class boundary of the quartile class, N is the total frequency, CF is the
cumulative frequency of the class below the quartile class, fq is the frequency of the quartile
class, and i is the class size.
This value is 18, which belongs to the 26 – 30 class. The lower quartile (Q1) class therefore is
26 – 30. The lower class boundary is 25.5 and the class size is 5.
Substituting the values in the table in the formula above, we have:
For Q3, ¾ × 50 = 37.5. Now there is no 37.5 in the cumulative frequency column so we select the
smallest value that is greater than 37.5. This value is 40, which belongs to the 36 – 40 class. The
upper quartile (Q3) class therefore is 36 – 40, the upper class boundary is 35.5 and the class size
is 5. Substituting the values in the table in the formula above, we have:

Q3 = 35.5 + ((150/4 − 30)/10) × 5 = 35.5 + ((37.5 − 30)/10) × 5 = 35.5 + (7.5/10) × 5
   = 35.5 + [0.75] × 5 = 35.5 + 3.75 = 39.25
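The grouped-data formula Qk = Lk + ((kN/4 − CF)/fq) × i can be checked in Python. The function name is ours; the substituted values are those of the Q3 computation in the text (lower boundary 35.5, N = 50, CF = 30, class frequency 10, class size 5):

```python
def grouped_quartile(L, k, N, CF, f, i):
    """Q_k = L + ((k*N/4 - CF)/f) * i  (interpolation in the quartile class).
    L  : lower class boundary of the quartile class
    N  : total frequency
    CF : cumulative frequency of the class below the quartile class
    f  : frequency of the quartile class
    i  : class size"""
    return L + (k * N / 4 - CF) / f * i

# Values from the worked Q3 example above:
print(grouped_quartile(L=35.5, k=3, N=50, CF=30, f=10, i=5))   # 39.25
```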
TRY:
Calculate Q1 and Q3 for the distribution in the table below.
Classes Frequency
61 – 70 15
51 – 60 20
41 – 50 25
31 – 40 17
21 – 30 12
11 – 20 11
Exercise
A. 10.5
B. 15.5
C. 42.0
D. 43.0
4. What is the third quartile in the following distribution?
12 18 10 19 22 25 17 20 14
A. 8
B. 13
C. 18
D. 21
CHAPTER SEVEN
The measures of variation are also called measures of variability, dispersion or scatter. The main
measures used in educational practice are:
1. The range
2. The Variance
3. The Standard Deviation
4. The Quartile Deviation (also known as the semi-interquartile range).
The variance and the standard deviation are closely related. The variance is the square of the
standard deviation and the standard deviation is the square root of the variance. Thus if the
variance is 144, the standard deviation is 12. If the standard deviation is 9, then the variance is 81.
You will notice that in the first data set the scores are close to each other. All the scores are close
to the mean of 49, or they cluster around the mean, which serves as the centre point. In the second
set, the scores are far from each other. For example, 4 is far from 90, but both sets have the
same mean.
The measures of variation tell us how far the scores are from each other. This information is
important for teaching and learning. If there is a big variation within a class, the teacher needs to
adopt a method to suit the wide dispersion of abilities.
However, if the variation is small, this means that all the students are at about the same level of
performance, which may be low, moderate or high. Again, the teacher needs to adopt the
appropriate teaching method to suit the class.
The measures of variation or variability serve two main purposes. These purposes are described
below.
1. Purpose One
Measures of variation are used as single scores to describe differences within data. They are
scores that are used to indicate whether there are variations in the group. Where there is variation,
the group is believed to be heterogeneous and where the scores are around a typical value, the
group is homogeneous.
Let us consider the following example.
Set 1: 44, 48, 40, 42, 42, 45, 43, 40, 40, 41, 40, 40, 41, 40, 40, 42, 46
Set 2: 20, 48, 50, 50, 50, 48, 12, 10, 55, 54, 48, 58, 59, 35, 24, 56, 30, 51, 30, 52
Now let us look at the highest score and the lowest score for each of the sets.
You will notice that though both sets of scores have mean scores of 42, the differences between
the highest and lowest scores differ. In set 1 it is 8 units and in set 2 it is 49 units.
For a heterogeneous class, the classroom teacher will notice that there are high achievers as well
as low achievers. It is a mixed ability group. As a teacher, you need a method that caters for the
high achievers, the moderate achievers as well as the low achievers.
On the other hand, where the class is homogeneous, the teacher has to find out the performance
by computing the measures of central tendency. In our example the mean is 42. Assume that the
proficiency level is 30. Since 42 is higher than 30, the class can be described as performing above
the proficiency level.
2. Purpose Two
Measures of variability are descriptive statistics. They are single numbers that are used to
describe a group. To know the correlation or relationship between groups, you need to obtain a
measure of variability. Here the most appropriate measures are the variance and the standard
deviation. Knowledge of the standard deviation or variance will help you to understand the
formula used in computing the correlation coefficient, which is a measure of the relationship
between variables.
THE RANGE
Nature of the range
The range is defined as the difference between the highest (largest) and the lowest (smallest)
values in a set of data. For example, for the set of data 48, 51, 47, 50, the largest value is 51 and
the smallest value is 47. The range is therefore 51 – 47 = 4. The range is the simplest of all the
measures of variation.
The range can be computed for both raw (ungrouped) data and grouped data. The procedures are
described below.
Three simple steps are involved in computing the range from raw data.
Consider the following set of scores:
14 22 8 56 46 28 30 17 29 10 60 40 33
The highest value (H) is 60 and the lowest value (L) is 8. The range is H − L = 60 − 8 = 52.
Now find the range for the following set of scores: 82 90 66 78 88 72 60 80
Classes Frequency
46 - 50 4
41 - 45 6
36 - 40 10
31 - 35 12
26 - 30 8
21 - 25 7
16 - 20 3
The bottom class is 16 − 20 and the lower class boundary (L) is 15.5. The top class is 46 − 50
and the upper class boundary (H) is 50.5. The range is H − L = 50.5 − 15.5 = 35.
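Both range procedures can be sketched in Python. The function names are ours; the data and boundaries come from the two examples above:

```python
def range_raw(scores):
    """Range of ungrouped data: highest score minus lowest score."""
    return max(scores) - min(scores)

def range_grouped(lowest_lower_boundary, highest_upper_boundary):
    """Range of grouped data: upper class boundary of the top class
    minus lower class boundary of the bottom class."""
    return highest_upper_boundary - lowest_lower_boundary

print(range_raw([14, 22, 8, 56, 46, 28, 30, 17, 29, 10, 60, 40, 33]))  # 52
print(range_grouped(15.5, 50.5))                                        # 35.0
```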
The range has a number of strengths and weaknesses. These are listed below
Weaknesses of the Range
1. It does not take into account all the data/scores. It uses only two values.
2. It ignores the actual spread of all the scores. It may therefore give a misleading picture of
the variation in the data.
3. It does not consider how the scores relate to each other.
4. It does not consider the typical observations in the distribution but considers only the
extreme values.
5. Different distributions can have the same range, which would lead to misleading conclusions.
6. It is only a crude or rough measure of variation.
Strengths of the Range
2. It may be necessary to require knowledge of only the extreme scores or the total spread in a set
of observations. In a test, a teacher may be interested in only the highest score and the lowest.
The range will conveniently serve that purpose.
Exercise
Compute the range for the following sets of data.
1. 18, 22, 48, 45, 90, 93, 65, 62, 28, 75, 15, 30, 35, 80, 82
2. 44, −8, 14, −14, 24, 28, −30, 52, 58, 40, 42, 48, 50, −1
3. −4, −15, −18, −56, −52, −40, −75, −18, −36, −19, −50, −55, 0
4.
Classes Frequency
61 - 70 15
51- 60 20
41 – 50 25
31 - 40 17
21 - 30 12
11 - 20 11
THE VARIANCE
The variance is the mean square deviation. It is defined as the mean of the squares of the
deviations of the scores from the mean of the distribution. The symbols used are σ² for the
population variance and s² for the sample variance.
The variance can be computed from both the raw (ungrouped) data and grouped data
Computing from Raw Data (Ungrouped Data)
The variance can be computed from raw data by using two formulae. These are the conventional
formula and the computational formula. The procedures are described below.

i. The conventional formula for variance is given by s² = Σ(x − x̄)²/n
Example: Given the set of scores, find the variance
15, 12, 10, 10, 9, 20, 14, 11, 13, 16
Answer 10.2
ii. The computational formula for variance is given by s² = Σx²/n − (Σx/n)²
Use this method to solve for the variance of
15, 12, 10, 10, 9, 20, 14, 11, 13, 16
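Both formulae can be checked in Python on the set of scores above; they give the same value (up to floating-point rounding for the computational form). The function names are ours:

```python
def variance_conventional(scores):
    """s^2 = sum((x - mean)^2) / n  (conventional formula)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / n

def variance_computational(scores):
    """s^2 = sum(x^2)/n - (sum(x)/n)^2  (computational formula)."""
    n = len(scores)
    return sum(x * x for x in scores) / n - (sum(scores) / n) ** 2

data = [15, 12, 10, 10, 9, 20, 14, 11, 13, 16]
print(variance_conventional(data))     # 10.2, agreeing with the text
print(variance_computational(data))    # 10.2 (up to floating-point error)
```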
Note:
The coding method can also be used to calculate the variance.
Properties of Variance
i. The variance of a constant is zero.
ii. It is not resistant. It is affected by extreme scores or outliers.
iii. The variance is independent of a change of origin.
iv. The variance is not independent of a change of scale.
Weaknesses of variance
1. It is influenced by extreme scores. It gives more weight to these extreme scores,
resulting in a wrong interpretation of results.
2. It is sensitive to a change in the value of any score in the distribution.
3. It cannot be computed if missing data is reported, since the variance depends on
every individual score.
4. It is not appropriate for judging the variation within a set of observations.
The standard deviation
It is the most widely used measure of variation. It is the square root of the mean square deviation.
It is denoted by the symbols σ or s.
The standard deviation can be computed from both the raw (ungrouped) data and grouped data.
Note: The coding method can also be used to calculate the standard deviation.
2. It is used a lot for statistical analysis
3. It is appropriate for scores that are normally distributed
Coefficient of variation
The coefficient of variation for both grouped and ungrouped data is given by
CV = (s/x̄) × 100%, where s is the standard deviation and x̄ is the mean.
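The coefficient of variation can be sketched in Python by combining the standard deviation (square root of the variance) with the mean. The function name is ours; the data are the variance example's scores (variance 10.2, mean 13):

```python
from math import sqrt

def coefficient_of_variation(scores):
    """CV = (s / mean) * 100, where s is the standard deviation
    (the square root of the variance)."""
    n = len(scores)
    mean = sum(scores) / n
    s = sqrt(sum((x - mean) ** 2 for x in scores) / n)
    return s / mean * 100

data = [15, 12, 10, 10, 9, 20, 14, 11, 13, 16]
print(round(coefficient_of_variation(data), 2))   # 24.57
```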
Exercise
1. The variance for a set of scores is 25. What is the standard deviation?
A. 2
B. 5
C. 25
D. 625
4. The standard deviation for a set of scores is 9. The variance for the same set of scores is 3.
A. True
B. False
TRY QUESTIONS[PASCO]
GENERAL INSTRUCTION: Answer all the questions in Sections A, B, and C, and one (1)
question from Section D.
INSTRUCTION: This section consists of 10 items. Circle the most appropriate option in ink
once only. One mark for each question.
1. Why is it necessary for the teacher to specify what he or she wants to assess?
A. To ensure easiness in the development procedure
B. To ensure the reliability of the procedure used
C. To ensure the selection of appropriate procedures
D. None of the above
2. Which of the following is the most specific?
A. Instructional aims
B. Instructional objectives
C. Educational goals
D. Educational outcomes
3. One of the general principles of assessment is that
A. Good assessments are provided by multiple indicators of performance
B. Good assessments focus on students’ critical thinking objectives
C. Assessment techniques require knowledge about student learning
D. Assessment techniques must serve the needs of the community
4. Taxonomy means the same as
A. Organization
B. Selection
C. Classification
D. Demarcation
5. The hierarchical sequence of Bloom’s taxonomy is
A. Knowledge, comprehension, synthesis, application, analysis and evaluation
B. Knowledge, comprehension, analysis, application, synthesis and evaluation
C. Knowledge, application, comprehension, analysis, synthesis and evaluation
D. Knowledge, comprehension, application, analysis, synthesis and evaluation
6. Which of the following statements depicts the nature of validity?
A. Assessment results may have high, moderate or low validity for a situation
B. A single validity is most appropriate for an evaluative judgement
C. Validity refers to the appropriateness of the test items to meet learning
objectives
D. Validity refers to whether the assessment measures what it purports to measure
7. When constructing her test items for the end-of-term examination in mathematics, Ms.
Sarpong checked each item to see if it matched the material that she taught the class.
What type of evidence was Ms. Sarpong looking for?
A. Construct-related evidence
B. Concurrent-related evidence
C. Content-related evidence
D. Predictive-related evidence
8. The objectivity of a test refers to the
A. Format of its items
B. Selection of items for the test
C. Use made of the results
D. Scoring of the students' responses
9. Which of the following item formats is the best to use to assess the analysis type of learning
behaviour?
A. Short answer type item
B. Essay items
C. Multiple choice items
D. True – false items
10. Instructional outcomes that aim at inculcating movement abilities in students are
concerned with the
A. Affective domain
B. Quellmalz domain
C. Cognitive domain
D. Psychomotor domain
11. The following scores were available for 9 students in an elective mathematics class.
18 20 15 12 12 10 8 17 13
The score for the 10th student was missing but it was known to be the second highest
score. What would be the median for the distribution?
A. 14
B. 15
C. 16
D. 17
12. The median score for a group of 19 students was 58. A 20th student who had a score of
45 joined the group. What is the new median score?
A. 10.5
B. 45.0
C. 58.0
D. More information is required
13. The variance for a set of scores is 25 and the mean score is 12.81. What is the coefficient
of variation?
A. 256%
B. 93%
C. 37.81%
D. 39%
14. The mean score obtained by 10 students in a statistics quiz was 20 out of a total of 25.
It was found later that a student who was recorded as obtaining 5 actually had 20. How would
the discovery affect the mean score?
A. More information is needed
B. The new mean score is greater than the old mean.
C. The old mean score is greater than the new mean.
D. There is no change in the old mean score.
INSTRUCTION: This section has ten (10) TRUE OR FALSE items. Write the appropriate
response in ink once only. One mark for each question.
18. One advantage of the essay-type test is that premium is placed on writing speed.
A. True
B. False
19. Educational goals are geared towards meaningful functioning of the society.
A. True
B. False
20. An assessment can be done through interviewing.
A. True
B. False
21. Statements that pose more than one central theme should be avoided when constructing
short-answer tests.
A. True
B. False
22. Test scores are perfect measures of student’s performance.
A. True
B. False
23. Assessment is necessary for making certification decision.
A. True
B. False
24. Test scores that have high validity are necessarily reliable.
A. True
B. False
25. Communication using gestures is an example of a sub-domain in the affective domain.
A. True
B. False
……………………………………………………………………………………………………
……………………………………………………………………………………………………
A test or examination needs to be planned before being administered and scored. Determine the
four principal stages involved in classroom testing. (4 marks)
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………
Explain the following terms (1 mark each)
a. Obtained score
……………………………………………………………………………………………………
……………………………………………………………………………………………………
b. True score
..........................................................................................................................................................
..........................................................................................................................................................
c. Error score
……………………………………………………………………………………………………
……………………………………………………………………………………………………
Differentiate between an essay test and an objective test. (2 marks)
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………
SECTION D (ESSAY)
INSTRUCTION: Answer any one (1) question. Each question carries 20 marks.
Question 1
(a) Outline the categories (levels) of Benjamin Bloom's (1956) cognitive domain. (6 marks)
(b) With a practical example, identify and demonstrate how to set an essay question on
each taxonomy of learning outcomes of educational activities on the topic, "graphs
of relations and functions". (9 marks)
(c) Compute the coefficient of skewness of the test scores and deduce whether the general
performance is good or weak. (5 marks)
Question 2
(a) Explain the view that "assessment is a means to an end and not an end in itself".
(5 marks)
(b) The following data represent the recorded scores of two quizzes conducted and marked
over 10.
Quiz 1(x) 3 3 4 5 6 7 7 8 9 6
Quiz 2(y) 4 6 5 4 6 8 7 7 9 9
Calculate the Pearson-product-moment correlation coefficient for the two quizzes and
interpret your results in relation to concurrent validity between the two assessments. (5 marks)
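For checking a hand computation of this kind of question, the computational form of the Pearson product-moment correlation, r = (nΣxy − ΣxΣy)/√((nΣx² − (Σx)²)(nΣy² − (Σy)²)), can be sketched in Python. The function name is ours; the quiz data are from the question above:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Computational form of the Pearson product-moment correlation:
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))"""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

quiz1 = [3, 3, 4, 5, 6, 7, 7, 8, 9, 6]
quiz2 = [4, 6, 5, 4, 6, 8, 7, 7, 9, 9]
print(round(pearson_r(quiz1, quiz2), 2))   # 0.74
```

A value of this size would usually be read as a fairly strong positive relationship between the two quizzes.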
(c) Suppose that a five-item multiple-choice exam was administered with the following
percentages of correct response: p1 = .4, p2 = .5, p3 = .6, p4 = .75, p5 = .85, and
σx² = 1.84. Compute the internal consistency of the test using the Cronbach's alpha
estimator

α = (k/(k − 1))(1 − Σ pᵢ(1 − pᵢ)/σx²)

where k is the number of items, pᵢ is the proportion of correct responses to item i, and σx² is
the variance of the total test scores.
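For binary (right/wrong) items, this estimator is the KR-20 form of Cronbach's alpha, and it can be checked with a short Python sketch. The function name is ours; the item proportions and total-score variance are those given in the question:

```python
def cronbach_alpha_binary(ps, total_var):
    """KR-20 form of Cronbach's alpha for binary items:
    alpha = k/(k-1) * (1 - sum(p*(1-p)) / total_var)"""
    k = len(ps)
    item_var = sum(p * (1 - p) for p in ps)   # sum of item variances p(1-p)
    return k / (k - 1) * (1 - item_var / total_var)

ps = [0.4, 0.5, 0.6, 0.75, 0.85]
print(round(cronbach_alpha_binary(ps, 1.84), 2))   # 0.54
```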
Question 3
(a) (i) Explain the term variability in test scores (8 marks)
(ii) The table below shows the end of term examination scores of Otwebeweate SHS
Science 1 class in elective mathematics, marked over 70.
Calculate the coefficient of variation in the test distribution for the class and interpret
your results.
(c) Compare and contrast the norm-referenced and criterion-referenced interpretations of
mathematics test scores.