Development of General Physics 1 Proficiency Test

Development of General Physics 1 Achievement Test (GP1-AT) in Force, Energy and Motion
for Grade 12 STEM Students
Edwin C. Barba, Jr.*

*Post-graduate student, College of Education, University of the Philippines; Chairperson, STEM-HUMMS Area,
Senior High School Division, Central Colleges of the Philippines, QuezonCity
ABSTRACT
The purpose of this study is to develop an achievement test which will assess grade 12 STEM students’
knowledge and understanding of topics in General Physic 1, particularly force, energy and motion. For
this purpose, a 100-item multiple choice questions were created and pilot tested to 300 grade 12 STEM
students. Item and distractor analysis was carried out to analyze the quality of items and distractors.
Internal consistency reliability was also investigated to check the internal consistency and reliability of
the items and the test. Results showed that the mean difficulty index of .51 which suggests that the test
has an over-all desirable difficulty. It was also shown that 7% of the test is easy, 63% are desirable and
26% are difficult. The mean discrimination of the test is .35 which implies that the test has good
discriminating power. However, 47% of the items have very good discriminating power and 15% have
good discriminating power. Distractor analysis of the test showed 82% of the items have 100%
distractor efficiency and only 7.7% of the total 300 distractors were found to be non-functional.
Reliability analysis shows that the Cronbach’s alpha of the test is .904 which means that the initial form
of the test was highly reliable and has excellent internal consistency.
Keywords: Achievement Test, General Physics 1,Multiple Choice Questions, Item Analysis Difficulty
Index, Discrimination Index, Distractor Efficiency,
INTRODUCTION
The implementation of the Senior High School program under the new K-12 Curriculum has
brought a lot of changes in the educational landscape of the Philippines. A new program entails new
courses, innovative teaching strategies and most especially new ways of assessing and evaluating students’
learning. Today, student assessment and evaluation in the K-12 program is divided into three components:
written works, performance task and quarterly examination. Each component’s weight varies across
different subjects but the methods and principle remains the same.
Though there are many ways to measure students’ learning under the new curriculum, the use of
pen-and-paper tests remain to be one of the most common way teachers’ assess and evaluate students.
Perhaps, multiple choice tests is the most extensively-used format in the assessment of students today.
National standardized test such as the National Achievement Test (NAT) and the National Career
Assessment Exam (NCAE) and licensure examinations in the Philippines remains to be administered in a
multiple choice test format. This shows that multiple choice remains to be relevant in today’s educational
landscape.
The challenge to the teachers now, particularly those teaching new courses in the Senior High
School Program, is to develop a pool of valid and reliable pool of multiple choice questions that can be
used in classroom assessment as well as in large-scale testing. The need for any test to be valid and
reliable is due to the fact that decisions are usually based on the results of the test. Validity is the extent
or degree to which the test can measure the qualities, skills, abilities, traits, or information that it was
designed to measure; while reliability is the extent to which it can consistently make these measurements.
(Nwadinigwe and Naibi, 2013). Multiple choice tests are a commonly preferred format because they have
been found to give valid and reliable results.
This purpose of this study is to develop a reliable and valid achievement test in General Physics 1,
particularly in topics involving force, energy and motion for grade 12 STEM students. General Physics 1
is one of the new courses implemented in the STEM track of the Senior High School Program and the
development of an achievement test will surely help teachers and eventually add to a pool of quality
multiple choice items that may be used not only for classroom assessment but for large-scale testing as
well.
Page 2 of 16
METHODS
Research Design
This study utilizes the descriptive research design. This research design was used to describe the
quality of the items and the test as a whole using different statistical analyses such as item analysis,
distractor analysis as well as reliability analysis using Cronbach’s alpha.
Participants
This study uses the purposive sampling design in which participants are chose based on the criteria
that they must be grade 12 students, enrolled in the STEM track of the Senior High School Program of
their school and have taken General Physics 1 or at least the first half of the said course.
The test was pilot tested to 300 grade 12 STEM students from two private institutions in Quezon
City. The test was administered for a total of two hours and thirty-minutes under the supervision of the
author and a teacher. The students were asked to write their answers on a separate sheet of paper. After
the test administration, the test questionnaire and the answer sheets were collected from the students.
Test Description
The General Physics 1 Achievement Test in Force, Energy Motion (GP1-AT) is a 100-item,
multiple choice test developed to measure the knowledge and understanding of students in the General
Physics 1, particularly in the chapter that deals with motion, force and energy. The researcher decided to
utilize questions with four options (1 key, 3 distractors). The researcher believed that using four options
will minimize the chance of students guessing the correct answer. The questions were developed based
on the content of the Senior High School Curriculum Guide in General Physics 1 issued by the Department
of Education last December, 2013.
Page 3 of 16
Test Development Procedure
The GP1-AT was developed with the intention of measuring the knowledge and understanding of
grade 12 STEM students in physics topics concerning force, energy and motion. The researcher identified
the content of the test based on the topics in the Senior High School Curriculum Guide for General Physics
1. Table 1 below shows a simplified version of the table of specification of the test. (A detailed table of
specification may be found at the Appendix.)
Table 1. GP1-AT Table of specifications

Thinking Skills
Analyzing &
Topics Remembering Understanding Applying Evaluating Total
Concept of Force 4 4 5 3 16
Newton’s Laws of
Motion 5 8 10 5 28
Kinematics of
Motion 6 6 14 3 29
Work, Power and
Energy 6 6 10 5 27
Total 21 24 39 16 100
The initial GP1-AT covered topics namely Concept of Force, Newton’s Laws of Motion,
Kinematics of Motion and Work, Power and Energy. There are 16 items (16%) for Concept of Force, 28
items (28%) for Newton’s Laws of Motion, 29 items (29%) for Kinematics of Motion and 27 items (27%)
for Work, Power and Energy. The questions in GP1-AT are designed to measure students’ thinking
abilities specifically Remembering, Understanding, Applying, Analyzing and Evaluating. There are 21
items (21%) for Remembering, 24 items (24%) for Understanding, 39 items (39%) for Applying and 16
items (16%) for Analyzing and Understanding.
It is worth noting that the initial form GP1-AT was not validated by experts in physics education
due to time constraint. The researcher depended on his own knowledge and experience in writing and
creating the multiple choice items.
Page 4 of 16
Test Statistics
Item analysis and distractor analysis were performed to determine how well the test and the
individual items contributed to the scores of the participants and to further improved the quality of the
individual items and the test as a whole. The internal consistency reliability of the test was also
investigated to see how well the items on a test measure the same construct or idea.
Item analysis is the systematic evaluation of the effectiveness of the individual items on a test.
(Brown, 1996). In this study, two statistical indices – difficulty index, p and discrimination index, DI –
were used.
Difficulty index or p value, is an inverse index – the lower the value, the more difficult the item.
The formula for p value, used in the study is
𝑈𝐺 + 𝐿𝐺
𝑝=
2
where 𝑈𝐺 is the proportion of students who got the correct answer in the upper group and 𝐿𝐺 is the
proportion of students who got the correct answer in the lower group. Table 2 below shows how the
difficulty index or p value can be interpreted. An inspection of the item difficulty levels can reveal
problems with the test and even in the instruction.
Table 2. Interpretation of difficulty index or p value

Difficulty index or p value Interpretation
.86 – 1.00 Very Easy item
.71 – .85 Easy item
.40 – .70 Desirable item
.15 – .39 Difficult item
.01 – .14 Very difficult item
Page 5 of 16
Discrimination index or DI indicates the degree to which an item separates the students who
performed well from those who performed poorly. (Brown, 1996). The formula for the discrimination
index or DI used in this study is
𝐷𝐼 = 𝑈𝐺 − 𝐿𝐺
where 𝑈𝐺 is the proportion of students who got the correct answer in the upper group and 𝐿𝐺 is the
proportion of students who got the correct answer in the lower group. A negative DI value could mean
problems in the item or in the instruction. Table 3 below shows how the discrimination index or DI is
interpreted.
Table 3. Interpretation of discrimination index of DI

Discrimination index or DI Interpretation
.40 and above Very Good item
.30 – .39 Good item
.20 – .29 Needs improvement item
.19 and below Poor item
Distractor analysis on the other hand, was used to determine the degree to which the distractors
are attracting students who do not know the correct answer. (Brown, 1996). In this research, distractor
efficiency, DE was used to look at the performance of the distractors of each item in the GP1-AT. Non-
functional distractors (NDF) are options that are selected by less than 5% of the participants while
functional distractors is the option selected by 5% or more participants. (Mukherjee and Lahiri, 2015)
Distractor efficiency, DE is determined for each item on the basis of number of NFD’s and it ranges from
0 to 100%. If an item has three, or two or one or no NFD’s then DE will be 0, 33.3%, 66.6% and 100%
respectively. NFD’s must be revised, removed or replaced with more plausible distractor.
Page 6 of 16
The internal consistency of the test was measured using the Cronbach’s alpha. Cronbach’s alpha
measures the internal consistency of a group of items by measuring the homogeneity of the group of items.
(BrckaLorenz, Chiang, & Nelson Laird, 2013). Cronbach’s alpha ranges from 0 to 1.00, with values close
to 1.00 indicating high consistency. High-stakes standardized tests should have a Cronbach’s alpha of at
least .90 while low-stakes standardized tests should have an alpha of at least .80 or .85. For classroom
exam, it is desirable to have a Cronbach’s alpha of .70 or higher. (Wells & Wollack, 2003) To measure
the internal consistency of GP1-AT Cronbach’s alpha was calculated using IBM SPSS 20 software
package.
RESULTS
The initial form of GP1-AT was pilot tested to 300 grade 12 STEM students from two private
institutions in Quezon City. The 100-item test was administered for a total of two hours and thirty minutes.
Answer sheets were collected after the test and evaluated by the researcher. Table 4 below shows the
result of the pilot test.
Table 4. Results of the GP1-AT pilot test

Characteristics Value
Mean raw score 48.95
Std. Error of Mean .825
Std. Deviation 14.296
Skewness .505
Std. Error of Skewness .141
Kurtosis -.331
Std. Error of Kurtosis .281
The highest and lowest scores in the pilot test is 91 and 23, respectively. The mean score of the
participants in the pilot test is 48.95 with standard deviation of 14.296. The score distribution of 300
participants is shown to be positively skewed and platykurtic. This shows that most of the participants
scored low on the test.
Page 7 of 16
Item Analysis
The raw scores of the participants are sorted from highest to lowest to identify the upper group and
lower group. The top 80 participants (27%) are designated as the upper group (UG) and the bottom 80
participants (27%) are designated as the lower group (LG). The response of the upper and lower groups
are encoded as 1 for correct response and 0 for wrong response for each item of the test. The proportion
of the students who got the correct answer in each group is computed by dividing the number of correct
response in each group divided by the number of participants in each group. The difficulty index and
discrimination index of each item were computed and interpreted. Table 5 below shows the results of the
item analysis.
Table 5. Results of the item analysis

Correct Response Proportion
p Interpretation DI Interpretation
Item UG LG UG LG
Item1 70 22 0.88 0.28 0.58 Desirable 0.6 Very Good
Item2 79 60 0.99 0.75 0.87 Very Easy 0.24 Needs Improvement
Item3 76 38 0.95 0.48 0.72 Easy 0.47 Very Good
Item5 42 6 0.53 0.08 0.31 Difficult 0.45 Very Good
Item6 21 4 0.26 0.05 0.16 Difficult 0.21 Needs Improvement
Item7 16 12 0.2 0.15 0.18 Difficult 0.05 Poor
Item8 68 41 0.85 0.51 0.68 Desirable 0.34 Good
Item16 16 19 0.2 0.24 0.22 Difficult -0.04 Poor
Item21 60 41 0.75 0.51 0.63 Desirable 0.24 Needs Improvement
Page 8 of 16
Item27 73 50 0.91 0.63 0.77 Easy 0.28 Needs Improvement
Item48 48 9 0.6 0.11 0.36 Difficult 0.49 Very Good
Item55 34 7 0.43 0.09 0.26 Difficult 0.34 Good
Item59 70 46 0.88 0.58 0.73 Easy 0.3 Good
Item62 36 34 0.45 0.43 0.44 Desirable 0.02 Poor
Item63 44 29 0.55 0.36 0.46 Desirable 0.19 Poor
Page 9 of 16
Item72 40 46 0.5 0.58 0.54 Desirable -0.08 Poor
Item73 66 47 0.83 0.59 0.71 Easy 0.24 Needs Improvement
Item75 42 12 0.53 0.15 0.34 Difficult 0.38 Good
Item76 79 69 0.99 0.86 0.93 Very Easy 0.13 Poor
Item77 4 5 0.05 0.06 0.06 Very Difficult -0.01 Poor
Item84 80 67 1 0.84 0.92 Very Easy 0.16 Poor
The mean difficulty index of the test is .51 and the mean discrimination index is .35. In terms of
difficulty index, 3 items (3%) are very easy, 7 items (7%) are easy, 63 (63%) are desirable, 26 items (26
%) are difficult and 1 item (1%) is very difficult. In terms of discrimination index, 47 items (47%) have
very good discrimination power, 15 items (15%) are good, 17 items (17%) needs improvement and 21
items (21%) are poor in discriminating students. Table 6 below shows the distribution of items relative
to their difficulty index and discrimination index.
Page 10 of 16
Table 6. Distribution of items relative to difficulty and discrimination indices
p value
Very Easy Easy Desirable Difficult Very Difficult
DI (.86 – 1.0) (.71 – .85) (.40 – .70) (.15 – .39) (.01 – .14) Total
Very good
4 40 2 1 46
(≥. 𝟒𝟎)
Good
1 13 2 16
(. 𝟑𝟎−. 𝟑𝟗)
NI*
1 2 7 4 14
(. 𝟐𝟎−. 𝟐𝟗)
Poor
2 3 18 24
(≤. 𝟏𝟗)
Total 3 7 63 26 1 100
*NI – Needs improvement
The shaded area in Table 5 shows the number of items that have difficulty index that ranges from
.71 to .39 and discrimination index from .30 and above. There are 62 items (62%) items that fall within
these categories. These items performed well in the pilot test and are usually considered to be good to be
used in classroom tests. There are 38 items (38%) that are outside the shaded area which may be changed,
revised or rejected from the test.
Distractor analysis were performed by identifying the non-functional distractors (NFD) –
distractors that were chosen by less than 5% of the population – and computing the distractor efficiency
of each item. Table 7 shows the distribution of items based on distractor efficiency DE.
Table 7. Distribution of items based on distractor efficiency

Distractor Efficiency (DE)
0 NFD 1 NFD 2 NFD’s 3 NFD’s
(100% DE) (66.6% DE) (33.3% DE) (0 DE)
Number of Items 82 14 3 1
Percentage 82% 14% 3% 1%
Table 6 shows that 82 items (82%) have distractor efficiency of 100%, and only 1 item (1%) has
a distractor efficiency of 0 percent. Out of 300 distractors, 23 distractors (7.7%) are non-functional. Items
with zero non-functional distractors have a mean difficulty index of 0.49 and mean discrimination index
of 0.36 while items with three non-functional distractor have a mean difficulty index of 0.93 and mean
discrimination of 0.13.
Page 11 of 16
The researcher have observed that an increase in the distractor efficiency results to an increase in
difficulty index and a decrease in discrimination index. This relationship is shown in table 8 below.
Table 8. Non-functioning distractors (NFD’s) and mean difficulty and discrimination indices
Distractor Efficiency (DE)
0 NFD 1 NFD 2 NFD’s 3 NFD’s
(100% DE) (66.6% DE) (33.3% DE) (0 DE)
Number of Items 82 14 3 1
Percentage 82% 14% 3% 1%
Mean difficulty index .49 .53 .81 .93
Mean discrimination index .36 .35 .21 .13
Test Reliability
The internal consistency of the initial form of the GP1-AT was investigated using Cronbach’s
alpha using the IBM SPSS 20 software package. The initial Cronbach’s alpha of the test is .904 which is
considered to be excellent. Table 9 shows the results of the initial reliability analysis.
Table 9. Reliability analysis

Cronbach's Alpha Based on
Cronbach's Alpha Standardized Items N of Items
.904 .901 100
DISCUSSION
Single-response multiple choice test is a usual choice for classroom assessment and standardized
testing because of its objectivity and ease of scoring. However, constructing valid and reliable multiple
choice tests may prove to be a difficult task. In order ensure the efficiency of multiple choice questions,
varied statistical analysis may be utilized. The difficulty and discrimination indices are among the tools
used to check whether the multiple choice questions are well constructed and to further analyze the quality
of distractors, distractor efficiency may be used. (Mukherjee and Lahiri, 2015) Reliability analysis, such
as internal consistency reliability maybe used to investigate if the test is measuring the intended construct
or content.
Page 12 of 16
In the present study, the researcher aimed to develop a multiple choice test to evaluate the
achievement of grade 12 STEM students in selected topics in General Physics 1. The initial form of the
test has 100 multiple choice items with four options. The test was pilot tested to 300 grade 12 STEM
students from two private institutions in Quezon City.
The mean difficulty index or p value of the test was found to be .51 which means that most of the
items are within the desirable difficulty. (Wiersma & Jurs, 1990, Scannel & Tracy, 1975). Majority of the
items (63%) were found to have a desirable difficulty, 26 items (26%) were difficult and 7 items (7%)
were easy.
The mean discrimination index or DI was computed to be .35 which means that the initial form of
the test have good discriminating power. (Wiersma & Jurs, 1990, Scannel & Tracy, 1975). However, 47
items (47%) were found to have a discrimination index or DI of .40 and above which is means that these
items have very good discriminating power. (Wiersma & Jurs, 1990, Scannel & Tracy, 1975). There are
38 items (38%) whose discrimination is poor and needs to be improved. Three items (3%) have a negative
discrimination index which means that there are more students in the lower group who answered the items
correctly than in the upper group. A negative discrimination index could be caused by ambiguous
questions, wrong key or poor general preparation of the students. ((Mukherjee & Lahiri, 2015). Items
with discrimination index or DI of .20 and below must be replaced, revised or rejected.
Distractor analysis showed that out of 300 distractors, 23 (7.7%) are non-functional. Majority of
the items (82%) have 100% distractor efficiency which means that all distractors are functioning while
there is only one item (1%) who has three non-functioning distractors. Non-functioning distractors must
be changed or revised to improve the efficiency of the item in discriminating lower-group students from
upper-group students.
Page 13 of 16
Items with 100% DE have a mean p value of .49 and a mean DI of .36 while items with 0 DE
have a mean p-value of .93 and a mean DI of .13. A relationship between DE, p value and DI can be
concluded from the results. As the distractor efficiency or DE increases, the mean difficulty index or p
value increases and the mean discrimination index or DI decreases. Items with high distractor efficiency
is shown to have better discrimination index. The present study also shows that the as the non-functional
distractors in an item increases the more likely that the students will get the answer correct thus its
discriminating power decreases. This shows that the quality of distractors can affect the difficulty and
discrimination indices of the items.
The internal consistency reliability of the test was investigated using Cronbach’s alpha. The
Cronbach’s alpha of the initial form of the GP1-AT after the pilot test was found to be .904 which suggests
high reliability and excellent internal consistency. Though not much data is available, one rule of thumb
states that values equal to or greater than .7 are acceptable.
RECOMMENDATIONS
The initial form of the GP1-AT in Force, Energy and Motion is shown to have a mean desirable
difficulty index and a mean discrimination index that is considered to be good with high reliability
excellent internal consistency. Though many items performed well after the pilot test, it is highly
recommended by the author to have the test be validated by experts in physics education. Due to time
constraint, the items did not undergo a face and content validity. This will surely improve the quality of
distractors and the items as well.
Distractor analysis of the test shows that there are non-functioning distractors. The present study
has shown that distractor efficiency has an effect on the difficulty and discrimination indices of the items.
These distractors must be changed or revised in order to improve the discriminating power of the items.
Page 14 of 16
The initial form of the GP1-AT has items with four options usually placed after the stem. Though
the arrangement of options or the case of letter options (upper case or lower case) are shown to have no
effect in students’ performance (Bendulo et al., 2017) the number of options should be considered.
Traditionally, four or five options are used to increase the reliability of the test (Thorndike & Thorndike-
Christ, 2010; Hopkins, 1998; Mehrens & Lehman, 1991) however there are increasing study that endorses
the use of three options (Haladyna & Dawning, 1993; Haladyna, et al., 2002; Costin, 1970; Nwadinigwe
& Naibi, 2013). The researcher recommends for future researcher to look into the likelihood of using
three options for the GP1-AT.
REFERENCES
Bendulo, H. O., Tibus, E. D., Bande, R. A., Oyzon, V. Q., Milla, N. E., & Macalinao, M. L. (2017).
Format of options in multiple choice test vis-a-vis test performance. International Journal of
Evaluation and Research in Education, 157-163.
BrckaLorenz, A., Chiang, Y., & Laird, N. (2013). Internal Consistency. Retrieved from FSSE
Psychometric Portfolio: fsse.indiana.edu.
Brown, J. (1996). Testing in the language programs. Upper Saddle River, New Jersey: Prentice Hall
Regents.
Costin, F. (1970). The optimal number of alternatives in multiple-choice achievement tests: Some
empirical evidence of a mathematical proof. Educational and Psychological Measurement, 353-
358.
Haladyna , T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple- choice item-
writing for classroom assessment. Applied Measurement in Education, 309-334.
Haladyna, T. M., & Downing, S. M. (1993). How many options is enough for a multiple-choice item?
Educational and Psychological Measurement, 999-1010.
Hopkins, K. D. (1998). Educational and Psychological Measurement and Evaluation (8th ed.).
Needham Heights, MA: Allyn and Bacon.
Mehrens, W. A., & Lehman, I. J. (1991). Measurement and Evaluation in Education and Psychology
(4th ed.). Forthworth, TX: Harcourt Brace Jovanovich.
Mukherjee, P., & Lahiri, S. (2015). Analysis of multiple choice questions (MCQs):Item and test
statistics from an assessment in a medical college of Kolkata, West Bengal. IOSR Journal of
Dental and Medical Sciences, 47-52.
Page 15 of 16
Nwadinigwe, P. I., & Naibi, L. (2013). The number of options in a multiple-choice test item and the
psychometric properties. Journal of Education and Practice, 189-196.
Scannel, D. P., & Tracy, D. B. (1975). Testing and measurement in the classroom. Boston: Houghton
Mifflin Co.
Thorndike, R. M., & Thorndike-Christ, T. (2010). Measurement and Evaluation in Psychology and
Education (8th ed.). Upper Saddle River, NJ: Pearson/Merril Prentice Hall.
Wells, C. A., & Wollack, J. A. (2003, November). An instructor's guide to understanding test reliability.
University of Wisconsin, Testing and Evaluation Services, Madison, Wisonsin.
Wiersma, W., & Jurs, S. G. (1990). Educational Measurment and Testing. USA: Ally and Bacon.
Page 16 of 16

Development of General Physics 1 Proficiency Test

Uploaded by

Copyright:

Available Formats

Development of General Physics 1 Proficiency Test

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Development of General Physics 1 Proficiency Test

Uploaded by

Copyright:

Available Formats

Development of General Physics 1 Achievement Test (GP1-AT) in Force, Energy and Motion

for Grade 12 STEM Students

Edwin C. Barba, Jr.*

been found to give valid and reliable results.

distractor analysis as well as reliability analysis using Cronbach’s alpha.

of Education last December, 2013.

specification may be found at the Appendix.)

Table 1. GP1-AT Table of specifications

items (16%) for Analyzing and Understanding.

creating the multiple choice items.

The formula for p value, used in the study is

problems with the test and even in the instruction.

Table 2. Interpretation of difficulty index or p value

index or DI used in this study is

Table 3. Interpretation of discrimination index of DI

result of the pilot test.

Table 4. Results of the GP1-AT pilot test

scored low on the test.

Table 5. Results of the item analysis

to their difficulty index and discrimination index.

revised or rejected from the test.

Distractor analysis were performed by identifying the non-functional distractors (NFD) –

Table 7. Distribution of items based on distractor efficiency

Table 9. Reliability analysis

students from two private institutions in Quezon City.

discrimination indices of the items.

states that values equal to or greater than .7 are acceptable.

distractors and the items as well.

three options for the GP1-AT.

You might also like