

Post Exam Item Analysis: Implications for Intervention

Abstract

Post-exam item analysis enables teachers to reduce bias in student achievement assessment and to improve their instruction. Difficulty indices, discrimination power and distracter efficiency are the measures most commonly investigated in item analysis. This study investigated the difficulty and discrimination indices, distracter efficiency, whole-test reliability and construct defects of a summative test for a freshman common course at Gondar CTE. In this study, 176 exam papers were analyzed in terms of difficulty index, point-biserial correlation and distracter efficiency. Internal consistency reliability and construct defects such as meaningless stems, punctuation errors and inconsistencies in option formats were also examined. Results revealed that the summative test as a whole had a moderate difficulty level (0.56 ± 0.20) and good distracter efficiency (85.71% ± 29%). However, the exam was poor in terms of discrimination power (0.16 ± 0.28) and internal consistency reliability (KR-20 = 0.58). Only one item had good discrimination power and one further item excellent discrimination. About 41.9% of the items were either too easy or too difficult. Inconsistent option formats, inappropriate options, punctuation errors and meaningless stems were also observed. Thus, future test development interventions should give due emphasis to item reliability, discrimination coefficients and item construct defects.

Keywords: item analysis; difficulty index; discrimination coefficient; distracter efficiency

1. Background
Education quality in Ethiopia appears to have been compromised by the rapid expansion of higher education institutions in the country (4). According to Arega Yirdaw (2016), problems in the teaching-learning process were among the key factors determining education quality in private higher institutions in Ethiopia. Within the teaching-learning process, effective assessment tools have to be used to measure the desired learning outcomes.

It is advisable to use appropriate instruments for assessing students at higher institutions (5). The rationale for employing effective assessment tools is that assessment of students' achievement is an integral part of the teaching-learning process (2). Assessments should track each student's performance in a given course. With this in mind, instructors at colleges and universities must be aware of the quality and reliability of their exams; otherwise, the final results may lead to biased evaluation and certification (5). Instructors usually receive little or no training on the quality of assessment tools: training typically focuses on large-scale test administration and standardized test score interpretation rather than on test construction strategies or item-writing rules (2). The quality and reliability of assessments can be improved by delivering training on post-exam item analysis and item-writing rules (17).

Item analysis involves collecting, summarizing and using information from students' responses to assess the quality of test items (21). It allows teachers to identify items that are too difficult or too easy, items that do not discriminate between high- and low-ability students, and items with implausible distracters (2, 3). By analyzing items, teachers can remove overly easy or difficult items, improve distracter efficiency and exclude non-discriminating items from future test banks. Item analysis also helps teachers examine misconceptions or content that students find difficult to understand and adjust the way they teach (2).

According to reports from Ethiopia, there is a serious problem with the quality of education (4, 19). Student achievement grading in Ethiopia is carried out by administering teacher-made classroom tests and national examinations (20). It is believed that assessing students' performance solely with objective items at the school and national levels might have contributed negatively to education quality (20). Therefore, objective test items need to meet psychometric standards in order to measure outcomes as per the course objectives. Researchers have suggested that objective examination results can be analyzed to improve the validity and reliability of assessments (17). The objective of this research was therefore to analyze the post-examination results of a summative exam in a basic natural science course at Gondar CTE. Based on the results, areas for intervention in future test development were recommended.

2. Methods

2.1 Research design

The validity and reliability of a summative test in a freshman common course entitled 'Basic Natural Science I' were assessed using a descriptive analytical method. Of the two approaches to item analysis, classical test theory (CTT) and item response theory (IRT), CTT was employed because of its simplicity and the lack of software applications for IRT. The psychometric parameters considered in this study were difficulty indices, point-biserial correlations, internal consistency reliability, construct (face) validity and distracter efficiency.

2.2 Study population

All regular first-year diploma students at Gondar CTE during the 2017/18 academic year were taken as the study population.


2.3 Sample size and sampling technique

Assuming homogeneity of the population, 176 of 525 students (33.5%) were selected using stratified random sampling. Stratification was employed to include representative samples from each department. The sampled exam papers were collected from science instructors in the Department of Natural Science, Gondar CTE. Demographic data for the sampled students were obtained from the college registrar office.

2.4 Instrument and scoring

The summative test administered during the 2018 academic year in the course 'Basic Natural Science I' was used as the research instrument. The course was selected, first, because the summative test was developed by instructors with biology, chemistry and physics backgrounds, so the findings would be applicable across the Department of Natural Science. Second, it is a compulsory common course given to a large population of students in all new modality streams. Furthermore, the course is a prerequisite for most other courses within the integrated natural science stream, so an effective assessment tool prepared by the department would be particularly valuable. The summative test contained 31 objective items: 21 multiple-choice, 7 true/false and 3 matching questions. All 31 items were considered for analysis. For item analysis, correct responses were coded as 1 and wrong responses as 0. The maximum possible score was 31 and the minimum zero, with no negative marking.
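As an illustration of this scoring scheme, the sketch below (Python, with a made-up answer key and a handful of made-up responses rather than the actual exam data; the study itself used SPSS and Excel) builds the dichotomous 0/1 response matrix that the subsequent analyses operate on.

```python
import numpy as np

# Hypothetical data: 4 students x 5 items, letter responses (illustration only;
# the actual exam had 176 students and 31 items).
answer_key = ["A", "C", "B", "D", "A"]
responses = [
    ["A", "C", "B", "A", "A"],
    ["A", "B", "B", "D", "C"],
    ["B", "C", "A", "D", "A"],
    ["A", "C", "B", "D", "B"],
]

# Dichotomous scoring: 1 for a correct response, 0 for an incorrect one.
X = np.array([[1 if resp == key else 0
               for resp, key in zip(student, answer_key)]
              for student in responses])

total_scores = X.sum(axis=1)  # each student's raw score (max = number of items)
print(X)
print(total_scores)
```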

2.5 Construct defects (face validity)

The exam paper was checked for the following construct defects:

- Typing and punctuation errors
- Inappropriate or incomplete stems
- Inappropriate option/alternative formats for MCQs

2.6 Internal consistency reliability

The internal consistency reliability of the summative test in the Basic Natural Science I course was investigated to determine the overall reliability of the test. The two most commonly used measures of reliability are Cronbach's alpha (α) and the Kuder-Richardson method (KR-20). KR-20 measures the reliability of tests with dichotomous items (17) and was therefore used to estimate test reliability in this study. The objective test items were scored dichotomously as right or wrong (17): every correct response was coded as 1 and every wrong response as 0. The value commonly accepted in the literature for test reliability is α ≥ 0.7, so a KR-20 value of 0.7 or greater was considered reliable in this study.
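A minimal sketch of how KR-20 can be computed from such a 0/1 response matrix follows; the formula is the standard Kuder-Richardson 20, and the toy data are hypothetical, not the study's responses.

```python
import numpy as np

def kr20(X: np.ndarray) -> float:
    """Kuder-Richardson 20 for a students-by-items matrix of 0/1 scores.

    KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / var(total)),
    where k is the number of items, p_i the proportion answering item i
    correctly, q_i = 1 - p_i, and var(total) the variance of total scores.
    """
    n_students, k = X.shape
    p = X.mean(axis=0)                        # difficulty index of each item
    q = 1.0 - p
    total_var = X.sum(axis=1).var(ddof=0)     # variance of raw total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Toy example (hypothetical 0/1 responses, not the study data):
X = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
print(round(kr20(X), 3))
```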

2.7 Item difficulty index (p)

The item difficulty index is an appropriate choice for achievement tests when items are scored dichotomously. It can be calculated for true/false, multiple-choice and matching items. In this study, the difficulty index of every item was determined by dividing the number of respondents who answered the item correctly by the total number of students taking the test. The value of p was computed in Microsoft Excel 2007 using the formula below, and the average difficulty level was then determined.

p = (number of students answering the item correctly) / (total number of students taking the test)

where p is the difficulty index.


The value of p ranges from 0 to 1; the higher the value, the easier the item, and vice versa. The recommended range of difficulty is 0.3 – 0.7 (1, 6). Items with p-values below 0.3 or above 0.7 are considered too difficult or too easy, respectively (1).

Item difficulty index (p) Item evaluation

p > 0.7 Too easy

p = 0.3 – 0.7 Acceptable

p < 0.3 Too difficult

Source: Instructional Assessment Resources (IAR 2011), cited in (1)
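A minimal sketch of the difficulty-index computation and the classification above (thresholds taken from the table; the response matrix is again hypothetical):

```python
import numpy as np

def difficulty_indices(X: np.ndarray) -> np.ndarray:
    """Proportion of students answering each item correctly (columns = items)."""
    return X.mean(axis=0)

def classify_difficulty(p: float) -> str:
    # Thresholds from the IAR classification cited above.
    if p > 0.7:
        return "Too easy"
    if p < 0.3:
        return "Too difficult"
    return "Acceptable"

X = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
])
for i, p in enumerate(difficulty_indices(X), start=1):
    print(f"item {i}: p = {p:.2f} ({classify_difficulty(p)})")
```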

2.8 Discrimination coefficient (r)

The item discrimination index indicates how well a question differentiates between high-performing and low-performing students (17). It can be calculated by the extreme-group method, the point-biserial correlation coefficient (r), or other methods. The extreme-group method considers only 54% of the respondents (the top 27% and the bottom 27%), whereas the point-biserial correlation coefficient takes all respondents into account and also indicates the relationship between performance on a particular item and the total test score (12, 17). For these reasons, the point-biserial correlation coefficient was used in this study; it was computed using SPSS version 20. Its value ranges between -1 and 1, and a higher value indicates stronger discrimination power. The test items in this study were classified based on the standard depicted in the table below.

r value Quality Recommendation

≥ 0.4 Excellent Retain

0.3 – 0.39 Good Possibilities for improvement

0.2 – 0.29 Average Needs checking/review

0.0 – 0.19 Poor Discard or review in depth

< 0 Worst Definitely discard

Adapted from (7)
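The sketch below shows one way to obtain a point-biserial discrimination coefficient, computed as the Pearson correlation between each item's 0/1 column and the total score. Whether the item itself is excluded from the total (the 'corrected' variant) is not stated in this study, so the function exposes both options as an assumption; the data are hypothetical.

```python
import numpy as np

def point_biserial(X: np.ndarray, corrected: bool = False) -> np.ndarray:
    """Point-biserial correlation of each 0/1 item column with the total score.

    With corrected=True the item is removed from the total before correlating,
    which avoids inflating r for its own item (the paper does not state which
    variant its SPSS analysis reported).
    """
    totals = X.sum(axis=1)
    r = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        ref = totals - X[:, j] if corrected else totals
        r[j] = np.corrcoef(X[:, j], ref)[0, 1]
    return r

X = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
print(np.round(point_biserial(X), 3))
```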

2.9 Distracter efficiency (DE)

Distracters are the incorrect answer options in a multiple-choice question. Student performance in an exam is strongly influenced by the quality of the distracters, so it is necessary to determine their effectiveness for each MCQ. Distracter effectiveness indicates the percentage of students choosing a given option as the answer. It was calculated from the number of non-functional distracters (NFDs) per item, where an NFD is an incorrect option selected by fewer than 5% of students. The DE of an item was taken to be 0%, 33.3%, 66.7% or 100% if it had three, two, one or zero NFDs, respectively.
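The NFD and DE definitions above translate directly into a short calculation; the option counts below are hypothetical and assume a four-option MCQ.

```python
def distracter_efficiency(option_counts: dict, key: str, n_students: int):
    """Return (number of non-functional distracters, DE%) for one MCQ.

    option_counts maps each option label to how many students chose it;
    an NFD is a wrong option chosen by fewer than 5% of students.
    """
    nfd = sum(1 for opt, count in option_counts.items()
              if opt != key and count < 0.05 * n_students)
    n_distracters = len(option_counts) - 1
    de = 100.0 * (n_distracters - nfd) / n_distracters
    return nfd, de

# Hypothetical item: key "B", 176 students (5% threshold = 8.8 students).
counts = {"A": 30, "B": 120, "C": 22, "D": 4}   # "D" is non-functional
print(distracter_efficiency(counts, key="B", n_students=176))  # one NFD -> DE = 66.7%
```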

3. Data analysis

The Statistical Package for the Social Sciences version 20 (SPSS-20) and Microsoft Excel 2007 were used for storing and analyzing the data. Descriptive statistics (frequencies, demographic information, means and standard deviations) and point-biserial correlations were determined using SPSS-20. Difficulty indices and percentages were computed in Excel. Face validity was described qualitatively. Figures and tables were used to display the results.

4. Results

Demographic data: Table 1 shows the demographic characteristics of the students whose exam papers were analyzed. Forty-two percent of the sample were female and fifty-eight percent male. More than half (58.5%) of the participants were in the age group 20-25 years; 38.1% were under 20 and only 3.4% were 26 or above. The mean age was 20.26 ± 2.13 years.

Table 1: Age and sex characteristics of respondents

Sex Frequency Percent
Male 102 58.0
Female 74 42.0
Total 176 100.0

Age in years Frequency Percent
Under 20 67 38.1
20 - 25 103 58.5
Above 25 6 3.4
Total 176 100.0

Test statistics: Table 2 shows that students' scores ranged from 5 to 27 (out of 31) with a mean of 17.23 ± 3.85. There was no statistically significant difference in mean score between males and females (p = 0.311, df = 174). The histogram in Figure 1 shows that the total score was approximately normally distributed in both sexes.

Table 2: Descriptive statistics of total test scores

Mean 17.23
Std. error of mean 0.29
Median 17.00
Mode 15.00 a
Std. deviation 3.85
Range 22.00
Minimum 5.00
Maximum 27.00
a. Multiple modes exist; the smallest value is shown.


Figure 1: Distribution of raw test scores in males and females

Construct defects (face validity): The face validity check revealed the following findings.

- Punctuation errors (missing full stop or question mark) in questions 5, 6, 7, 11 and 15.
- Inconsistent option formats (the option format changed from question 26 to 31).
- Inappropriate stems (meaningless or incomplete) in questions 12, 13, 14, 27 and 28.
- Inappropriate options/alternatives ('all of the above', 'A and B') in questions 12, 13, 15, 22, 27 and 28.
- No negatively phrased stems ('not' or 'except').
- An absolute term ('most') in question 4.


Internal consistency reliability: The internal consistency reliability was used to evaluate the performance of the test as a whole. The computed KR-20 value of the test was 0.58, which is below the value recommended in much of the literature (≥ 0.7).

Difficulty index: Appendix A shows the distribution of difficulty indices (p) for each item. Item q.19 has the highest p-value (0.82) and q.18 the lowest (0.15). Eighteen items (58.1%) have a moderate difficulty level (p between 0.3 and 0.7), and twelve items (38.7%) have excellent difficulty levels (p between 0.4 and 0.6). Thirteen items (41.9%) lie outside the moderate range: three items (9.6%) were too difficult (p < 0.3) and ten items (32.3%) too easy (p > 0.7). The mean difficulty index was 0.56 (SD 0.20). A summary is given in Table 3.

Table 3: Difficulty index summary

p-value Interpretation Number of items Action

< 0.3 Difficult 3 (9.6%) Discard

0.3 – 0.7 Moderate 18 (58.1%) Accept

> 0.7 Easy 10 (32.3%) Reject

Item difficulty index mean 0.56, SD 0.20

Discrimination coefficients: Appendix B shows the point-biserial correlation coefficient for each item. Three items (q.2, q.18 and q.20) have negative discrimination power (worst). Only a single item (q.31) has excellent discrimination power (r ≥ 0.4). Seventeen items (54.8%) were categorized as poor (r < 0.2) and nine items (29%) as average (r = 0.2 – 0.29) (Table 4). Question 7 is ideal in terms of difficulty level (p = 0.54, Appendix A) and good in terms of discrimination (r = 0.39, Appendix B). The mean discrimination power was 0.16 (SD 0.28); the mean r value generally considered acceptable in the literature is ≥ 0.4.

Table 4: Distribution of items by level of discrimination

Point-biserial correlation (r) No. of items % Action

Excellent (r ≥ 0.40) 1 3.23% Retain

Good (r = 0.30 – 0.39) 1 3.23% Possibilities for improvement

Average (r = 0.20 – 0.29) 9 29.03% Usually needs improvement

Poor (r = 0 – 0.19) 17 54.84% Discard or review in depth

Worst (r < 0) 3 9.67% Definitely discard

Item discrimination coefficient mean 0.16, SD 0.28.

Table 5 displays the combination of the two indices, item difficulty and discrimination. According to this table, two items (q.7 and q.31) have moderate p-values (p = 0.3 – 0.7) and good discrimination (r ≥ 0.3). However, no single item could be labeled excellent in both difficulty and discrimination (p = 0.4 – 0.6 and r ≥ 0.4). Easy items (p > 0.7) such as q.3, q.5, q.10 and q.24 have poor discrimination (r < 0.2). Furthermore, difficult items (p < 0.3) such as q.12, q.14 and q.18 have very low discriminating power (r < 0.2). The difficulty level of q.2 was ideal (p = 0.51) but its discrimination power was worst (r = -0.004). Ideal questions with p-values from 0.4 to 0.6 (q.6, q.16, q.21, q.23 and q.26) have poor r-values (< 0.2). There was no statistically significant correlation between the difficulty index and the discrimination coefficient (Pearson correlation = 0.201, two-tailed p = 0.279).


Table 5: Combination of item difficulty (p) and discrimination (r) indices

item p r item p r
q1 0.71 0.208 q17 0.64 0.017

q2 0.51 -0.004** q18 0.15* -0.046**

q3 0.74 0.082 q19 0.82 0.202

q4 0.67 0.211 q20 0.3 -0.02**

q5 0.78 0.16 q21 0.56 0.166

q6 0.44 0.166 q22 0.8 0.207

q7 0.54 0.393 q23 0.53 0.186

q8 0.71 0.12 q24 0.81* 0.143*

q9 0.35 0.013 q25 0.7 0.061

q10 0.78 0.186 q26 0.51 0.065

q11 0.6 0.253 q27 0.53 0.234

q12 0.19* 0.184* q28 0.75 0.159

q13 0.53 0.237 q29 0.74 0.162

q14 0.2* 0.025* q30 0.45 0.253

q15 0.38 0.203 q31 0.34 0.404

q16 0.47 0.173


Figure 2 shows a graphical representation of difficulty index against discrimination power. The scatter plot allows appropriate (valid and reliable) questions to be identified at the center of the graph and makes questions that are too easy or too difficult immediately visible. According to Figure 2, r increases until p approaches 0.4 and declines thereafter.


Figure 2: Scatter plot of difficulty and discrimination indices

Distracter efficiency: The distracter analysis shows that six items (28.57%; q.12, q.13, q.19, q.20, q.24 and q.28) contained a total of nine NFDs, each with a choice frequency of less than 5%. No other item had any NFDs (Appendix C). In addition, four items (q.12, q.14, q.18 and q.20) had six distracters selected by more students than the correct (keyed) answer. No item had three NFDs, but three items (q.12, q.13 and q.19) had two NFDs and three items (q.19, q.20 and q.28) had one NFD (Table 6). The overall mean DE was 85.71%, with a minimum of 33.3% and a maximum of 100%.


Table 6: Distracter analysis (DE) summary

Number of items 21
Total distracters 63
Functional distracters 54 (85.71%)
Non-functional distracters (NFDs) 9 (14.29%)
Items with 3 NFDs (DE = 0%) 0
Items with 2 NFDs (DE = 33.3%) 3
Items with 1 NFD (DE = 66.7%) 3
Items with 0 NFDs (DE = 100%) 15
Items with a distracter chosen more often than the key 4
Overall mean DE 85.71% ± 24.89%

Difficult items such as q.12, q.14 and q.18 have DE between 66.7% and 100%. A similar result was recorded for easy questions such as q.19, q.22 and q.24. Some items with poor and some with good p-values thus have similar DE values (Table 7). Only a single item (q.31) satisfies all three parameters of an ideal question (p > 0.3, r > 0.4 and DE = 100%; Table 8).

Table 7: Comparison of item difficulty with distracter efficiency for MCQs

Item p DE (%)
q11 0.6 100
q12 0.19* 66.7
q13 0.53 33.3
q14 0.2* 100
q15 0.38 100
q16 0.47 100
q17 0.64 100
q18 0.15* 100
q19 0.82 66.7
q20 0.3 100
q21 0.56 100
q22 0.8 100
q23 0.53 100
q24 0.81* 100
q25 0.7 100

q26 0.51 100


q27 0.53 100
q28 0.75 66.7
q29 0.74 100
q30 0.45 100
q31 0.34 100
Table 8: Comparison of p, r and DE

item p r DE

q11 0.6 0.253 100

q12 0.19* 0.184* 66.7

q13 0.53 0.237 33.3

q14 0.2* 0.025* 100

q15 0.38 0.203 100

q16 0.47 0.173 100

q17 0.64 0.017 100

q18 0.15* -0.046** 100

q19 0.82 0.202 66.7

q20 0.3 -0.02** 100

q21 0.56 0.166 100

q22 0.8 0.207 100

q23 0.53 0.186 100

q24 0.81* 0.143* 100

q25 0.7 0.061 100

q26 0.51 0.065 100

q27 0.53 0.234 100

q28 0.75 0.159 66.7

q29 0.74 0.162 100

q30 0.45 0.253 100

q31 0.34 0.404 100

5. Discussion

By analyzing summative assessments, it is possible to improve future test development techniques and to modify classroom instruction. With this intention, the current study conducted a post-exam item analysis based on psychometric standards and, from the findings, indicated areas for intervention.


The internal consistency reliability calculated for this summative test was 0.58. This value is a bit below the range expected in most standardized assessments (α ≥ 0.7). According to (8), a Cronbach's alpha of 0.71 was obtained in a standardized Italian case study. Reliability can be categorized as excellent if α > 0.9, very good if between 0.8 and 0.9, and good if between 0.6 and 0.7 (1). If the reliability value lies between 0.5 and 0.6, revision is required, and reliability below 0.5 is questionable (1). On this basis, the summative test administered requires revision because its KR-20 value (0.58) is less than 0.7. This also implies that college educators need to validate their assessment tools through item analysis. According to Fraenkel and Wallen, cited in (12), one should aim for a KR-20 reliability of 0.70 or above to obtain reliable test instruments.

According to Table 3, 58.1% (18) of the items in the summative test have average difficulty (p = 0.3 – 0.7). An item is considered good if its p-value lies in this moderate range (17). In this study, a little more than half of the exam items have moderate difficulty. Even though the mean difficulty level (0.56) is acceptable, it would be desirable to include more questions of average difficulty. Similar findings have been reported elsewhere (1, 6, 9).

Questions that are too easy or too difficult contribute little information about a student's ability (17). The data in this study showed that 32.3% of the test items were too easy (recommended: 10%-20%) and 9.6% were too difficult (recommended: 20%). Although it is advisable to include some easy and difficult items in a test (10), it would be better if the recommended limits were met; in this exam paper there were too many easy items and too few difficult ones. A difficult item could mean that the topic is hard for students to grasp (10, 11), that it was not taught well (10), that the item was mis-keyed (12), or that students were poorly prepared.

According to (12), the discriminatory power of individual items can be computed with a discrimination index, a biserial correlation coefficient, a point-biserial coefficient or a phi coefficient. In this study, the discrimination power of every item was calculated using the point-biserial coefficient. The results (Table 4) showed that only one item was rated 'excellent' (r ≥ 0.4) and one other reasonably good (r = 0.393); all other items in this summative test need revision or improvement (r < 0.3). A similar study reported in (3) found no single item with a discrimination index of 0.30 or greater. In contrast, another study found that 46.67% of items had good to excellent discrimination power (r ≥ 0.3) (15).

Large positive values of the point-biserial correlation are desirable, as they indicate that students who get an item right tend to obtain high scores on the overall test and vice versa (8). Items with negative and/or low discriminating power need to be reconsidered in subsequent test development. In this study, three items had negative discrimination. This could be because low-ability students guessed these items correctly while high-ability students, suspicious of a clue, were less successful (16). Items with negative discrimination decrease the validity of the test and should be removed from the pool of questions (10, 12, 13, 14, 15).

Difficulty and discrimination indices are often reciprocally related, although this is not always true: questions with high p-values tend to discriminate poorly, while questions with low p-values may discriminate well (17). This variation could result from students guessing the correct responses (12). The data (Table 5) suggest that guessing occurred in this study. According to (1), moderately difficult items should have maximal discriminative ability; the findings of this study contradict (1), which may again reflect some degree of guessing during test administration.

Distracter analysis was conducted to determine the relative usefulness of the distracters in each item. Seventeen items (81%) have no NFDs (DE = 100%), three items (q.12, q.19 and q.28) have one NFD (DE = 66.7%) and one item (q.13) has two NFDs (DE = 33.3%; Table 6). No item has three NFDs (DE = 0%). On the other hand, seven distracters (11%) (12-A; 14-A, B; 18-B, C; and 20-A) were selected by more students than the correct answer, which may indicate that these items were confusing (12). An overall mean DE of 92.1% (considered ideal/acceptable) was obtained in this study. A similar finding was reported by (10) for an internal microbiology examination in India.

Non-functional distracters make an item easier and reduce its discrimination (10). Question 31 (Table 8) has moderate difficulty and excellent discrimination power, probably owing to the absence of NFDs. However, this pattern does not hold for the other test items, probably because of random guessing or flaws in item writing (10).

5.1 Conclusion and Recommendations

Post-exam item analysis is a simple but effective method for assessing the validity and reliability of a test. It detects specific technical flaws and provides information for further test improvement. An item with average difficulty (p = 0.3 – 0.7), high discrimination (r ≥ 0.4) and a high DE value (> 70%) is considered ideal. In this study, the summative test as a whole had moderate difficulty (mean = 0.56) and good distracter efficiency (mean = 85.71%), but it discriminated poorly between high- and low-achieving students (mean r = 0.16). The test as a whole needs revision, as its reliability was not adequate (KR-20 = 0.58). Some flaws in item writing were also observed.

According to Xu and Liu (2009), cited in (1), teachers' knowledge of assessment and evaluation is not static but develops through a dynamic and ongoing process. It is therefore plausible to suggest that teachers and instructors should attend in-service seminars on test development. Since most of the summative tests constructed within the college are objective tests, item analysis is recommended for instructors at various points in their teaching careers. It is also suggested that a specific unit be made responsible for testing and for the analysis of items after exam administration.

Competing interests

The author declares no competing interests.

Acknowledgment

I would like to thank the lecturers of the Department of Natural Science, Gondar CTE, for providing the exam papers for this study. I would also like to extend my appreciation to Mr. Awoke Debebe for his assistance with data entry and his critical review of the manuscript.

References

1. Zia-ul-Islam, Usmani A. (2017). Psychometric analysis of anatomy MCQs in modular examination. Pak. J. Med. Sci. 33(5).
2. Siri A. and Freddano M. (2011). The use of item analysis for the improvement of objective examinations. Procedia - Social and Behavioral Sciences 29: 188–197.
3. Deshpande S. and Prajapati R.K. (2018). Item analysis of mid-trimester test paper and its implications. Int. J. Manag. App. Sci. 4(2).
4. Kheyami D., Jaradat A., Al-Shibani T. and Ali F.A. (2018). Item analysis of multiple choice questions at the Department of Paediatrics, Arabian Gulf University, Manama, Bahrain. Sultan Qaboos Univ. Med. J. 18(1): e68–e74.
5. Towns M.H. (2014). Guide to developing high-quality, reliable, and valid multiple-choice assessments. J. Chem. Educ. 91(9): 1426–1431.
6. Chauhan P., Chauhan G.R., Chauhan B.R., Vaza J.V. and Rathod S.P. (2015). Relationship between difficulty index and distracter effectiveness in single best-answer stem type multiple choice questions. Int. J. Anat. Res. 3(4): 1607–1610.
7. Backhoff E., Larrazolo N. and Rosas M. (2000). The level of difficulty and discrimination power of the basic knowledge and skills examination. Revista Electrónica de Investigación Educativa 2(1).
8. Gnaldi P., Matteucci M., Mignani S. and Falocci N. (2015). Methods of item analysis in standardized student assessment: an application to an Italian case study.
9. Quaigrain K. and Arhin A.K. (2017). Using reliability and item analysis to evaluate a teacher-developed test in educational measurement and evaluation. Cogent Education 4: 1301013.
10. Menon A.R. and Kannambra P.N. (2017). Item analysis to identify quality multiple choice questions. National Journal of Laboratory Medicine 6(2): MO07–MO10.
11. Shenoy P., Sayeli V. and Rao R.R. (2016). Item analysis of multiple choice questions: a pilot attempt to analyze formative assessment in pharmacology. Res. J. Pharm. Biol. Chem. Sci. 7(2): 1683.
12. Sabri S. (2013). Item analysis of student comprehensive test for research in teaching beginner string ensemble using model based teaching among music students in public universities. Int. J. Educ. Res. 1(12).
13. Mitra N.K., Nagaraja H.S., Ponnudurai G. and Judson J.P. (2009). The levels of difficulty and discrimination indices in type-A multiple choice questions of pre-clinical semester-1 multidisciplinary summative tests. IeJSME 3(1): 2–7.
14. Sim S.M. and Rasiah R.I. (2006). Relationship between item difficulty and discrimination indices in true/false-type multiple choice questions of a para-clinical multidisciplinary paper. Ann. Acad. Med. Singapore 35: 67–71.
15. Mukherjee P. and Lahiri S.K. (2015). Analysis of multiple choice questions: item and test statistics from an assessment in a medical college of Kolkata, West Bengal. J. Dent. Med. Sci. 14(12): 47–52.
16. Kolte V. (2015). Item analysis of multiple choice questions in physiology examination. Indian J. Basic Appl. Med. Res. 4(4): 320–326.
17. Tavakol M. and Dennick R. (2011). Post-examination analysis of objective tests. Medical Teacher 33: 447–458.
18. Arega Yirdaw (2016). Quality of education in private higher institutions in Ethiopia: the role of governance. SAGE Open: 1–12.
19. Fekede Tuli (2012). Examining quality issues in primary schools in Ethiopia: implications for the attainment of the Education for All goals. ECPS Journal 5/2012.
20. Ministry of Education, Ethiopia (2008). General Education Quality Improvement Package (GEQIP). November 2008.
21. Adhi M.I. and Aly S.M. (2018). Student perception and post-exam analysis of one best MCQ and one correct MCQs: a comparative study. J. Pak. Med. Assoc. 68(4).
