Post Exam Item Analysis: Implication for Intervention

Abstract
Difficulty indices, discrimination indices and distracter efficiencies are commonly investigated in item analysis. This research was intended to investigate the difficulty and discrimination indices, distracter efficiency, whole-test reliability and construct defects in a summative test for a freshman common course at Gondar CTE. In this study, 176 exam papers were analyzed in terms of difficulty index, point bi-serial correlation and distracter efficiency. Internal consistency reliability and construct defects such as meaningless stems, punctuation errors and inconsistencies in option formats were also investigated. Results revealed that the summative test as a whole had a moderate difficulty level (0.56 ± 0.20) and good distracter efficiency (85.71% ± 29%). However, the exam was poor in terms of discrimination power (0.16 ± 0.28) and internal consistency reliability (KR-20 = 0.58). Only one item had good discrimination power and one other item was excellent in its discrimination. About 41.9% of the items were either too easy or too difficult. Inconsistent option formats, inappropriate options, punctuation errors and meaningless stems were also observed. Thus, future test development interventions should give due emphasis to item quality.
Key words: Item analysis; difficulty index; discrimination coefficients; distracter efficiency
1. Background
Education quality in Ethiopia seemed to be compromised by the rapid expansion of higher education institutions in the country (4). According to Arega Yirdaw (2016), problems in the teaching-learning process were among the key factors determining education quality in private higher institutions in Ethiopia (18).
It is advisable to use appropriate instruments for assessing students at higher institutions (5). The rationale for employing effective assessment tools is that assessment of students' achievement is an integral part of the teaching-learning process (2). Assessments should track each student's performance in a given course. With this in mind, instructors at colleges and universities must be aware of the quality and reliability of their exams in a given course; otherwise, the final results may lead to biased evaluation and certification (5). Instructors usually receive little or no training on the quality of assessment tools. Trainings usually focus on large-scale test administration and standardized test score interpretation, not on strategies for constructing tests or on item-writing rules (2). The quality and reliability of assessments can be improved by delivering trainings on test construction and item analysis.
Item analysis involves collecting, summarizing and using information from students' responses to assess the quality of test items (21). It allows teachers to identify items that are too difficult or too easy, items that do not discriminate between high- and low-ability students, and items that have implausible distracters (2, 3). By analyzing items, teachers/instructors can remove too easy or too difficult items, improve distracters' efficiency and exclude non-discriminating items from the pool of future test items. They can also identify content that is hard for students to understand and adjust the way they teach (2).
According to reports, there was a serious problem in the quality of education in Ethiopia (4, 19). Students' achievement grading in Ethiopia is carried out by administering teacher-made
classroom tests and national examinations (20). It is believed that assessing students' performance solely with objective items at the school and national levels might have contributed negatively to education quality (20). Therefore, objective test items need to meet psychometric standards in order to measure the outcomes set by the course objectives. Researchers have suggested that objective examination results can be analyzed to improve the validity and reliability of assessments (17). Therefore, the objective of this research was to analyze the post-examination results of a summative exam in a basic natural science course at Gondar CTE. Based on the results, areas for intervention in future test development were recommended.
2. Methods
2.1 Research Design
The validity and reliability of a summative test in a freshman common course entitled 'Basic Natural Science-I' were assessed using a descriptive analytical method. Of the two approaches to item analysis, classical test theory (CTT) and item response theory (IRT), CTT was employed due to its simplicity and the lack of software for IRT analysis. The psychometric parameters considered in this study were difficulty indices, point bi-serial correlations, internal consistency reliability and distracter efficiency.

A total of 176 exam papers were
selected using stratified random sampling. A stratified sampling technique was employed to include representative samples from each department. The sample exam papers were collected from science instructors in the Department of Natural Science, Gondar CTE. Demographic data of the sampled students were collected from the registrar's office of the college.
The summative test of the freshman common course 'Basic Natural Science-I' was used as the research instrument. The first reason this course was selected was that its summative test was developed by instructors with Biology, Chemistry and Physics backgrounds; the findings would therefore be applicable to the whole Department of Natural Science. The other reason was that it is a compulsory common course given to a large population of students across all new modality streams in the college. Furthermore, the course is a pre-requisite for most other courses within the integrated natural science stream. Therefore, it would be better if an effective assessment tool were prepared by the department. The summative test used in this study contained 31 objective items: 21 multiple-choice questions, 7 true/false items and 3 matching questions. All 31 items were considered for analysis. For item analysis, correct responses were coded as 1 and wrong responses as 0. The maximum possible mark was therefore 31.
For face validity, each item was inspected for construct defects such as:
- Inappropriate/incomplete stems
- Punctuation errors
- Inconsistent option formats
- Inappropriate options/alternatives
Internal consistency reliability: The test as a whole was investigated to determine its overall reliability. The two most commonly used measures of reliability are Cronbach's alpha (α) and the Kuder-Richardson method (KR-20). KR-20 is used to measure the reliability of tests with dichotomous items (17); therefore, KR-20 was used in this study to estimate the test reliability. The objective test items were scored dichotomously as right or wrong (17): every correct response was coded as 1 and every wrong response as 0. The acceptable value for test reliability in much of the literature is α ≥ 0.7, so a KR-20 value of 0.7 or above was considered acceptable. KR-20 is computed as KR-20 = (k / (k − 1)) × (1 − Σ p_i q_i / σ²), where k is the number of items, p_i is the proportion answering item i correctly, q_i = 1 − p_i, and σ² is the variance of the total scores.
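For illustration only, the sketch below computes KR-20 from a hypothetical 0/1 response matrix. Python and NumPy are assumptions of this sketch; the study itself used SPSS and Excel, and the data shown are invented.

```python
# Minimal KR-20 sketch for dichotomously scored items.
# Rows = students, columns = items; the matrix is invented toy data.
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson 20 reliability of a 0/1 response matrix."""
    k = responses.shape[1]             # number of items
    p = responses.mean(axis=0)         # proportion correct per item
    q = 1.0 - p
    totals = responses.sum(axis=1)     # each student's total score
    return (k / (k - 1)) * (1.0 - (p * q).sum() / totals.var(ddof=1))

resp = np.array([[1, 0, 1, 1],
                 [1, 1, 1, 0],
                 [0, 0, 1, 0],
                 [1, 1, 1, 1],
                 [0, 1, 0, 0]])
print(f"KR-20 = {kr20(resp):.2f}")     # values >= 0.7 would be acceptable
```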
Difficulty index: The difficulty index (p) applies to items scored dichotomously and can be calculated for true-false, multiple-choice and matching items. In this study, the difficulty index for every item was determined by dividing the number of respondents who answered the item correctly by the total number of students taking the test. Simply, p was computed using Microsoft Excel 2007 based on the formula below, and the average difficulty of the whole test was then obtained:

p = (number of students answering the item correctly) / (total number of students taking the test)
The value of p ranges from 0 to 1; the higher the value, the easier the item, and vice versa. The recommended range of difficulty level is between 0.3 and 0.7 (1, 6). Items having p-values below 0.3 and above 0.7 are considered too difficult and too easy, respectively (1).
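As a short sketch of the same computation, assuming the coded responses sit in a NumPy array rather than the Excel 2007 sheet actually used (the matrix below is invented), each item's p and its classification under the 0.3/0.7 cut-offs can be obtained as follows.

```python
# Difficulty index p = correct responses / total examinees, per item.
# Rows = students, columns = items; invented toy data.
import numpy as np

resp = np.array([[1, 0, 1, 1],
                 [1, 1, 1, 0],
                 [0, 0, 1, 0],
                 [1, 1, 1, 1],
                 [0, 1, 0, 0]])

p = resp.mean(axis=0)              # proportion answering each item correctly
for i, val in enumerate(p, start=1):
    label = ("too difficult" if val < 0.3
             else "too easy" if val > 0.7
             else "moderate")
    print(f"q{i}: p = {val:.2f} ({label})")
```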
Discrimination coefficient: The discrimination power of an item reflects its ability to differentiate between students who are high performing and those who are not (17). It can be calculated by the
extreme group method, the point bi-serial correlation coefficient (r), or other methods. The extreme group method considers only 54% of the respondents (the top 27% and the bottom 27%); the point bi-serial correlation coefficient, on the other hand, takes all respondents into account. Besides, it also indicates the relationship between a particular item on a test and the total test score (12, 17). For this reason, the point bi-serial correlation coefficient was used in this study; it was computed using SPSS version 20. Its value ranges between -1 and 1, and a higher value indicates stronger discrimination power. The test items in this study were classified based on the standard depicted in the table below.
r ≥ 0.40: excellent; 0.30 – 0.39: good; 0.20 – 0.29: average; 0.00 – 0.19: poor; r < 0: worst (negative discrimination).
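The snippet below sketches this computation, with scipy.stats.pointbiserialr assumed as a stand-in for the SPSS procedure actually used and an invented response matrix. Correlating each item with the raw total score mirrors the plain item-total approach described above; a corrected item-total (subtracting the item from the total) is a common refinement.

```python
# Point bi-serial correlation of each dichotomous item with the total score.
# Invented 0/1 data; SciPy is assumed in place of SPSS version 20.
import numpy as np
from scipy.stats import pointbiserialr

resp = np.array([[1, 0, 1, 1],
                 [1, 1, 1, 0],
                 [0, 0, 1, 0],
                 [1, 1, 1, 1],
                 [0, 1, 0, 0]])

totals = resp.sum(axis=1)
for i in range(resp.shape[1]):
    r, _ = pointbiserialr(resp[:, i], totals)  # item vs. total test score
    print(f"q{i + 1}: r = {r:.3f}")
```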
Distracter efficiency: Students' performance on a multiple-choice question (MCQ) in an exam is very much influenced by the quality of the given distracters. Hence, it is necessary to determine the effectiveness of the distracters in a given MCQ. Distracter effectiveness indicates the percentage of students choosing an option as the answer. It was calculated based on the number of non-functional distracters (NFDs) per item, an NFD being defined as an incorrect option selected by fewer than 5% of students. The DE was considered to be 0%, 33.3%, 66.7% or 100% if an item had three, two, one or zero NFDs, respectively.
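A small sketch of this NFD/DE rule for a single MCQ follows; the option letters, the key and the response list are all made-up illustrations, not items from the study.

```python
# Count non-functional distracters (NFDs: incorrect options chosen by < 5%
# of examinees) and map the NFD count to distracter efficiency (DE).
from collections import Counter

choices = list("AABCBDAAABACA")    # one item's answers from 13 students (invented)
key = "A"                          # correct option
counts = Counter(choices)
n = len(choices)

distracters = [opt for opt in "ABCD" if opt != key]
nfds = sum(1 for opt in distracters if counts.get(opt, 0) / n < 0.05)
de = {0: 100.0, 1: 66.7, 2: 33.3, 3: 0.0}[nfds]
print(f"NFDs = {nfds}, DE = {de}%")
```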
3. Data Analysis
SPSS version 20 and Microsoft Excel 2007 were used for storing and analyzing data. Descriptive statistics (frequencies, means and standard deviations of the demographic information) as well as point bi-serial correlations were determined using SPSS-20. Difficulty indices and percentages were computed using Excel. Face validity was described qualitatively. Figures and tables were used to display the results.
4. Results
Demographic data: Table 1 shows the demographic characteristics of the students whose exam papers were analyzed. Forty-two percent of the sample were females and males constituted
fifty-eight percent. More than half (58.5%) of the study participants were in the age group 20 – 25 years; 38.1% were under 20 and only 3.4% were 26 or above. The mean age was 20.26 ± 2.13 years.
Test statistics: Results in Table 2 showed that students' scores ranged from 5 to 27 (out of 31), with a mean of 17.23 ± 3.85. There was no statistically significant mean difference between males and females (p = 0.311, df = 174). The histogram in Figure 1 revealed that the total score was approximately normally distributed.

Table 2: Test statistics (excerpt)
Mean: 17.23
Std. error of mean: 0.29
Median: 17.00
Mode: 15.00
Figure 1: Distribution of raw test scores in males and females
Construct defects (face validity): The face validity review revealed the following findings:
- Punctuation errors (missing full stops and question marks) in questions 5, 6, 7, 11 and 15.
- Inconsistent option formats (the option format changed from question 26 to 31).
- Inappropriate stems (meaningless or incomplete) in questions 12, 13, 14, …
- Inappropriate options/alternatives ('all of the above', 'A and B') in questions 12, 13, 15, 22, …
Internal consistency reliability: In this study, internal consistency reliability was used to evaluate the performance of the test as a whole. The computed KR-20 value of the test was 0.58, which is less than the recommended minimum of 0.7.
Difficulty index: Appendix A shows the distribution of difficulty indices (p) for each item. One item (q.19) has the highest p-value (0.82) and q.18 has the lowest (0.15). Eighteen items (58.1%) have a moderate difficulty level (p between 0.3 and 0.7). Twelve items (38.7%) have excellent difficulty levels (p between 0.4 and 0.6). Thirteen items (41.9%) lie outside the moderate difficulty range: three items (9.6%) were too difficult (p < 0.3) and ten items (32.3%) too easy (p > 0.7). The mean difficulty index was 56%, that is, p = 0.56 (SD 0.20). A summary of the difficulty indices is presented in Table 3.
Discrimination coefficients: Appendix B shows the point bi-serial correlation coefficient for each item. Three items (q.2, q.18 and q.20) have negative discrimination power (worst). Only a single item (q.31) has excellent discrimination power (r > 0.4). Seventeen items (54.8%) were categorized as poor (r < 0.20) and nine items (29%) as average (r = 0.20 – 0.29) (Table 4). Question number 7 is an ideal item in terms of difficulty level (p = 0.54, Appendix A)
and good in terms of discrimination (r = 0.39, Appendix B). The mean discrimination power is 0.16 (SD 0.28); in much of the literature, the acceptable mean r value is ≥ 0.4.
Table 5 below displays the combination of the two indices, item difficulty and discrimination. According to this table, two items (q.7 and q.31) have moderate p-values (p = 0.3 – 0.7) and good discrimination (r ≥ 0.3). However, there was no single item that could be labeled as excellent in both difficulty and discrimination (p = 0.4 – 0.6 and r ≥ 0.4). Easy items (p > 0.7) such as q.3, q.5, q.10 and q.24 have poor discrimination (r < 0.2). Furthermore, difficult items (p < 0.3) such as q.12, q.14 and q.18 have very low discriminating power (r < 0.2). The difficulty level of q.2 was ideal (p = 0.51) but its discrimination power was the worst (r = -0.004). Ideal questions with p-values from 0.4 to 0.6 (q.6, q.16, q.21, q.23 and q.26) have poor r-values (< 0.2). There was no statistically significant correlation between the difficulty index and the discrimination coefficient.
Table 5: Combination of item difficulty (p) and discrimination (r) indices (excerpt)
item  p     r      | item  p     r
q1    0.71  0.208  | q17   0.64  0.017
…
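The p-versus-r check reported above can be sketched in a few lines, assuming the per-item indices are available as plain lists; Python with SciPy is an assumption here, and the values below are placeholders rather than the study's 31 item indices.

```python
# Pearson correlation between difficulty indices (p) and point bi-serial
# discrimination coefficients (r) across items. Placeholder values only.
from scipy.stats import pearsonr

p_vals = [0.71, 0.51, 0.54, 0.82, 0.15, 0.47, 0.56]     # difficulty indices
r_vals = [0.21, -0.004, 0.39, 0.12, -0.05, 0.18, 0.22]  # discrimination
corr, pval = pearsonr(p_vals, r_vals)
# A p-value above 0.05 would indicate no statistically significant
# linear association between difficulty and discrimination.
print(f"correlation = {corr:.3f}, p-value = {pval:.3f}")
```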
Fig. 2 shows a graphical representation of the difficulty indices and discrimination powers. The scatter plot allows identification of appropriate (valid and reliable) questions at the center of the graph. Moreover, the representation makes it easy to notice immediately which questions are too easy or too difficult. According to Fig. 2, r increases up to the point where p approaches 0.4, after which it declines.
Figure 2: Scatter plot of difficulty and discrimination indices
Distracter efficiency: The distracter analysis shows that six items (28.57%) (q.12, q.13, q.19, q.20, q.24 and q.28) have a total of nine NFDs, each with a choice frequency of < 5%. All other items have no NFDs (Appendix C). In addition, four items (q.12, q.14, q.18 and q.20) have six distracters that were selected by more students than the correct answer (key). There was no item with three NFDs, but three items (q.12, q.13 and q.19) have two NFDs and three items (q.19, q.20 and q.28) have one NFD (Table 6). The overall mean DE was 85.71%, with a minimum of 33.3% and a maximum of 100%.
Table 6: Summary of distracter analysis for the MCQs
Number of items: 21
Total distracters: 63
Functional distracters: 54 (85.71%)
Non-functional distracters (NFDs): 9 (14.29%)
Items with 3 NFDs (DE = 0%): 0
Items with 2 NFDs (DE = 33.3%): 3
Items with 1 NFD (DE = 66.7%): 3
Items with 0 NFDs (DE = 100%): 15
Items with distracters chosen over the key: 4
Overall mean DE: 85.71% ± 24.89%
Difficult items such as q.12, q.14 and q.18 have DE between 66.7% and 100%. A similar result was recorded for easy questions such as q.19, q.22 and q.24. Some items with poor or good p-values have similar DE values (Table 7). Only a single item (q.31) satisfies all three parameters of an ideal question (p > 0.3, r > 0.4 and DE = 100%; Table 8).
Table 7: Comparison of item difficulty with distracter efficiency for MCQs
Item   p      DE (%)
q11    0.6    100
q12    0.19*  66.7
q13    0.53   33.3
q14    0.2*   100
q15    0.38   100
q16    0.47   100
q17    0.64   100
q18    0.15*  100
q19    0.82   66.7
q20    0.3    100
q21    0.56   100
q22    0.8    100
q23    0.53   100
q24    0.81*  100
q25    0.7    100
Table 8 (columns: item, p, r, DE)
Discussion
By analyzing summative assessments, it is possible to modify future test development techniques or classroom instruction. With this intention, the current study conducted a post-exam item analysis based on psychometric standards. Based on the findings, areas for intervention in future test development were identified.
The internal reliability calculated for this summative test was 0.58. This value is a bit less than the expected range in most standardized assessments (α ≥ 0.7). According to (8), a Cronbach's alpha of 0.71 was obtained in a standardized Italian case study. Reliability can be categorized as excellent if α > 0.9, very good if between 0.8 and 0.9, and good if between 0.6 and 0.7 (1). If the reliability value lies within 0.5 – 0.6, revision is required, and the test is questionable if reliability falls below 0.5 (1). Based on this standard, the summative test administered requires revision because its KR-20 value (0.58) is less than 0.7. This might also imply that college educators need to validate their assessment tools through item analysis. According to Fraenkel and Wallen in (12), one should attempt to generate a KR-20 reliability of 0.70 and above.
According to Table 3, 58.1% (18) of the items in the summative test have average difficulty (p = 0.3 – 0.7). An item is considered good if its p-value lies in this moderate range (17). In this study, slightly more than half of the exam items have moderate difficulty. It would be important to include more questions of average difficulty, even though the mean difficulty level (0.56) is acceptable. Similar findings were reported in several other studies (1, 6, 9).
Questions that are too easy or too difficult contribute little information regarding a student's ability (17). Data in this study showed that 32.3% of the test items were too easy (recommended: 10% – 20%) and 9.6% were too difficult (recommended: 20%). Though it is advisable to include easy and difficult items in a given test (10), it would be better if the recommended limits were met. Hence, this exam paper contained more easy items and fewer
difficult items than recommended. A difficult item could mean that the topic was difficult for students to grasp (10, 11), that it was not taught well (10), that the item was mis-keyed (12), or that students were poorly prepared.
According to (12), the discriminatory power of individual items can be computed by the discrimination index, the biserial correlation coefficient, the point biserial coefficient or the phi coefficient. In this study, the discrimination power of every item was calculated using the point biserial coefficient. The results (Table 4) showed that only one item could be considered 'excellent' (r > 0.4) and another one reasonably good (r = 0.393). All other items in this summative test need revision or improvement (r < 0.3). A similar finding was reported in (3), where not a single item had a discrimination index greater than or equal to 0.30. Contrary to this study, 46.67% of items had good to excellent discrimination power (r ≥ 0.3) in another report.
Large and positive values of the point bi-serial correlation are desirable, as they indicate that students who get an item right tend to obtain high scores on the overall test, and vice versa (8). An item with negative and/or low discriminating power needs to be reconsidered in subsequent test development phases. In this study, three items had negative discrimination. This could be because low-ability students guessed these items right while high-ability students, suspicious of a hidden catch, were less successful (16). Items with negative discrimination decrease the validity of the test and should be removed from the collection of questions (10, 12, 13, 14).
Difficulty and discrimination indices are often reciprocally related. However, this may not always be true. Questions having a high p-value tend to discriminate poorly; conversely, questions with
a low p-value may discriminate well (17). This variation can result from students guessing when selecting the correct responses (12). The data (Table 5) suggest that guessing occurred in this study. According to (1), moderately difficult items should have the maximal discriminative ability. The findings of this study contradict (1); this may reflect that some correct responses were obtained by guessing rather than ability.
Distracter analysis was conducted to determine the relative usefulness of the distracters in each item. Fifteen items (71.4%) have no NFDs (DE = 100%), three items have one NFD (DE = 66.7%) and three items have two NFDs (DE = 33.3%; Table 6). There is no item with three NFDs (DE = 0%). On the other hand, six distracters (9.5%) (12-A, 14-A, 14-B, 18-B, 18-C and 20-A) were selected by more students than the correct answer, which may indicate that these items were confusing (12). The overall mean DE of 85.71% obtained in this study is considered acceptable. A similar finding was reported by (10) in an internal microbiology examination.
Non-functional distracters make an item easier and reduce its discrimination (10). Question number 31 (Table 8) has moderate difficulty and excellent discrimination power, probably due to the absence of NFDs. However, this does not hold for the other test items, probably due to other factors.

Conclusion
Post-exam item analysis is a simple but effective method to assess the validity and reliability of a test. It detects specific technical flaws and provides information for further test improvement. An item with average difficulty (p = 0.3 – 0.7), high discrimination (r ≥ 0.4) and a high DE value
(> 70%) is considered ideal. In this study, the summative test as a whole had moderate difficulty (mean = 0.56) and good distracter efficiency (mean = 85.71%), but it poorly discriminated between high- and low-achieving students (mean r = 0.16). The test as a whole needs revision, as its reliability was not reasonably good (KR-20 = 0.58). Some flaws in item writing were also observed.
According to Xu and Liu (2009), cited in (1), teachers' knowledge of assessment and evaluation is not static but dynamic and ongoing. Therefore, it is plausible to suggest that teachers and instructors receive in-service seminars on test development. Since most of the summative tests constructed within the college are objective types, item analysis is recommended for instructors at various points in their teaching careers. It is also suggested that a specific unit be made responsible for testing and for the analysis of items after exam administration.
Acknowledgment
I would like to thank the lecturers at the Department of Natural Science, Gondar CTE, for providing the exam papers for the study. I would like to extend my appreciation to Mr. Awoke Debebe for his assistance.
References
1. Zia-ul-Islam and Usmani A. (2017). Psychometric analysis of anatomy MCQs in modular examinations.
2. Siri A. and Freddano M. (2011). The use of item analysis for the improvement of objective examinations. Procedia - Social and Behavioral Sciences 29: 188–197.
3. Deshpande S. and Prajapati R.K. (2018). Item analysis of mid-trimester test paper and its implications. Int. J. Manag. App. Sci. 4(2).
4. Kheyami D., Jaradat A., Al-Shibani T. and Ali F.A. (2018). Item analysis of multiple choice questions at the department of paediatrics, Arabian Gulf University, Manama, Bahrain. SQU Med. J. 18(1): e68–74.
5. Towns M.H. (2014). Guide to developing high-quality, reliable, and valid multiple-choice assessments. J. Chem. Educ. 91(9): 1426–1431.
6. Chauhan P., Chauhan G.R., Chauhan B.R., Vaza J.V. and Rathod S.P. (2015). Relationship between difficulty index and distracter effectiveness in single best-answer stem type multiple choice questions. Int. J. Anat. Res. 3(4): 1607–10.
7. Backhoff E., Larrazolo N. and Rosas M. (2000). The level of difficulty and discrimination power of the basic knowledge and skills examination. Revista Electrónica de Investigación Educativa 2(1).
8. Gnaldi P., Matteucci M., Mignani S. and Falocci N. (2015). Methods of item analysis in …
9. Quaigrain K. and Arhin A.K. (2017). Using reliability and item analysis to evaluate a teacher-developed test in educational measurement and evaluation. Cogent Education 4(1).
10. Menon A.R. and Kannambra P.N. (2017). Item analysis to identify quality multiple choice questions. National J. Lab. Med. 6(2): MO07–MO10.
11. Shenoy P., Sayeli V. and Rao R.R. (2016). Item analysis of multiple choice questions: A pilot attempt to analyze formative assessment in pharmacology. Res. J. Pharm. Biol. Chem. Sci. …
12. Shafizan Sabri S. (2013). Item analysis of student comprehensive test for research in teaching beginner string ensemble using model based teaching among music students in …
13. Mitra N.K., Nagaraja H.S., Ponnudurai G. and Judson J.P. (2009). The levels of difficulty and discrimination indices in type-A multiple choice questions of pre-clinical semester-1 …
14. Sim S.M. and Rasiah R.I. (2006). Relationship between item difficulty and discrimination …
15. Mukherjee P. and Lahiri S.K. (2015). Analysis of multiple choice questions: Item and test statistics from an assessment in a medical college of Kolkata, West Bengal. J. Den. Medi. …
16. Kolte V. (2015). Item analysis of multiple choice questions in physiology examination.
17. Tavakol M. and Dennick R. (2011). Post-examination analysis of objective tests. Medical Teacher 33(6): 447–458.
18. Arega Yirdaw (2016). Quality of education in private higher institutions in Ethiopia: The …
19. Fekede Tuli (2012). Examining quality issues in primary schools in Ethiopia: Implications for the attainment of the Education for All goals. ECPS J. 5/2012.
20. Ministry of Education, Ethiopia (2008). General education quality improvement package …
21. Adhi M.I. and Aly S.M. (2018). Student perception and post-exam analysis of one best MCQ and one correct MCQs: A comparative study. J. Pak. Med. Assoc. 68(4).