Assuring The Quality of High-Stakes Undergraduate Assessments of Clinical Competence
As political attention has focused on the performance of doctors in practice (Bristol Royal Inquiry Report, 2001; Southgate et al., 2001; Shipman Inquiry, 2005), universities are obliged to play their part by ensuring they graduate only those students who have reached the required level of competence. Dissatisfaction with the present situation in the UK has led, in some quarters, to calls for the establishment of a national medical qualifying examination similar to the National Board of Medical Examiners (NBME) United States Medical Licensing Examination (Patel, 2001; Wakeford, 2001; Bligh, 2003). However, the clinical competence examinations developed by the NBME are complex and expensive, rely heavily on simulated patient-based (SP) ratings, and have to be undertaken at national testing centres. Their quality assurance system (Boulet et al., 2003) depends on the application of psychometric techniques available only within a few UK research-based centres.

In this paper, we describe an integrated final-year assessment of clinical competence, and the accompanying quality assurance mechanisms, which have been developed over the last six years at the University of Sheffield. These ensure that the examination is valid, reliable and feasible to conduct in a university setting, so that student scores are an accurate measure of students' true ability and reasonably free from measurement error. We expect the principles we demonstrate will be of interest to those in medical schools with the responsibility of certifying the competence of their graduates.

(Newble, 2004). Relevant knowledge, including aspects of diagnosis, investigation and management, can be more feasibly (and cheaply) tested with written formats. However, not all aspects of clinical competence can be validly tested in an examination setting, and some need to be part of summative in-course assessments, for example professional behaviours, which are often not formally assessed at undergraduate level.

Reliability

The search for standardization in clinical competence assessments demands the identification and reduction of any measurement error or biases, due to variation in test items, examiners, patients or examination procedures, that might affect the observed scores of individual examinees. Generalizability studies have been highly influential in providing guidance for identifying where the largest sources of measurement error might lie and suggesting ways of refining assessment procedures in order to enhance reliability. For example, the common finding of lack of correlation of scores across cases in clinical competence assessments is often referred to as content specificity (van der Vleuten, 1996). Studies have shown that content specificity is the major contributor to unreliability, much more so than unreliability attributable to inconsistencies in marking. This means that testing competences across a large sample of cases has to be done before a reliable generalization as to student competence can be made.
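To make this reasoning concrete, the sketch below estimates variance components for a fully crossed student × case design and projects how reliability grows as more cases are sampled. It is a minimal illustration of the kind of generalizability analysis cited above, not the authors' actual analysis; the score matrix and all function names are hypothetical.

```python
import numpy as np

def variance_components(scores):
    """Variance components for a fully crossed student x case design with
    one observation per cell (two-way ANOVA without replication).
    `scores` is a students x cases array of marks."""
    n_p, n_c = scores.shape
    grand = scores.mean()
    ss_p = n_c * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_c = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_c
    ms_p = ss_p / (n_p - 1)
    ms_c = ss_c / (n_c - 1)
    ms_res = ss_res / ((n_p - 1) * (n_c - 1))
    var_person = max((ms_p - ms_res) / n_c, 0.0)  # true student variance
    var_case = max((ms_c - ms_res) / n_p, 0.0)    # case difficulty variance
    var_resid = ms_res  # student-by-case interaction (content specificity) + error
    return var_person, var_case, var_resid

def projected_reliability(var_person, var_resid, n_cases):
    """D-study: generalizability coefficient for a test of n_cases cases."""
    return var_person / (var_person + var_resid / n_cases)

# Hypothetical demonstration: 195 students, 39 items.
rng = np.random.default_rng(0)
ability = rng.normal(0, 5, size=(195, 1))            # true student effects
scores = 60 + ability + rng.normal(0, 8, (195, 39))  # plus case-specific noise
vp, vc, vr = variance_components(scores)
print(round(projected_reliability(vp, vr, 39), 2))
```

A large residual (student-by-case) component relative to the student component is the numerical signature of content specificity, and the projection shows directly why many cases must be sampled before scores generalize.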
format of assessments throughout the course to make them both integrated and standardized. This allowed the quality assurance systems to be developed and to be embedded in the school at both academic and administrative levels. In particular, academics and students gained increasing confidence that the new examination procedures would function as intended and that the resulting decision-making processes were acceptable to the examination committee, the external examiners, the university, the students, and the Postgraduate Dean responsible for PRHO placements and training.

Blueprinting

The learning outcome objectives for clinical competence are defined within a core curriculum. These objectives have been related to 95 clinical problems which graduates are likely to meet as PRHOs (Newble et al., 2005). They are available to both staff and students in a searchable database accessed via Minerva, the school's networked learning environment (Roberts et al., 2003). This information forms the content of the core curriculum pertaining to clinical competence and the underpinning medical sciences. The aim of the final-year assessment of clinical competence is to examine an appropriate sample of this curricular database. The blueprint for the final examination has the competences on one axis (e.g. history-taking, communication skills, practical skills, management etc.) and the problem content on the other axis, with problems allocated as appropriate under body systems, some therefore being represented in more than one system. A multi-disciplinary team of medical teachers, including experts in assessment and psychometrics, has developed an agreed sampling procedure based on the blueprint and information from prior examinations. Items are then requisitioned or accessed from a bank of items used previously, or developed from workshops as part of the examiner training system. New items are reviewed by the team. The proposed written and clinical components of the examination are then sent for ratification to the external examiners, usually clinicians who are conversant with the best-practice assessment principles we endorse.

Test format

An OSCE was first introduced as the clinical component of the final-year examination in June 1999. The short (five-minute) station format was adopted, as this is the approach used in most medical schools in the UK (Newble, 2002) and by many licensing bodies such as the GMC (Tombleson et al., 2000) and the Medical Council of Canada (Reznick et al., 1996). The initial reliability for this 16-station examination was 0.64 (Cronbach's alpha), which was not considered adequate. However, it was not feasible to provide the four or more hours of clinical examination time, requiring at least 40 stations, that generalizability studies predict would be needed to achieve a reliability greater than 0.8 for a high-stakes summative assessment (Newble & Swanson, 1988). As an alternative strategy, in 2003, the OSCE was combined with a written component consisting of modified essay questions (MEQs). These were, in effect, almost identical in format to some of the non-patient-based OSCE stations assessing the ability to interpret investigational data such as imaging and ECGs. All stations and written questions were still focused on clinical tasks determined in advance from the same blueprint. This now provided a test sample of nearly 40 items.

The examination is taken in two stages, a week apart. The written component is undertaken first, in an examination hall. Each MEQ, and each of the written OSCE items within the clinical component of the examination (referred to as static stations), is marked anonymously by a single examiner using structured answer sheets. The clinical component is run in a hospital or clinical skills centre setting. There are four separate OSCE venues, each able to conduct four circuits in the one day. Students rotate through the circuits in groups of 13–15, depending on the cohort size. While generalizability studies have suggested that only one examiner need be used for scoring each OSCE station without significantly affecting examination reliability (van der Vleuten, 1996), enough examiners had been recruited through our training programme to double up for added fairness and reassurance to the students. Doubling up also adds to collegiality and to faculty development (e.g. less experienced examiners can be teamed with more experienced ones). In the interests of examination security, students are corralled so that there is no contact between students who have taken the examination and those about to take it (Swanson et al., 1995).

At clinical stations, a mixture of simulated and real patients is used. Simulated patients are drawn from a pool of volunteers and actors, managed by a coordinator. They are trained on their clinical scenario by clinical skills staff and senior clinical examiners prior to the examination. Real patients with well-established physical signs are involved in at least three of the physical examination stations. One physical examination station is a 10-minute station, accorded double marks, allowing time for students to complete a more extensive task.

The ways in which the refinements to the structure of the whole examination have been introduced during the period 2001–04 are given in Table 1.
Table 1. Changes in the number of items in the clinical competence examination in the years 2001 to 2004. (For each of the four years, the table lists the number of items and the total marks.)
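The decision to lengthen the examination can be sanity-checked with the Spearman–Brown prophecy formula, which predicts reliability when a test is lengthened with comparable items. A minimal sketch using the figures quoted above (the function name is ours):

```python
def spearman_brown(reliability, lengthening_factor):
    """Projected reliability when a test is lengthened by the given factor."""
    k, r = lengthening_factor, reliability
    return (k * r) / (1 + (k - 1) * r)

# 16 stations gave alpha = 0.64; roughly 36 or more comparable items are
# predicted to be needed to exceed 0.8, consistent with the ~40-item target.
for n_items in (16, 24, 32, 36, 40):
    print(n_items, round(spearman_brown(0.64, n_items / 16), 2))
```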
Table 2. [Rating scale for each checklist item: performed competently / performed but not fully competent / not performed or incompetent.]
Scoring and standard setting

Examiners receive details of the station they will mark several days prior to the examination. For the written component and static OSCE stations, the questions and the marking sheet reflect the aim of assessing relevant knowledge and clinical problem-solving, with particular emphasis on diagnosis, investigations and patient management. For the observed stations in the clinical component, a briefing on the day reinforces the rules about the marking system and the standard-setting procedure. On the examiner marking sheet, the checklist items reflect the relevant history, physical examination or practical skills the student should obtain or perform. There are generally 8–12 checklist items for each case (see Table 2). Checklist items are weighted to avoid trivialization. Candidates are rated on each item as: having not performed or having demonstrated incompetence on the item (awarded no marks); having performed the item but not to the required level of competence (awarded half the marks for the task); or having performed the item competently, to the level of a starting PRHO (awarded the full mark for the item). Additionally, for both written and observed questions/stations, examiners are asked to provide a global rating, independently of their checklist scores, as to whether in their opinion the student had passed, failed or was borderline for competence on her/his performance of the overall task. This information is required for the standard-setting procedure.

Whilst various standard-setting methods are used for written and clinical examinations (Norcini, 2003), the borderline method was adopted (Smee, 2001; Wilkinson et al., 2001). In this approach, the borderline score for each question/station was calculated as the median score of all students identified as borderline by the examiners. At OSCE stations where there were two examiners, their marks were averaged. The overall borderline score for the whole examination comprises the sum of the borderline scores for all questions/stations (Table 3).

Performance of the examination

A summary of the examination results and psychometric data (Streiner & Norman, 2003) is provided to the examination board to support decision-making (Tables 3–5). The overall reliability (internal consistency) of the fully developed examination in 2003 and 2004 was 0.81 using Cronbach's alpha (Table 4). This is above the conventional 'gold standard' of 0.8 and confirms that the item sample is now large enough for high-stakes decision-making purposes. The Standard Error of Measurement (SEM) of the examination is used for grading and decision-making (Table 5). Various sub-analyses provide insights into the performance of different components of the examination (i.e. test items, examiners, patients or examination procedures) for quality assurance purposes. We have illustrated the appropriate tests and interpretations of data with some examples.
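One such calculation is the borderline method described under 'Scoring and standard setting'. A minimal sketch of the computation, assuming hypothetical data structures rather than the school's actual records:

```python
from statistics import median, mean

def station_cut_score(checklist_scores, global_ratings):
    """Borderline method for one question/station: the cut score is the
    median checklist score of students whose independent global rating
    was 'borderline'."""
    borderline = [s for s, g in zip(checklist_scores, global_ratings)
                  if g == "borderline"]
    return median(borderline)

def average_examiners(marks_a, marks_b):
    """Where a station had two examiners, their checklist marks are
    averaged before the cut score is computed."""
    return [mean(pair) for pair in zip(marks_a, marks_b)]

def exam_cut_score(stations):
    """Overall borderline score for the whole examination: the sum of the
    station/question cut scores. `stations` is a list of
    (checklist_scores, global_ratings) pairs, one per station or MEQ."""
    return sum(station_cut_score(s, g) for s, g in stations)
```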
Table 3. Median borderline scores, correlations and marks available for each blueprinted question/station in the examination. (Columns: station/MEQ ID; competence; system; problem; correlation to total minus item; max. marks available; median borderline score.)
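The 'correlation to total minus item' column in Table 3, and the Cronbach's alpha reported above, are standard item-analysis quantities. A sketch of how both might be computed from a students × items score matrix (the data layout is our assumption):

```python
import numpy as np

def cronbach_alpha(scores):
    """Internal consistency of the whole examination.
    `scores` is a students x items (stations/MEQs) array of marks."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1 - item_vars / total_var)

def item_total_correlations(scores):
    """Each item's correlation with the total of the remaining items
    (the 'correlation to total minus item' of Table 3); low or negative
    values flag stations for review."""
    total = scores.sum(axis=1)
    return np.array([np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
                     for j in range(scores.shape[1])])
```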
Table 5. Summary of provisional grading procedure for 195 students, showing bands of 1 SEM (2.4%) from the overall borderline score of 58.28%. (Columns: grade; definition of band; status; range; no. of students.)
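The banding in Table 5 can be reproduced mechanically: each student's percentage score is placed in a band a whole number of SEMs above or below the borderline score. A sketch using the figures quoted in the caption (the band labels are our assumption, not the school's published grade names):

```python
BORDERLINE = 58.28  # overall borderline score (%), from Table 5
SEM = 2.4           # standard error of measurement (%), from Table 5

def provisional_band(score):
    """Assign a provisional band of 1 SEM around the borderline score.
    Labels are illustrative only."""
    if score < BORDERLINE - SEM:
        return "clear fail"
    if score < BORDERLINE:
        return "fail (within 1 SEM below the cut score)"
    if score < BORDERLINE + SEM:
        return "borderline pass (within 1 SEM above the cut score)"
    return "pass"
```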
Figure 1. Exploring order effects at one OSCE site for four groups of students. [Plot of mean scores (y-axis approximately 60–80%) for successive OSCE student groups (B, F, J, M) at one venue.]
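Figure 1 plots mean scores for successive groups at one venue; the corresponding statistical check is a comparison of group means. A sketch of such a check, using one-way ANOVA on hypothetical group score lists:

```python
from scipy import stats

# Percentage scores for successive OSCE groups at one venue (made-up data).
groups = {
    "B": [62.1, 67.4, 71.0, 64.3, 69.8],
    "F": [66.2, 70.1, 63.5, 68.9, 72.4],
    "J": [61.8, 65.0, 69.2, 66.7, 63.9],
    "M": [70.3, 64.8, 67.5, 71.9, 66.1],
}

# One-way ANOVA across groups: a small p-value would suggest an order or
# location effect worth investigating before results are confirmed.
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```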
Examination procedures

Within such a complex OSCE examination, conducted at multiple sites with multiple rotations, it is valuable to investigate possible biases caused by order or location effects. In most years, it has been possible to show that no such biases existed. However, in 2004 a significant difference in the mean scores was detected between two consecutive rotations.

The method Sheffield adopted utilizes the SEM, calculated from the examination performance indicators discussed previously using the following formula:

SEM = SD × √(1 − r)

(where r is the reliability of the test and SD is the standard deviation of the test scores).
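The formula is easy to apply directly. As a check on the internal consistency of the published figures (our arithmetic, not the paper's): with r = 0.81, the reported SEM of 2.4% implies a score standard deviation of about 5.5%.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

print(standard_error_of_measurement(5.5, 0.81))  # ~2.4 (%), matching Table 5
```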
quality that is able to discriminate reliably amongst such excellent students.

A summary of the main statistical indicators over the period 2001–04 is shown in Figure 1. It indicates the improvement in reliability since the examination was lengthened in 2003 and, generally, how stable the mean scores, borderline scores and SEMs are across the differing cohorts of students.

Repeat assessments

University regulations demand that all students deemed not to have passed the examination be given a repeat assessment. Accordingly, all borderline students and students who have failed enter the repeat assessment examination, which is conducted approximately six months later. This has been controversial, in that it is usual to provide students with a much earlier opportunity for redemption. However, students not shown to be clinically competent cannot reasonably be expected to improve their performance without a significant period of additional clinical experience and focused remediation.

Numbers in the repeat assessment group have become gratifyingly low (range 2–8). University regulations and fairness to these students dictate that the repeat assessment examination should be of the same format and quality as the main examination. This makes it an expensive exercise for such a small group, and provides some additional challenges.

Discussion

psychometric expertise to ensure that students' scores reflect their true ability and that subsequent competence decisions are reasonably free from error.

The examination we have developed progressively over several years has followed published criteria for international best practice, based on evidence derived largely from generalizability studies, and on many years' practical experience in the methods used.

A number of quality-assurance procedures have been described to demonstrate how the assessment is subject to continual monitoring for potential sources of error, resulting in the introduction of several refinements. We endorse the process of blueprinting the assessment; providing well-defined training regimes for examiners and simulated patients; providing regular feedback to those involved in judging student performances; applying appropriate standard-setting procedures; attending to item development; and providing detailed statistical analyses of all scores.

In conclusion, we have demonstrated that a high-quality assessment of clinical competence can be conducted within the resources of a university medical school. If the relevant expertise is not available in-house, we would strongly suggest that such support be sought on a consultancy basis and that key staff be supported to attain the appropriate skills. This will reduce development time and costs, and ensure unnecessary mistakes are not made.
References

GENERAL MEDICAL COUNCIL (1993) Tomorrow's Doctors: Recommendations on Undergraduate Medical Education (London, GMC).
GENERAL MEDICAL COUNCIL (2003) Tomorrow's Doctors: Recommendations on Undergraduate Medical Education (London, GMC).
In: NEWBLE, D.I., JOLLY, B. & WAKEFORD, R. (Eds) (1994) The Certification and Recertification of Doctors: Issues in the Assessment of Clinical Competence (Cambridge, Cambridge University Press).
KRAMER, A., MUIJTJENS, A., JANSEN, K., DÜSMAN, H., TAN, L. & VAN DER VLEUTEN, C.P. (2003) Comparison of a rational and an empirical standard setting procedure for an OSCE, Medical Education, 37, pp. 132–139.
NEWBLE, D.I. & SWANSON, D.B. (1988) Psychometric characteristics of the objective structured clinical examination, Medical Education, 22, pp. 325–334.
NEWBLE, D.I. (2001) (Letter to the editor), Medical Education, 35, pp. 308–309.
NEWBLE, D.I. (2002) Assessing Clinical Competence at the Undergraduate Level, Medical Education Booklet No. 25 (Edinburgh, Association for the Study of Medical Education).
SHIPMAN INQUIRY (2005) Independent inquiry into the issues arising from the case of Harold Frederick Shipman. Available at: http://www.shipmaninquiry.org.uk/ (accessed February 2005).
SMEE, S.M. (2001) Setting standards for objective structured clinical examination: the borderline group method gains grounds on Angoff, Medical Education, 35, pp. 1009–1010.
SOUTHGATE, L., CAMPBELL, L., COX, J., FOULKES, J., JOLLY, B., MCCRORIE, P. & TOMBLESON, P. (2001) The General Medical Council's performance procedures: the development and implementation of tests of competence with examples from general practice, Medical Education, 35, pp. 20–28.
STREINER, D.L. & NORMAN, G.R. (2003) Health Measurement Scales: A Practical Guide to their Development and Use (Oxford, Oxford University Press).
SWANSON, D.B., NORMAN, G.R. & LINN, R.L. (1995) Performance-based assessment: lessons learnt from the health professions, Educational Researcher, 24, pp. 5–11.
TAMBLYN, R.M., ABRAHAMOWICZ, M., BRAILOVSKY, C. &