A Concise
Guide to
Clinical Trials
Allan Hackshaw
A John Wiley & Sons, Ltd., Publication
This edition first published 2009, © 2009 by Allan Hackshaw
BMJ Books is an imprint of BMJ Publishing Group Limited, used under licence by Blackwell
Publishing which was acquired by John Wiley & Sons in February 2007. Blackwell’s publishing
programme has been merged with Wiley’s global Scientific, Technical and Medical business to
form Wiley-Blackwell.
Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex,
PO19 8SQ, UK
Editorial offices: 9600 Garsington Road, Oxford, OX4 2DQ, UK
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
111 River Street, Hoboken, NJ 07030-5774, USA
For details of our global editorial offices, for customer services and for information about how
to apply for permission to reuse the copyright material in this book please see our website at
www.wiley.com/wiley-blackwell
The right of the author to be identified as the author of this work has been asserted in
accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording
or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without
the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks.
All brand names and product names used in this book are trade names, service marks,
trademarks or registered trademarks of their respective owners. The publisher is not associated
with any product or vendor mentioned in this book. This publication is designed to provide
accurate and authoritative information in regard to the subject matter covered. It is sold on the
understanding that the publisher is not engaged in rendering professional services. If
professional advice or other expert assistance is required, the services of a competent
professional should be sought.
The contents of this work are intended to further general scientific research, understanding,
and discussion only and are not intended and should not be relied upon as recommending or
promoting a specific method, diagnosis, or treatment by physicians for any particular patient.
The publisher and the author make no representations or warranties with respect to the
accuracy or completeness of the contents of this work and specifically disclaim all warranties,
including without limitation any implied warranties of fitness for a particular purpose. In view
of ongoing research, equipment modifications, changes in governmental regulations, and the
constant flow of information relating to the use of medicines, equipment, and devices, the
reader is urged to review and evaluate the information provided in the package insert or
instructions for each medicine, equipment, or device for, among other things, any changes in
the instructions or indication of usage and for added warnings and precautions. Readers
should consult with a specialist where appropriate. The fact that an organization or website is
referred to in this work as a citation and/or a potential source of further information does not
mean that the author or the publisher endorses the information the organization or website
may provide or recommendations it may make. Further, readers should be aware that Internet
websites listed in this work may have changed or disappeared between when this work was
written and when it is read. No warranty may be created or extended by any promotional
statements for this work. Neither the publisher nor the author shall be liable for any damages
arising herefrom.
ISBN: 978-1-4051-6774-1
A catalogue record for this book is available from the British Library.
Set in 9.5/12pt Palatino by Aptara® Inc., New Delhi, India
Printed and bound in Singapore
1 2009
Contents
Preface
Foreword
1 Fundamental concepts
2 Types of outcome measures and understanding them
3 Design and analysis of phase I trials
4 Design and analysis of phase II trials
5 Design of phase III trials
6 Randomisation
7 Analysis and interpretation of phase III trials
8 Systematic reviews and meta-analyses
9 Health-related quality of life and health economic evaluation
10 Setting up, conducting and reporting trials
11 Regulations and guidelines
Reading list
Statistical formulae for calculating some 95% confidence intervals
Index
Preface
Clinical trials have revolutionised the way disease is prevented, detected or
treated, and early death avoided. They continue to be an expanding area of
research. They are central to the work of pharmaceutical companies, which
cannot make a claim about a new drug or medical device until there is sufficient evidence on its efficacy. Trials originating from the academic or public
sector are more common because they also evaluate existing therapies in different ways, or interventions that do not involve a commercial product.
Many health professionals are expected to conduct their own trials, or to
participate in trials by recruiting subjects. They should have a sufficient understanding of the scientific and administrative aspects, including an awareness
of the regulations and guidelines associated with clinical trials, which are now
more stringent in many countries, making it more difficult to set up and run
trials.
This book provides a comprehensive overview of the design, analysis and
conduct of trials. It is aimed at health professionals and other researchers, and
can be used as an introduction to clinical trials, as a teaching aid, or as a reference guide. No prior knowledge of trial design or conduct is required because
the important concepts are presented throughout the chapters. References to
each chapter and a reading list are provided for those who wish to learn more.
Further details of trial set-up and conduct can also be found from country-specific regulatory agencies.
The contents have come about through over 18 years of teaching epidemiology and medical statistics to undergraduates, postgraduates and health professionals, and designing, setting up and analysing clinical studies for a variety of disorders. Sections of this book have been based on successful short
courses. This has all helped greatly in determining what researchers need to
know, and how to present certain ideas. The book should be an easy-to-read
guide to the topic.
I am most grateful to the following people for their helpful comments and
advice on the text: Dhiraj Abhyankar, Roisin Cinneide, Hannah Farrant, Christine Godfrey, Nicole Gower, Michael Hughes, Naseem Kabir, Iftekhar Khan,
Alicja Rudnicka, and in particular Roger A’Hern. Very special thanks go to
Jan Mackie, whose thorough editing was invaluable. And final thanks go to
Harald Bauer.
Allan Hackshaw
Deputy Director of the Cancer Research UK & UCL Cancer Trials Centre
Foreword
No one would doubt the importance of clinical trials in the progress and practice of medicine today. They have developed enormously over the last 60
years, and have made significant contributions to our knowledge about the
efficacy of new treatments, particularly in quantifying the magnitude of their
effects. Crucial in this development was the acceptance, albeit with considerable initial opposition, to randomisation – essentially tossing a coin to determine treatment allocation. Over the past 60 years clinical trials have become
highly sophisticated, in their design, conduct, statistical analysis and the processes required before new medicines can be legally sold. They have become
expensive, requiring large teams of experts covering pharmacology, mathematics, computing, health economics and epidemiology, to mention only a few. The systematic combination of the results from many trials to provide clearer answers, in the form of meta-analyses, has itself developed its own sophistication and importance.
In all this panoply of activity and complexity it is easy to lose sight of the
elements that form the basis of good science and practice in the conduct of
clinical trials. Allan Hackshaw, in this book, achieves this with great skill. He
informs the general reader of the essential elements of clinical trials; how they
should be designed, how to calculate the number of people needed for such
trials, the different forms of trial design, and importantly the recognition that
a randomised clinical trial is not always the right way to obtain an answer to
a particular medical question.
As well as dealing with the scientific issues, this book is useful in describing the terminology and procedures used in connection with clinical trials,
including explanations of phase I, II, III and IV trials. The book describes the
regulations governing the conduct of clinical trials and those that relate to
the approval and sale of new medicines – an area that has become extremely
complicated, with few people having a grasp of the “whole” picture.
This book educates the general medical and scientific reader on clinical trials without requiring detailed knowledge in any particular area. It provides
an up-to-date overview of clinical trials with commendable clarity.
Professor Sir Nicholas Wald
Director, Wolfson Institute of Environmental & Preventive Medicine
Barts and The London School of Medicine & Dentistry
CHAPTER 1
Fundamental concepts
This chapter provides a brief background to clinical trials, and why they are
considered to be the ‘gold standard’ in health research. This is followed by
a summary of the main types of trials, and four key design features. Further
details on design and analysis are given in Chapters 3–7.
1.1 What is a clinical trial?
There are two distinct study designs used in health research: observational
and experimental (Box 1.1). Observational studies do not intentionally involve
intervening in the way individuals live their lives, or how they are treated.
However, clinical trials are specifically designed to intervene, and then
evaluate some health-related outcome, with one or more of the following
objectives:
• to diagnose or detect disease
• to treat an existing disorder
• to prevent disease or early death
• to change behaviour, habits or other lifestyle factors.
Some trials evaluate new drugs or medical devices that will later require a
licence (or marketing authorisation) for human use from a regulatory authority, if a benefit is shown. This allows the treatment to be marketed and routinely available to the public. Other trials are based on therapies that are
already licensed, but will be used in different ways, such as a different disease group, or in combination with other treatments.
An intervention could be a single treatment or therapy, namely an administered substance that is injected, swallowed, inhaled or absorbed through the
skin; an exposure such as radiotherapy; a surgical technique; or a medical/
dental device. A combination of interventions can be referred to as a regimen,
such as chemotherapy plus surgery in treating cancer. Other interventions
could be educational or behavioural programmes, or dietary changes. Any
administered drug or micronutrient that is examined in a clinical trial with
the specific purpose of treating, preventing or diagnosing disease is usually
referred to as an Investigational Medicinal Product (IMP) or Investigational
Box 1.1 Study designs in health research
Observational
Cross-sectional: compare the proportion of people with the disorder among
those who are or are not exposed, at one point in time.
Case-control: take people with and without the disorder now, and compare the
proportions that were or were not exposed in the past.
Cohort: take people without the disorder now, and ascertain whether they happen to be exposed or not. Then follow them up, and compare the proportions
that develop the disorder in the future, among those who were or were not
exposed.
Semi-experimental
Trials with historical controls: give the exposure to people now, and compare
the proportion who develop the disorder with the proportion who were not
exposed in the past.
Experimental
Randomised controlled trial: randomly allocate people to have the exposure or
control now. Then follow them up, and compare the proportions that develop
the disorder in the future between the two groups.
An ‘exposure’ could be a new treatment, and those ‘not exposed’ or in a control group could have been given standard therapy.
New Drug (IND).# An IMP could be a newly developed drug, or one that
already is licensed for human use. Most clinical trial regulations that are part
of law in several countries cover studies using an IMP, and sometimes medical
devices.
Throughout this book, ‘intervention’, ‘treatment’ and ‘therapy’ are used
interchangeably. People who take part in a trial are referred to as ‘subjects’ or
‘participants’ (if they are healthy individuals), or ‘patients’ (if they are already
ill). They are allocated to trial or intervention arms or groups.
Well-designed clinical trials with a proper statistical analysis provide robust
and objective evidence. One of the most important uses of evidence-based
medicine is to determine whether a new intervention is more effective than
another, or that it has a similar effect, but is safer, cheaper or more convenient
to administer. It is therefore essential to have good evidence to decide whether
it is appropriate to change practice.
# IMP in the European Union, and IND in the United States and Japan.
World Health Organization definition of a clinical trial1,2
Any research study that prospectively assigns human participants or groups
of humans to one or more health-related interventions to evaluate the effects
on health outcomes.
Health outcomes include any biomedical or health-related measures
obtained in patients or participants, including pharmacokinetic measures and
adverse events.
1.2 Early trials
James Lind, a Scottish naval physician, is regarded as conducting the first
clinical trial.3 During a sea voyage in 1747, he chose 12 sailors with similarly
severe cases of scurvy, and examined six treatments, each given to two sailors:
cider, diluted sulphuric acid, vinegar, seawater, a mixture of several foods
including nutmeg and garlic, and oranges and lemons. They were made to
live in the same part of the ship and with the same basic diet. Lind felt it was important to standardise their living conditions to ensure that any change in their disease would be unlikely to be due to other factors. After about a week, both
sailors given fruit had almost completely recovered, compared to little or no
improvement in the other sailors. This dramatic effect led Lind to conclude
that eating fruit was essential to curing scurvy, without knowing that it was
specifically due to vitamin C. The results of his trial were supported by observations made by other seamen and physicians.
Lind had little doubt about the value of fruit. Two important features of his
trial were: a comparison between two or more interventions, and an attempt
to ensure that the subjects had similar characteristics. That the requirement
for these two features has not changed is an indication of how important they
are to conducting good trials that aim to provide reliable answers.
One key element missing from Lind’s trial was the process of randomisation, whereby the decision on which intervention a subject receives cannot be influenced by the researcher or subject. An early attempt to do this
appeared in a trial on diphtheria in 1898, which used day of admission to
allocate patients to the treatments.4 Those admitted on one day received the
standard therapy, and those admitted on the subsequent day received the
standard therapy plus a serum treatment. However, some physicians could
have admitted patients with mild disease on the day when the serum treatment would be given, and this could bias the results in favour of this treatment. The Medical Research Council trial of streptomycin for tuberculosis in
1948 is regarded as the first to use random numbers.5 Allocating subjects using
a random number list meant that it was not possible to predict what treatment
would be given to each patient, thus minimising the possibility of bias in the
allocation.
1.3 Why are research studies, such as clinical
trials, needed?
Smoking is a cause of lung cancer, and statin therapy is effective in treating
coronary heart disease. However, why do some people who have smoked 40
cigarettes a day for life not develop lung cancer, while others who have never
smoked a single cigarette do? Why do some patients who have had a heart
attack and been given statin therapy have a second attack, while others do not? The answer is that people vary. They have different body characteristics
(for example, weight, height, blood pressure and blood measurements),
different genetic make-up and different lifestyles (for example, diet, exercise,
and smoking and alcohol consumption habits). This is all referred to as variability or natural variation. People react to the same exposure or treatment
in different ways; what may affect one person may not affect another. When
a new intervention is evaluated, it is essential to consider if the observed
responses are consistent with this natural variation, or whether there really
is a treatment effect. Variability needs to be allowed for in order to judge how
much of the difference seen at the end of a trial is due to natural variation
(i.e. chance), and how much is due to the action of the new intervention. The
more variability there is, the harder it is to see if a new treatment is effective.
Detecting and measuring the effect of a new intervention in the setting of
natural variation is the principal concern of medical statistics, used to design
and analyse research studies.
Before describing the main design features of clinical trials, it is worth considering other types of studies that can assess the effectiveness of an intervention, and their limitations.
1.4 Alternatives to clinical trials
Evaluating a new intervention requires comparing it with another. This can
be done using a randomised clinical trial (RCT), observational study or trial
with historical controls (Box 1.1). Although observational studies need to be
interpreted carefully with regard to the design features and other influential
factors, their results could be consistent with those from an RCT. For example,
a review of 20 observational studies indicated that giving a flu vaccine to the
elderly could halve the risk of developing respiratory and flu-like symptoms.6
Practically the same effect was found in a large RCT.7
One of the main limitations of observational studies is that the treatment
effect could be larger than that found in RCTs or, worse still, a treatment effect
is found but RCTs show either no evidence of an effect, or that the intervention
is worse. An example of the latter is β-carotene intake and cardiovascular mortality. Combining the results from six observational studies indicated that people with a high β-carotene intake, by eating lots of fruit and vegetables, had
a much lower risk of cardiovascular death than those with a low intake (31%
reduction in risk).8 However, combining the results from four randomised trials showed that a high intake might increase the risk by 12%.8
Observational (non-randomised) studies
Observational studies may be useful in evaluating treatments with large
effects, although there may still be uncertainty over the actual size of the effect.
They can be larger than RCTs and therefore provide more evidence on side-effects, particularly uncommon ones. However, when the treatment effect is
small or moderate, there are potential design problems associated with observational studies that make it difficult to establish whether a new intervention
is truly effective. These are called confounding and bias.
Several observational studies have examined the effect of a flu vaccine
in preventing flu, respiratory disease or death in elderly individuals. Such
a study would involve taking a group of people aged over 60 years, then
ascertaining whether each subject had had a flu vaccine or not, and which
subsequently developed flu or flu-related illnesses. An example is given in
Figure 1.1.9 The chance of developing flu-like illness was lower in the vaccine
group than in the unvaccinated group: 21 versus 33%. But did the flu vaccine
really work?
The vaccinated group may be people who chose to go to their family doctor
and request the vaccine, or the doctor or carer recommended it, perhaps on the
basis of a perceived increased risk. Unvaccinated people could include those
who refused to be vaccinated when offered. It is therefore possible that people
who were vaccinated had different lifestyles and characteristics than unvaccinated people, and it is one or more of these factors that partly or wholly
explains the lower flu risk, not the effect of the vaccine.
Assume that vitamin C protects against acquiring flu. If people who choose
to have the vaccine also happen to eat much more fruit than those who are
unvaccinated, then a difference in flu rates would be observed (Table 1.1). The
difference of 5 versus 10% could be due to the difference in the proportion
of people who ate fruit (80 versus 15%). This is confounding. However, if
fruit intake had not been measured, it could be incorrectly concluded that the
difference in flu rates is due to one group being vaccinated and the other not.
When the association between an intervention (e.g. flu vaccine) and a disorder (e.g. flu) is examined, a spurious relationship could be created through a
third factor, called a confounder (e.g. eating fruit). A confounder is correlated
Figure 1.1 Example of an observational study of the flu vaccine.9
Table 1.1 Hypothetical observational study of the flu vaccine.

1000 people aged ≥60 years                        Vaccinated    Not vaccinated
                                                  N = 200       N = 800
Eat fruit regularly                               160 (80%)     120 (15%)
Developed flu 12 months after being vaccinated    10 (5%)       80 (10%)
with both the intervention and the disorder of interest. Confounding factors
are often present in observational studies. Even though there are methods of
design and analysis that can allow for their effects, there could exist unknown
confounders for which no adjustment can be made because they were not
measured.
There may also be a bias, where the actions of subjects or researchers produce a value of the trial endpoint that is systematically under- or over-reported
in one trial arm. In the example above, the clinician or carer could deliberately choose fitter people to be vaccinated, believing they would benefit the
most. The effect of the vaccine could then be over-estimated, because these
particular people may be less likely to acquire the flu than the less fit ones.
Confounding and bias could work together, in that both lead to an under- or over-estimate of the treatment effect, or they could work in opposite directions. It is difficult to separate their effects reliably (Box 1.2). Confounding is
sometimes described as a form of bias, since both distort the results. However, it is useful to distinguish them because known confounding factors can
be allowed for in a statistical analysis, but it is difficult to do so for bias.
Despite the potential design limitations of observational studies, they can
often complement results from randomised trials.10–14
Box 1.2 Confounding and bias
• Confounding represents the natural relationships between our physical and biochemical characteristics, genetic make-up, and lifestyle and habits that may affect how an individual responds to a treatment. It cannot be removed from a research study, but known confounders can be allowed for in a statistical analysis, and sometimes at the design stage (matched case-control studies).
• Bias is usually a design feature of a study that affects how subjects are selected for the study, treated, managed or assessed.
• It can be prevented, but human nature often makes this difficult.
• It is difficult, sometimes impossible, to allow for bias in a statistical analysis.
Randomisation, within a clinical trial, minimises the effect of confounding and bias on the results.
Figure 1.2 Comparison of survival in patients treated with shunt surgery (circles) and medical
management (squares). The solid lines are based on a review of five studies, comparing patients
treated with surgery at the time of the study, with those treated with medical management in the
past. The dashed lines are from a review of eight randomised controlled trials, in which patients
were randomly allocated to receive either treatment. The figure is based on information reported
in Sacks et al.15
Historical (non-randomised) controls
Studies using historical controls may be difficult to interpret because they
compare a group of patients treated using one therapy now, with those treated
using another therapy in the past. The difference in calendar period is likely to
have an effect because it may reflect possible differences in patient characteristics, methods of diagnosis or standards of care. Time would be a confounder.
In RCTs, subjects in the trial arms are prospectively followed up simultaneously, so changes over time should not matter. The following example illustrates how using historical controls can give the wrong conclusion.
Patients suffering from cirrhosis with oesophageal varices have dilated
sub-mucosal veins in the oesophagus. Figure 1.2 shows the summary results
on survival in patients treated with surgery (shunt procedures) or medical
management.15 Survival was substantially better in surgical patients in the
five studies that used historical controls, indicated by a large gap between
the solid survival curves. However, the eight RCTs showed no evidence of a
benefit; the dashed curves are close together. Survival was clearly poorest in
the historical control patients, and this could be due to lower standards of care
at that time.
1.5 A randomised trial may not always be the best
study design
Although a randomised controlled trial is an appropriate design for most
interventions, this is not always the case. When planning a study, initial
thought should be given to the disorder of interest, the intervention and any
information that could affect either how the study is conducted or the results.
The following example illustrates how a randomised trial could be inferior to
another design.
The UK National Health Service study on antenatal Down’s syndrome
screening was conducted between 1996 and 2000.16 Screening involves measuring several serum markers in the pregnant mother’s blood, which are used
to identify those with a high risk of carrying an affected foetus. The study
aimed to compare the second trimester Quadruple test (four serum markers
measured at 15–19 weeks of pregnancy) with the first trimester Combined
test (an ultrasound marker and two other serum markers measured at 10–14
weeks). The main outcome measure was the detection rate: the percentage
of Down’s syndrome pregnancies correctly identified by the screening test.
Women classified as high risk by the test would be offered an invasive diagnostic test to confirm or rule out an affected pregnancy.
At first glance, a randomised trial seems like the obvious design. Pregnant
women would be randomly allocated to have either the Combined test or the
Quadruple test. The detection rates in the two trial arms would then be compared. However, there are two major limitations with this approach:
Sample size. Preliminary studies suggested a detection rate of 85% for the
Combined test and 70% for the Quadruple test. To detect this difference
requires a sample size of 95 Down’s syndrome pregnancies in each arm. The
prevalence in the second trimester is about 1.7 per 1000 (0.0017), so 56 000
women would be needed in each arm (95/0.0017), or 112 000 in total. This
would be a very large study that may not be feasible in a reasonable timeframe.
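The arithmetic can be sketched in a few lines of Python (a minimal illustration only; the figures of 95 affected pregnancies per arm and a prevalence of 1.7 per 1000 come from the text above, and the variable names are my own):

import math

# Figures from the text: 95 Down's syndrome pregnancies are needed per arm,
# and the second trimester prevalence is about 1.7 per 1000 pregnancies.
cases_needed_per_arm = 95
prevalence = 1.7 / 1000  # 0.0017

women_per_arm = math.ceil(cases_needed_per_arm / prevalence)  # ~56 000
total_women = 2 * women_per_arm                               # ~112 000
print(f"{women_per_arm:,} women per arm; {total_women:,} in total")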
Bias. About 25% of Down’s syndrome pregnancies miscarry naturally
between the first and second trimesters of pregnancy. In a randomised trial
there would be an expected 127 cases seen in the first trimester and 95 in the
second trimester. The problem is that the Combined test group would include
affected foetuses destined to miscarry, while the Quadruple test group has
already had these miscarriages excluded, because a woman allocated to have
this test but who miscarried at 12 weeks would clearly not be screened in
the second trimester. The comparison of the two screening tests would not be
comparing like with like, and it can be shown that the detection rate for the
Combined test would be biased upwards.
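The expected case numbers quoted above follow directly from the 25% miscarriage rate; as a rough check (figures from the text, names illustrative):

# If ~25% of affected pregnancies miscarry between the first and second
# trimesters, a second trimester arm with 95 cases corresponds to a first
# trimester arm with 95 / 0.75 cases.
miscarriage_rate = 0.25
second_trimester_cases = 95
first_trimester_cases = second_trimester_cases / (1 - miscarriage_rate)
print(round(first_trimester_cases))  # 127, as quoted in the text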
A better design is an observational study where both screening tests can
be compared in the same woman, which is what happened.16 Women had
an ultrasound during the first trimester and gave a blood sample in both
trimesters, but the Combined or Quadruple test markers were not measured
or examined until the end of the study (no intervention based on these
results); women just received the standard second trimester test according
to local policy, the result of which was reported and acted upon. This design
avoids the miscarriage bias because only Down’s syndrome pregnancies
during or after the second trimester were known and included in the analysis.
The comparison of the Combined and Quadruple tests was thus based on
the same group of pregnancies. Furthermore, because each woman had
both tests, a within-person statistical analysis could be performed, and this
required only half the number needed compared to a randomised two-arm
trial (56 000 instead of 112 000).
1.6 Types of clinical trials
Clinical trials have different objectives. The methods for designing and
analysing clinical trials can be applied to experiments on almost any object,
for example, animals or cells, as well as humans. They can be broadly categorised into four types (Phase I, II, III or IV), largely depending on the main
aim (Box 1.3).
Phase I trials
After a new drug is tested in animal experiments, it is given to humans.
Phase I trials are therefore often referred to as ‘first in man’ studies. They
are used to examine the pharmacological actions of the new drug (i.e. how
Box 1.3 Types of trials

Phase I
• First time a new drug or regimen is tested on humans
• Few participants (say <30)
• Primary aims are to find a dose with an acceptable level of safety, and examine the biological and pharmacological effects

Phase II
• Not too large (say 30–70 people)
• Aim is to obtain a preliminary estimate of efficacy
• Not designed to determine whether a new treatment works
• Produces data in each of the trial arms that could be used to design a phase III trial

Phase III
• Must be randomised and with a comparison (control) group
• Relatively large (usually several hundred or thousand people)
• Aim is to provide a definitive answer on whether a new treatment is better than the control group, or is similarly effective but there are other advantages

Phase IV
• Relatively large (usually several hundred or thousand people)
• Used to continue to monitor efficacy and safety in the population once the new treatment has been adopted into routine practice.
it is processed in the body), but also to find a dose level that has acceptable
side-effects. They may provide early evidence on effectiveness.
Phase I trials are typically small, often less than 30 individuals, and based
on healthy volunteers. An exception may be in trials in specialties where the
intervention is expected to have side effects, so it is inappropriate to give it to
healthy people, but rather those who already have the disorder of interest (e.g.
cancer). Subjects are closely monitored. Phase I studies may be conducted in a
short space of time, with few recruiting centres, depending on how common
the disease is and the type of intervention. There may be several phase I trials,
and if the results are favourable, they are used to design a phase II trial. Many
new drugs are not investigated further.
Phase II trials
The aim of a phase II study is to obtain a preliminary assessment of efficacy
in a group of subjects that is not large, say less than 100 and often around 50.
These trials can be conducted relatively quickly, without spending too many
resources (participants, time and money) on something that may not work. As
in phase I studies, participants are closely monitored for safety.
A phase II study could have several new treatments to examine. There could
also be a control arm in which subjects are given standard therapy, for example when the disease of interest is relatively uncommon, so there is uncertainty over the effect of the standard therapy. If the results are positive, the data in each
arm are used to design a randomised phase III trial, for example estimating
sample size. When there is more than one intervention, it is best, though not
absolutely necessary, to randomise subjects to the trial groups. The advantages
of randomising are given on page 12. A randomised phase II study could also
provide information on the feasibility of a subsequent phase III trial, such as
how willing subjects are to be randomised.
Phase III trials
A phase III trial is commonly referred to as a randomised controlled trial
(RCT). Subjects must be randomly allocated to the intervention groups, and
there must be a control (comparison). The aim is to provide a definitive
answer on whether a new intervention is better than the control, or sometimes
whether they have a similar effect. Sometimes, there are more than two new
interventions. Phase III studies are often large, involving several hundred or
thousand people. Results should be precise and robust enough to persuade
health professionals to change practice. The larger the trial, the more reliable
the conclusions. The size of these trials, and the need for several recruiting
centres, mean that they can take several years to complete.
There is sometimes a misunderstanding that a randomised phase II trial is
a quick randomised phase III trial, but they have quite different purposes.
A randomised phase II study is not usually designed for a direct statistical
comparison of the trial endpoint between the two interventions, and this is
reflected in the smaller sample size. Therefore, the results cannot be used
to make a reliable conclusion on whether the new intervention is better.
However, a phase III trial is designed for a direct comparison, allowing a full
evaluation of the new intervention and, usually, a definitive conclusion.#
Phase III trials should be designed and conducted to a high standard, with
precise quantitative results on efficacy and safety. This can be particularly
important for pharmaceutical companies who wish to obtain a marketing
licence from a regulatory agency for a new drug or medical device, which
normally requires extensive data before a licence is granted. Trials used in this
way can be referred to as pivotal trials.
Phase IV trials
These are sometimes referred to as post-marketing or surveillance studies.
Once a new treatment has been evaluated using a phase III trial and adopted
into clinical practice, some organisations (usually the pharmaceutical industry) continue to monitor the efficacy and safety of the new intervention.
Because several thousand people could be included, phase IV studies may
be useful in identifying uncommon adverse effects not seen in the preceding
phase III trials. They are also based on subjects in the general target population, rather than the selected group of subjects who agree to participate in a
phase III trial. However, phase IV studies are not as common as the other trial
types, particularly in the academic or public sector. Comparisons can sometimes only be made with historical controls or groups of people (non-users of
the new drug) who are likely to have different characteristics. Because of this,
phase IV studies are not discussed in further detail in this book, though the
methods of analysis for phase III trials can be used.
1.7 Four key design features
The study population of all types of clinical trials must be defined by the
inclusion and exclusion criteria. The strength of randomised phase II and III
trials comes from three further design features: control, randomisation and
blinding.
Inclusion and exclusion criteria
It is necessary to specify which participants are recruited. This is done using a
set of inclusion and exclusion criteria (or eligibility list), which each subject
has to fulfil before entry. Every trial will have its own criteria depending on the
objectives, and this may include an age range, having no serious co-morbid
conditions, the ability to obtain consent, and that subjects have not previously
taken the trial treatment. They should have unambiguous definitions to make
recruiting subjects easier.
# Some researchers design a study as if it were a phase III trial, but using a one-sided test
with a permissive level of statistical significance ≥10% (see Chapter 5) and usually a
surrogate endpoint (see Chapter 2). It is however referred to as a randomised phase II trial.
The description of randomised phase II studies given in this book is the one preferred here.
Table 1.2 Hypothetical example of inclusion and exclusion criteria for a trial of a new drug for preventing stroke.

Narrow set of criteria
Inclusion: Male; Age 50 to 55 years; Never-smoker
Exclusion: History of heart disease or stroke; History of cancer; Female; Ex and current smokers; Unable to give informed consent; Family history of heart disease; Average alcohol intake <2 units per day

Wide set of criteria
Inclusion: Male or female; Age 45 to 85 years
Exclusion: Unable to give informed consent
Determining the eligibility criteria necessitates balancing the advantages
and disadvantages of having a highly selected group against those associated with including a wide variety of subjects. Having many criteria that are narrow (Table 1.2) produces a group in which there should be relatively
little variability. Subjects are more likely to respond to the treatment in a similar manner, and this makes it easier to detect an effect if it exists, especially if
the effect is small or moderate. However, the trial results may only apply to
a small proportion of the population, and so may not be easily generalisable.
A trial with few criteria that are wide (Table 1.2) will have a more general
application, but the amount of variability is expected to be high. This could
make it more difficult to show that the treatment is effective. When there is
much variability, sometimes only large effects can be detected easily.
Control group
The outcome of subjects given the new intervention is always compared with
that in a group who are not receiving the new intervention. A control group
normally receives the current standard of care, no intervention or placebo
(see Blinding below). Treatment effects from randomised trials are therefore
always relative. The choice of the control intervention depends on the availability of alternative treatments. When an established treatment exists, it is
unethical to give a placebo instead because this deprives some subjects of a
known health benefit.
Randomisation
In order to attribute a difference in outcome between two trial arms to the
new treatment being tested, the characteristics of people should be similar
between the groups. In the hypothetical example of the flu vaccine (Table 1.1),
Box 1.4 Randomisation
• Randomly allocating subjects produces groups that are as similar as possible with regard to all characteristics except the trial interventions
• The only systematic difference between the two arms should be the treatment given
• Therefore, any differences in results observed at the end of the trial should be due to the effect of the new treatment, and not to any other factors (or differences in characteristics have not spuriously produced a treatment effect, when the aim is to show that the interventions have a similar effect).
the difference in flu risk at the end of the trial could be due to the difference
in those who ate fruit regularly (confounding), not the vaccine. Randomly
allocating patients to the trial arms means that any difference in outcome at
the end of the trial should be due to the new treatment being tested, and not
any other factor (Box 1.4).
Randomisation is a process for allocating subjects between the different
trial interventions. Each subject has the same chance of being allocated to
any group, which ensures similarity in characteristics between the arms. This
minimises the effect of both known and unknown confounders, and thus has
a distinct advantage over observational studies in which statistical adjustments can only be made for known confounders. Although randomisation is
designed to produce groups with similar characteristics, there will always be
small differences because of chance variation. Randomisation cannot produce
identical groups.
Randomisation also minimises bias. If either the researcher or trial subject
is allowed to decide which intervention is allocated, then subjects with a certain characteristic, for example, those who are younger or with less severe
disease, could be over-represented in one of the trial arms. This could produce a bias which makes the new intervention look effective when it really is
not, or over-estimate the treatment effect. Selection bias can occur if the choice of a particular subject for the trial is influenced by knowing the next treatment allocation. Allocation bias involves giving the trial treatment that the
clinician or subject feels might be most beneficial. Sometimes, the researcher
has access to the list of randomisations from which the next allocation can be
seen, possibly creating allocation bias. This can be avoided if randomisation
is done through a central office (for example, a clinical trials unit) or a computer system, because the researcher has no control over either process (called
allocation concealment).
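As a rough illustration of simple randomisation with a centrally generated list (a sketch only; Chapter 6 covers randomisation methods in detail, and the function below is illustrative rather than taken from the book):

import random

def allocation_list(n_subjects, arms=("control", "new treatment"), seed=2009):
    """Simple randomisation: each subject has the same chance of being
    allocated to any arm, independently of all other subjects."""
    rng = random.Random(seed)  # a fixed seed lets the central office audit the list
    return [rng.choice(arms) for _ in range(n_subjects)]

# The list is generated and held centrally (allocation concealment): the
# recruiting clinician learns each allocation only after entering the subject.
print(allocation_list(10))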
Blinding
The randomisation process minimises the potential for bias, but the benefit
could be greater if the trial intervention given to each subject is concealed.
Subjects or researchers may have expectations associated with a particular
treatment, and knowing which was given can create bias. This can affect how
people respond to treatment, or how the researcher manages or assesses the
subject. In subjects, this bias is specifically referred to as the placebo effect.
Humans have a remarkable psychological ability to affect their own health
status. The effect of any of these biases could result in subjects receiving the
new intervention appearing to do better than those on the control treatment,
but the difference is not really due to the action of the new treatment.
Clinical trials are described as double-blind if neither the subject nor anyone involved in giving the treatment, or managing or assessing the subject, is
aware of which treatment was given. In single-blind trials, usually only the
subject is blind to the treatment they have received (see also page 61).
A placebo has no known active component. It is often referred to as a
‘sugar pill’ because many treatment trials involve swallowing tablets. However, a placebo could also be a saline injection, a sham surgical procedure,
sham medical device or any other intervention that is meant to resemble the
test intervention, but has no known effect on the disease of interest, and no
adverse effect. A recent example was based on patients with osteoarthritis of
the knee who often undergo surgery (arthroscopic lavage or débridement).
There were more than 650 000 procedures each year in the USA around 2002.
However, a randomised trial,17 comparing these two surgical procedures with
sham surgery (skin incision to the knee) provided no evidence that these procedures reduced knee pain. This trial was justified on the basis that patients
in uncontrolled studies reported less pain after having the procedure despite
there being no clear biological reason for this.
Using placebos needs to be fully justified in any clinical trial. While there
are some arguments against placebos such as sham surgery, these trials can
provide valuable evidence on the effectiveness of a new intervention. They
can be conducted as long as there is ethical approval, and patients are fully
aware that they may be assigned to the sham group.
When it is not possible to conceal the trial interventions, an outcome measure that does not depend on the personal opinion of the subject or researcher
is best. For example, in a trial evaluating hypnotherapy for smoking cessation,
a subjective measure would be to ask the subjects if they stopped smoking at,
say, 1 year. However, there could be some continuing smokers who misreport
their smoking status. An objective endpoint would be to measure serum or
urinary cotinine, as a marker of current smoking status, because this is specific
to tobacco smoke inhalation, and so less prone to bias than a questionnaire on
self-reported habits.
1.8 Small trials
Trials with a small number of subjects can be quick to conduct with regard
to enrolling patients, performing biochemical analyses, or asking subjects to
complete study questionnaires. A possible advantage is, therefore, that the
research question could be examined in a relatively short space of time. Furthermore, small studies are usually only conducted across a few centres, so
obtaining all ethical and institutional approvals should be quicker compared
to large multi-centre studies.
It is often useful to examine a new intervention in a few subjects first (as in
a phase II trial). This avoids spending too many resources, such as subjects,
time and financial costs, on looking for a treatment effect when there really is
none. However, if a positive result is found it is important to make clear in the
conclusions that a larger confirmatory study is needed.
The main limitation of small trials is in interpreting their results, in particular confidence intervals and p-values (Chapter 7). They can often produce
false-positive results or over-estimate the magnitude of the treatment benefit. Overly small trials may yield results that are too unreliable and therefore
uninformative. While there is nothing wrong with conducting well-designed
small studies, they must be interpreted carefully, without making strong
conclusions.
1.9 Summary points
• Clinical trials are essential for evaluating new methods of disease detection, prevention and treatment
• Observational studies can provide useful supporting evidence on the effectiveness of an intervention
• Clinical trials, especially when randomised, are considered to provide the strongest evidence
• Randomisation minimises the effect of confounding and bias, and blinding further reduces the potential for bias.
Key design features of clinical trials
1. Inclusion and exclusion criteria
2. Controlled (comparison/control arm)
3. Randomisation
4. Blinding (using placebo)
References
1. Laine C, Horton R, DeAngelis CD et al. Clinical trial registration: looking back and moving ahead. Ann Intern Med 2007; 147(4):275–277.
2. World Health Organization. International Clinical Trials Registry Platform. http://www.who.int/ictrp/about/details/en/index.html
3. http://www.jameslindlibrary.org/trial_records/17th_18th_Century/lind/lind_tp.html
4. Hróbjartsson A, Gøtzsche PC, Gluud C. The controlled clinical trial turns 100 years: Fibiger's trial of serum treatment of diphtheria. BMJ 1998; 317:1243–1245.
5. Medical Research Council. Streptomycin treatment of pulmonary tuberculosis. BMJ 1948; 2:769–782.
6. Gross PA, Hermogenes H, Sacks HS, Lau J, Levandowski RA. The efficacy of influenza vaccine in elderly persons. Ann Intern Med 1995; 123:518–527.
7. Govaert TME, Thijs CTMCN, Masurel N et al. The efficacy of influenza vaccination in elderly individuals. JAMA 1994; 272(21):1661–1665.
8. Egger M, Schneider M, Davey Smith G. Meta-analysis: spurious precision? Meta-analysis of observational studies. BMJ 1998; 316:140–144.
9. Patriarca PA, Weber JA, Parker RA et al. Efficacy of influenza vaccine in nursing homes: reduction in illness and complications during an influenza A (H3N2) epidemic. JAMA 1985; 253:1136–1139.
10. Benson K, Hartz AJ. A comparison of observational studies and randomised controlled trials. N Engl J Med 2000; 342:1878–1886.
11. Concato J, Shah N, Horwitz RI. Randomized controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med 2000; 342:1887–1892.
12. Pocock SJ, Elbourne DR. Randomized trials or observational tribulations? N Engl J Med 2000; 342:1907–1909.
13. Collins R, MacMahon S. Reliable assessment of the effects of treatment on mortality and major morbidity, I: clinical trials. Lancet 2001; 357:373–380.
14. MacMahon S, Collins R. Reliable assessment of the effects of treatment on mortality and major morbidity, II: observational studies. Lancet 2001; 357:455–462.
15. Sacks H, Chalmers TC, Smith H. Randomized versus historical controls for clinical trials. Am J Med 1982; 72:233–240.
16. Wald NJ, Rodeck CH, Hackshaw AK et al. First and second trimester antenatal screening for Down's syndrome: the results of the Serum, Urine and Ultrasound Screening Study (SURUSS). Health Technology Assessment 2003; 7(11).
17. Moseley JB, O'Malley K, Petersen NJ et al. A controlled trial of arthroscopic surgery for osteoarthritis of the knee. N Engl J Med 2002; 347(2):81–88.
CHAPTER 2
Types of outcome measures and
understanding them
When statin therapy was first shown to be an effective treatment for preventing heart disease, it would not have been sufficient just to say ‘statins are effective’. This statement is unclear. What does ‘effective’ actually mean? It could
be a reduction in the chance of having a first coronary event, a reduction in
the chance of having a subsequent coronary event in those who have already
suffered one, a reduction in serum cholesterol, or a reduction in the chance of
dying. Each of these is an outcome measure or endpoint, and when they are
clearly defined they contribute not only to the appropriate design of a clinical
trial, but also to an easier and clearer interpretation of the results.
2.1 ‘True’ versus surrogate outcome measures
Some outcome measures have an obvious and direct clinical relevance to participants, for example, whether they:
• Live or die
• Develop a disorder or not
• Recover from a disease or not
• Change their lifestyle or habits (e.g. stopped smoking)
• Have a change in body weight
A clear impact of statins is evident in a clinical trial using the outcome measure ‘coronary event or no coronary event’. Death, occurrence of a disease, and
other similar measures are sometimes referred to as ‘true’ outcomes or endpoints. For several disorders there is the concept of a surrogate endpoint.1–3
These are measures that do not often have an obvious impact that subjects are
able to identify. They are usually assumed to be a precursor to the true outcome, i.e. they lie along the causal pathway. Surrogate markers can be a blood measurement, or something examined by medical imaging tests (Box 2.1).
Sometimes, a trial would have to be impractically large, or take many years
to conduct, because a true endpoint would have too few events to allow a reliable evaluation of the intervention. A surrogate marker is attractive because
Box 2.1 Examples of true and surrogate trial endpoints

Surrogate endpoint                          True endpoint
Cholesterol level                           Heart attack or death from heart attack
Blood pressure                              Stroke or death from stroke
Tumour response (partial or complete        Survival
  remission of tumour)
Time to cancer progression                  Survival
Tooth pocket depth or attachment level      Tooth loss (in periodontitis)
CD4 count                                   Death from AIDS
Total brain volume                          Progression of Alzheimer's disease
Hippocampal volume                          Progression of Alzheimer's disease
Loss of dopaminergic neurons                Progression of Parkinson's disease
Intra-ocular pressure                       Glaucoma
there are more events, possibly in a shorter space of time, so trials could be conducted more quickly or with fewer subjects, thus saving resources. Using a
surrogate might be the only feasible option to evaluate a new potential treatment. The surrogate and true endpoints need to be closely correlated: a change
in the surrogate outcome measure now is likely to produce a change in a more
clinically important outcome, such as death or prevention of a disorder, later.
Studies that show this validate the surrogate marker.
Statin therapy reduces serum cholesterol levels, which in turn reduces the
risk of a heart attack. Cholesterol is therefore an accepted surrogate endpoint
when examining some therapies for coronary heart disease; a claim of benefit for a new drug could come from a randomised trial in which cholesterol
levels have been significantly reduced. In other diseases, it is difficult to find
good surrogates. For example, tumour response# does not correlate well with
survival in several cancers, such as advanced breast cancer. Therefore, while
tumour response can provide useful information on the biological course of a
cancer, and be used in phase I or II studies, it would not be the main endpoint
in a phase III trial evaluating a new therapy.
It is essential to consider whether the measure used in a particular study
is meaningful and appropriate for addressing the primary objectives. There
is sometimes a danger that the true endpoint is not investigated thoroughly,
# Defined as a partial and/or complete response, in which the tumour has substantially
reduced in size or disappeared clinically.
and it can be hard to arrive at firm conclusions on the effectiveness of a new
treatment when the evidence is based solely on surrogate measures. When
evaluating a new drug or medical device, it might be useful to check with the
regulatory authority that a proposed surrogate marker is acceptable. While
surrogate measures are commonly investigated in early phase trials (phase I
and II), their use in confirmatory phase III trials needs careful consideration
and validation.
2.2 Types of outcomes
Outcome measures fall into two basic categories: counting people and taking
measurements on people. There is a special case of ‘taking measurements’
that is based on time-to-event data. It is useful to distinguish between them
because it helps to define the trial objectives, and methods of sample size calculation and statistical analysis. First, the unit of interest is determined, usually a person. Second, consider what will be done to the unit of interest. The
outcome measure will involve either counting how many people have a particular characteristic (i.e. put them into mutually exclusive groups, such as
‘dead’ or ‘alive’), or taking measurements on them. In some situations, taking a measurement on someone involves counting something, but the unit of
interest is still a person. Box 2.2 shows examples of outcome measures.
Having measured the endpoint for each trial subject it is necessary to
summarise the data in a form that can be readily communicated to others.
Box 2.2 Examples of outcome measures when the unit of interest is
a person
Counting people (binary or categorical data)
Dead or alive
Admitted to hospital (yes or no)
Suffered a first heart attack (yes or no)
Recovered from disease (yes or no)
Severity of disease (mild, moderate, severe)
Ability to perform household duties (none, a little, some, moderate, high)
Taking measurements on people (continuous data)
Blood pressure
Body weight
Cholesterol level
Size of tumour
White blood cell count
Number of days in hospital
Number of units of alcohol intake per week
Further details can be found in books on medical statistics (see reading list on
page 203).
Types of outcome measures
After defining the health outcome for a trial, what is to be done to the unit of
interest, i.e. people?
• Count people, i.e. how many have the health outcome of interest
• Take measurements on people
• Time-to-event measures.
2.3 Counting people
This type of outcome measure is easily summarised by calculating the percentage or proportion. For example, the effect of a flu vaccine can be examined
by counting how many developed flu in the vaccinated group, and dividing
this number by the total number of patients in that group. This proportion
(or percentage) is the risk, i.e. the risk of developing flu if vaccinated. The
same calculation is made in the unvaccinated group, i.e. the risk of developing flu if not vaccinated. In Figure 1.1 (page 5), the two risks are 21 and 33%.
The word ‘risk’ implies something negative, but it could be used for any outcome that involves counting people, for example, the risk of being alive after
5 years.
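To make the arithmetic concrete, here is a minimal sketch in Python (not from the book); the group sizes are hypothetical, chosen so that the risks match the 21% and 33% quoted above:

```python
def risk(events, total):
    """Proportion of subjects with the outcome of interest."""
    return events / total

# Hypothetical group sizes chosen so the risks match the quoted 21% and 33%
risk_vaccinated = risk(events=21, total=100)
risk_unvaccinated = risk(events=33, total=100)

print(f"{risk_vaccinated:.0%}")    # 21%
print(f"{risk_unvaccinated:.0%}")  # 33%
```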
[Figure 2.1 Histogram of the cholesterol values in 40 men, with a superimposed Normal distribution curve. x-axis: cholesterol (mmol/L), 2.5 to 11.5; y-axis: percent.]
2.4 Taking measurements on people
This type of outcome measure will vary between people. Consider the following cholesterol levels (mmol/L) for 40 healthy men, all aged 45 years (ranked
in order of size):
3.6   3.8   3.9   4.1   4.2   4.5   4.5   4.8   5.1   5.3
5.4   5.4   5.6   5.8   5.9   6.0   6.1   6.1   6.2   6.3
6.4   6.5   6.6   6.8   6.9   7.1   7.2   7.2   7.3   7.4
7.5   7.7   8.0   8.1   8.1   8.2   8.3   9.0   9.1   10.0
These data are summarised by two parameters: the ‘average’ level and a measure of spread or variability. The average, often referred to as a measure of
central tendency, can be described by either the mean or median. It is where
the middle of the distribution lies. The mean is more commonly reported and
often taken to be the same as the average. Another measure of average is
the mode – the most frequently occurring value – but there are few instances
where this is the best summary measure.
The mean is the sum of all the values divided by the number of observations. In the example above, the mean is 256/40 = 6.4 mmol/L. The median is
the value that has half the observations above it and half below. In the example, it is halfway between the 20th and 21st values; median = (6.3 + 6.4)/2 =
6.35 mmol/L.
One measure of spread is the standard deviation (Box 2.3). It quantifies the
amount of variability in a group of people, i.e. how much the data spread
about the mean. It is calculated as:

standard deviation = √[ sum of (each data point − mean)² / (number of data values − 1) ]
In the example, the standard deviation is 1.57 mmol/L: the cholesterol levels
differ from the mean value of 6.4 by, on average, 1.57 mmol/L.
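As an illustration (not from the book), these summary statistics can be reproduced with Python's standard library; note that `stdev` uses the n − 1 divisor described above:

```python
from statistics import mean, median, stdev

cholesterol = [
    3.6, 3.8, 3.9, 4.1, 4.2, 4.5, 4.5, 4.8, 5.1, 5.3,
    5.4, 5.4, 5.6, 5.8, 5.9, 6.0, 6.1, 6.1, 6.2, 6.3,
    6.4, 6.5, 6.6, 6.8, 6.9, 7.1, 7.2, 7.2, 7.3, 7.4,
    7.5, 7.7, 8.0, 8.1, 8.1, 8.2, 8.3, 9.0, 9.1, 10.0,
]

print(mean(cholesterol))    # 6.4 mmol/L
print(median(cholesterol))  # 6.35 mmol/L (midway between the 20th and 21st values)
print(stdev(cholesterol))   # ~1.57 mmol/L (uses the n - 1 divisor, as above)
```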
Box 2.3 Illustration of standard deviation for five values

Cholesterol (mmol/L)               4.5     4.9     5.5     5.7     6.2
Difference from the mean (5.36)   −0.86   −0.46   +0.14   +0.34   +0.84
Squared difference                 0.74    0.21    0.02    0.12    0.70

The differences from the mean sum to zero, so the differences are squared.
Sum of the squared differences = 1.79
Divide by the number of observations minus 1: 1.79/(5 − 1) = 0.45
Take the square root to get the standard deviation on the original scale: √0.45 = 0.67 mmol/L
Table 2.1 Frequency distribution of cholesterol levels of a sample of 40 men (page 21).

Cholesterol (mmol/L)   Number of men   Percentage
3.0–3.9                 3                7.5
4.0–4.9                 5               12.5
5.0–5.9                 7               17.5
6.0–6.9                10               25.0
7.0–7.9                 7               17.5
8.0–8.9                 5               12.5
9.0–9.9                 2                5.0
10.0–10.9               1                2.5
Total                  40              100.0
Another measure of spread is the interquartile range. This is the difference
between the 25th centile (the value that has a quarter of the data below it and
three-quarters above it) and the 75th centile (the value that has three-quarters
of the data below it and a quarter above it). In the example, there are 40
observations so the 25th centile is between the 10th and 11th data points (i.e.
5.32 mmol/L) and the 75th centile is between the 30th and 31st data points
(i.e. 7.47 mmol/L).# The interquartile range is therefore 7.47 − 5.32 = 2.15
mmol/L. Sometimes, the actual 25th and 75th centiles are presented instead
of the interquartile range.
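For illustration, here is a small sketch of the (n + 1)/4 interpolation rule described in the footnote on the centile calculation, reusing the `cholesterol` list from the previous sketch (other centile conventions exist, so statistical packages may give slightly different values):

```python
def centile(sorted_values, fraction):
    """Interpolated centile using the (n + 1) x fraction position rule."""
    pos = (len(sorted_values) + 1) * fraction    # e.g. the 10.25th observation
    lower = int(pos)                             # position of the value below (1-based)
    weight = pos - lower                         # e.g. 0.25
    return (sorted_values[lower - 1]
            + weight * (sorted_values[lower] - sorted_values[lower - 1]))

q25 = centile(cholesterol, 0.25)  # 5.325 mmol/L
q75 = centile(cholesterol, 0.75)  # 7.475 mmol/L
print(q75 - q25)                  # interquartile range, 2.15 mmol/L
```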
Deciding which measures of average and spread to use depends on whether
the distribution is symmetric or not. To help determine this, the data is
grouped into categories of cholesterol levels and the frequency distribution
is examined (Table 2.1). These proportions are used to create a histogram (the
shaded boxes in Figure 2.1). The shape is reasonably symmetric, indicating
that the distribution is Gaussian or Normal (‘N’ is in capital letters to avoid
confusion with the usual definition of the word normal, which can indicate
people without disease). This is more easily visualised by drawing a curve
around the histogram (Figure 2.1), which is said to be bell-shaped.
When data are Normally distributed, the mean and median are similar. The
preferred measures of average and spread are the mean and standard deviation, because they have useful mathematical properties which underlie many
statistical methods used to analyse this type of data. When the data are not
Normally distributed, the median and interquartile range are better measures.
To understand why, consider the outcome measure ‘number of days in hospital’
for 20 patients. The histogram is given in Figure 2.2. It is clear that the distribution is not symmetric. It is skewed to the right (this is where the tail of the
data is). When most of the data are towards the right, the distribution is said
to be skewed to the left.
# The 25th centile is the point at (n + 1)/4, i.e. the 10.25th observation. This is between the
10th and 11th value, i.e. 5.3 and 5.4, and found by adding 0.25 × difference between these
two observations (0.1) to 5.3. So the 25th centile is 5.3 + 0.025 = 5.325. A similar calculation
is made to obtain the 75th centile.
[Figure 2.2 Histogram of the length of hospital stay for 20 patients. x-axis: number of days in hospital, 0 to 80; y-axis: percent.]
The summary statistics that describe this data are:
Mean = 17 days
Median = 9 days
Standard deviation = 19 days
Interquartile range = 8 days
The middle of the data, and spread, are better represented by the median and
interquartile range. The mean and standard deviation are heavily influenced
by the few very high values.
When data are skewed it is sometimes possible to transform it, usually by
taking logarithms or the square root. Many biological measurements only
have a Normal (symmetric) distribution after the logarithm is taken, so using
the log of the values would produce a histogram that has a similar shape to
that in Figure 2.1. The mean is calculated using the log of the values, and the
result is back-transformed to the original scale, though this cannot be done
with standard deviation. For example, if the mean of the transformed values is 0.81, using log to the base 10, the calculation 10^0.81 = 6.5 produces the
mean value on the original scale. This is called a geometric mean. Sometimes
no transformation is possible that will turn a skewed distribution into a Normal one. In these situations, the median and interquartile range should be
used.
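As a rough sketch (the five values here are hypothetical), the geometric mean is simply the mean on the log scale, back-transformed:

```python
import math

values = [3.2, 4.8, 6.5, 8.9, 12.1]           # hypothetical skewed measurements

log_values = [math.log10(v) for v in values]  # work on the log10 scale
mean_log = sum(log_values) / len(log_values)
geometric_mean = 10 ** mean_log               # back-transform, as in 10^0.81 = 6.5

print(round(geometric_mean, 1))               # ~6.4 for these values
```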
A probability (or centile) plot# can be used to determine whether data are
Normally distributed or not. Many statistical software packages can provide
this. Figure 2.3 is an example using the 40 cholesterol measurements above.
If the observations lie reasonably along a straight line, the data are Normally
distributed. Another simple check is to examine whether the mean ± 2 ×
standard deviation produces sensible numbers. For the hospital stay data in
Figure 2.2, this would be 17 ± (2 × 19) days: the lower limit of −21 days is
implausible.

# Textbooks listed on page 203 can provide a technical description of how the plot is
obtained, but what is useful here is how to interpret it.

[Figure 2.3 Normal probability plot for the 40 cholesterol measurements on page 22. x-axis: cholesterol (mmol/L), 3 to 10; y-axis: percent, 1 to 99.]
2.5 Time-to-event data
A specific category of ‘taking measurements on people’ involves examining
the time taken until an event has occurred, based on the difference between
two calendar dates. An event could be defined in many ways, and one of the
simplest and most commonly used is ‘death’, hence the term survival analysis
which is applied to this type of data. This definition of an event is used in this
section, but others are given in Section 2.6. In the following seven subjects, the
endpoint is ‘time from randomisation until death (in years)’, and all have died:
4.5   6.1   6.7   8.3   9.1   9.4   10.0
The mean (7.7 years) or median (8.3 years) are easily calculated. In another
group of nine subjects, not all have died at the time of statistical analysis:
Time (years)   2.7    2.9    3.3     4.7    5.1     6.8     7.2    7.8    9.1
Status         dead   dead   alive   dead   alive   alive   dead   dead   alive
The mean or median cannot be calculated in the usual way, until all the
subjects have died, which could take many years, and it is incorrect to ignore
those still alive because the summary measure would be biased downward.
An alternative is to obtain the survival rate at, say, 3 years. In the example,
2 people died before 3 years and 7 lived beyond, so the 3-year survival rate
is 7/9 = 78%. This is simply an example of ‘counting people’. However,
every subject needs to be followed up for at least 3 years, unless they died
beforehand, and the outcome (dead or alive) must be known at that point for
all of them. In many studies this is not possible, particularly with long follow
Types of outcome measures
25
up, because contact is lost with some subjects. This approach also ignores the
length of time before a subject dies.
In 1958 a statistical method was developed that changed the way this type
of data was displayed and analysed.4 In the example above, the time-to-event
variable is treated as ‘time from randomisation until death or last known to be
alive’ (instead of ‘time from randomisation until death’), and there is another
variable with the values 0 or 1 to indicate ‘still alive’ or ‘dead’. A subject who
is still alive, or last known to be alive at a certain date, is said to be censored.
The two variables are used in a life-table from which it is possible to construct
a Kaplan–Meier plot. This approach uses the last available information on
every subject and how long he/she has lived for, or has been in the study. It is
therefore less of a concern if contact with some subjects was lost because having the date when they were last known to be alive still provides information.
Table 2.2 and Figure 2.4 are based on the group of nine subjects above. The
plot looks like a series of steps. Every time a subject dies, the step drops down
(the first drop is at 2.7 years). When subjects are censored, four in the example,
they contribute no further information to the analysis after that date. In large
studies with many deaths, the plot looks smoother.
It is possible to estimate survival rates at specific time points, and the
median survival. For the 5-year survival rate, a vertical line is drawn on
the x-axis at ‘5’ and the corresponding y-axis value is taken when the line hits
the curve: 65% (Figure 2.4). The median is the time at which half the subjects
have died. A horizontal line is drawn on the y-axis at ‘50%’ and the corresponding x-axis value is taken when the line hits the curve: 7.2 years. These
estimates are more accurately obtained from the life-table (Table 2.2).
Table 2.2 Life-table for the survival data of nine patients on page 24.

Time since              Event indicator            Number of          Percentage alive
randomisation (years)   (0 = censored, 1 = dead)   patients at risk   (survival rate %)
0                       –                           9                 100
2.7                     1                           9                  89
2.9                     1                           8                  78
3.3                     0                           7                  78
4.7                     1                           6                  65
5.1                     0                           5                  65
6.8                     0                           4                  65
7.2                     1                           3                  43
7.8                     1                           2                  22
9.1                     0                           1                  22

• To obtain the 5-year survival rate from the table it is necessary to ascertain whether there is a
value at exactly 5 years. Because there is not, the closest value from below is taken, i.e. at
4.7 years: the 5-year survival rate is 65%.
• The median survival is the point at which 50% of patients are alive. The closest value from
below is 43%, so the median is 7.2 years.
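For readers who want to verify Table 2.2, here is a minimal sketch (not from the book) of the Kaplan–Meier product-limit calculation for the nine subjects:

```python
times  = [2.7, 2.9, 3.3, 4.7, 5.1, 6.8, 7.2, 7.8, 9.1]  # years, sorted
status = [1,   1,   0,   1,   0,   0,   1,   1,   0]     # 1 = dead, 0 = censored

surv = 1.0
at_risk = len(times)
for t, dead in zip(times, status):
    if dead:
        surv *= (at_risk - 1) / at_risk   # the curve steps down at each death
        print(f"{t} yrs: {surv:.0%}")     # 89%, 78%, 65%, 43%, 22%
    at_risk -= 1                          # deaths and censorings both leave the risk set
```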
[Figure 2.4 Kaplan–Meier plot of the survival data for nine patients, which can also be used to estimate survival rates and median survival.]
When some subjects are censored, i.e. not all have died, the Kaplan–Meier
median survival is not the same as finding the median from a ranked list of
numbers (as in the example on page 21). They are only identical when every
subject has died, which is rare in trials. The median is used instead of the
mean, because time-to-event data often has a skewed distribution.
The Kaplan–Meier plot starts off with every subject alive at time zero; this
is the most common form in the literature. This type of plot is useful when
deaths tend to occur early on. However, it is possible to have a plot in which
no subject has died at time zero. Figure 2.5 uses the same data as in Figure 2.4,
but the death (i.e. event) rate instead of the survival rate is shown on the y-axis
(100 minus the fourth column in Table 2.2).

[Figure 2.5 Kaplan–Meier plot of the survival data for nine patients on page 24, based on cumulative risk. x-axis: time since randomisation (years), 0 to 10; y-axis: percentage who have died, 0 to 100.]

This type of plot may be more
informative when deaths tend to occur later on. A curve based on the survival
rate has to start at 100% at time zero, but because the y-axis for the death rate
starts at zero, the upper limit can be chosen, allowing differences between two
treatments to be seen more clearly.
Different types of time-to-event outcome measures
In the section above, the 'event' in the time-to-event data is 'death'; this is called
overall survival because it relates to death from any cause. The methods can
apply to any endpoint that involves measuring the time until a specified event
has occurred, for example, time from entry to a trial until the occurrence or
recurrence of a disorder, such as severe exacerbation of asthma, or any change
in health status, such as time until hospital discharge. The ‘event’ should be
clearly specified. Box 2.4 shows commonly used time-to-event endpoints.
Overall survival is simple because it only requires the date of death.
Cause-specific survival requires, in addition, accurate confirmation of cause
of death (such as pathology records), which is not always available or reliably recorded. Also, cause-specific survival means that deaths from causes
other than that of interest are not counted as an event (they are censored).
This may be inappropriate when the treatment has serious side-effects. A new
therapy may reduce the lung cancer death rate, but increase the risk of dying
from treatment-related side-effects, for example, cardiovascular disease. Here,
overall survival is probably more appropriate.
When an event is disease incidence,# recurrence or progression, the date
when this occurs is required. However, obtaining accurate dates is difficult
unless subjects are examined regularly. The date is usually when the disease
was first discovered. This is either the date when the subject was due to have
one of the regular examinations specified in the trial protocol (see page 161),
or after the subject developed symptoms and received clinical confirmation.
Subjects in the trial arms should therefore have their regular examinations at
a similar time. If, for example, Group A have their examinations earlier than
Group B, this could bias the endpoint in favour of Group B (Figure 2.6).
When the measure is based on two or more event types and a subject could
have both events, such as disease occurrence followed by death, it is usual to
consider only the date of the first event in the analysis. This is because the
patient may be managed differently afterwards: the trial treatment changes
or stops, non-trial therapies are given, or patients may be given the treatment
from the other trial arm. When this occurs, it is difficult to deal with subsequent events and to attribute differences in the endpoint to the trial
treatments. Unlike overall survival, disease-, progression- or event-free survival are unaffected by subsequent treatments because only the first event
matters in the analysis.
# The first time the subject develops the disease of interest.
Box 2.4 Time-to-event outcome measures in trials
(For each endpoint, an event is defined as follows; all other subjects are censored.)

Overall survival
  Event: death from any cause.
  Comments: easily defined; may mask the effects of an intervention if it only affects a specific disease.

Disease-free survival
  Event: first recurrence of the disease, or death from any cause.
  Comments: useful when patients are thought to be free from disease after treatment, so patients have a good prognosis; needs date of recurrence.

Event-free survival
  Event: first recurrence of the disease, first occurrence of other specified diseases, or death from any cause.
  Comments: similar to disease-free survival.

Progression-free survival
  Event: first sign of disease progression, or death from any cause.
  Comments: useful for advanced disease, where patients have not been 'cured' after treatment and are expected to get worse in the near future; needs date of progression.

Disease (or cause)-specific survival
  Event: death from the disease of interest.
  Comments: useful when examining interventions that are not expected to have an effect on any disease apart from the one of interest; needs accurate recording and confirmation of cause of death; assumes treatment is not associated with deaths from other causes.

Time-to-treatment failure
  Event: first sign of disease progression, death from any cause, or stopped treatment.
  Comments: similar to progression-free survival.

Recurrence: there was no clinical evidence of the disease shortly after treatment, but the disease returned later on.
Progression (or relapse): the patient still had the disease after treatment, but it got worse later.
Disease-free and event-free survival may be used interchangeably, so it is useful to be clear about the precise definition.
[Figure 2.6: timeline diagram. Both hypothetical patients enter the trial at 0 months and true progression starts at 4 months; the first clinical examination is at 5 months for the Group A patient and 9 months for the Group B patient.]
Figure 2.6 Two hypothetical patients from Groups A and B, whose disease has the same biological course but with different dates of first clinical examination.
Recorded time to progression is 5 months for the patient in Group A and 9 months for the patient in Group B. It would falsely appear that Group B has a greater benefit.
2.6 Summary points
• Trials should have clearly defined outcome measures (endpoints)
• Surrogate endpoints should be closely correlated with 'true' endpoints, and have been validated, especially if they are used as the main trial endpoint
• Outcome measures could involve 'counting people', 'taking measurements on people' or 'time-to-event' data
• Counting people: data are summarised by a percentage or proportion (risk)
• Taking measurements on people: data are summarised by average and spread (mean and standard deviation if the data are Normally distributed, median and interquartile range if the data are skewed)
• Time-to-event data: when not all patients have had the event of interest, the data can be summarised using a Kaplan–Meier plot, median value, or survival or event rate at a specific time point.
References
1. Katz R. Biomarkers and surrogate markers: an FDA perspective. NeuroRx: J Am Soc Exp
NeuroTherap 2004; 1:189–195.
2. Temple R. Are surrogate markers adequate to assess cardiovascular disease drugs? JAMA
1999; 282(8):790–795.
3. Guidance for industry: clinical trial endpoints for the approval of cancer drugs and biologics. http://www.fda.gov/CbER/gdlns/clintrialend.htm.
4. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat
Assoc 1958; 53:457–481.
CHAPTER 3
Design and analysis of phase I trials
Phase I trials, often referred to as ‘first in man’ studies, are conducted to examine the biological and pharmacological actions of a new treatment (usually a
new drug), and its side-effects. They are almost always preceded by several
in vitro studies and studies in mammals. A more detailed discussion of the
design, conduct and analysis of phase I trials is found in the references.1–4
3.1 Design
Phase I studies are exploratory, and they usually aim to determine a sufficiently safe dose. They involve giving a certain dose to a few subjects, and if
tolerable, the next group receive a higher dose. This continues until the administered dose is associated with an unacceptable level of side-effects. This is not
the same as trying to find the most effective (optimal) dose, which is the objective of phase II and III trials. Although there needs to be a small number of
subjects in each dose group, the study should provide enough information
on safety and efficacy to determine whether a new drug should be investigated further. This can be a difficult balance to achieve. Few trials have formal
methods for estimating the total sample size because the number of subjects
recruited will largely depend on the design employed and how many doses
are evaluated until the trial stops. The trial protocol# could specify what might
be a maximum number of patients, based on the target range of doses.
Type of subjects
Healthy volunteers are often used, and if safe enough, there could follow
another phase I study in patients affected with the disorder of interest. An
exception is cancer drug trials, where traditional anti-cancer drugs are first
tested in cancer patients because the expected toxic effects make them inappropriate to test in healthy volunteers. Furthermore, healthy people may be
able to tolerate cancer drugs at higher doses than a cancer patient, who is
already ill. Cancer patients included in phase I studies have usually had
#
A detailed description of the trial design and conduct; see page 160.
several previous therapies, but did not respond, so they tend to be less fit than
the target group of patients. Therefore, estimates of treatment effectiveness
need to be interpreted carefully. Several phase I studies may be conducted,
each looking at different aspects of a new therapy. For example, examining
the pharmacological effects when the drug is taken with and without food,
giving multiple doses, and studying subjects with renal impairment.
Trial subjects must be monitored very closely, and this is usually done by
admitting them to a special clinical trials facility, allowing regular examinations over 24 hours or longer, such as blood tests and physical examinations.
If there is already evidence on the drug’s safety profile, subjects may be seen
as outpatients, but they still need to be examined regularly (e.g. at least once
a week). Participants are often found through advertisements in the media,
and those accepted onto a trial programme are paid for taking part (usually
for commercial company trials).
Outcome measures
One or more measures of toxicity are usually the main endpoints. In
healthy volunteers a serious adverse event can be any reaction related to the
trial drug that requires treatment and the person to be taken off the new drug.
This is called a dose-limiting toxicity (DLT). A DLT should occur relatively
soon after the drug was administered. In phase I trials based on subjects who
are already ill, some adverse events are expected naturally, and so may not
be classified as a DLT. The trial protocol should provide clear definitions of
toxicity.
The principal aim is to find the maximum tolerated dose (MTD), which
can be defined differently. Sometimes, it is the dose at which a pre-specified
number of individuals suffer a severe adverse event, indicating that this dose
may be too unsafe, so the next lowest dose would be investigated further. This
definition can also be called the maximum administered dose. At other times,
the MTD could be the dose that has an acceptable number of side-effects and
is therefore used in further studies. It is useful to be clear about the definition
used in a particular trial report.
Many other trial endpoints are measured, including those which monitor drug uptake, metabolism and excretion, for example, body temperature,
blood pressure, plasma concentration of the drug and other biological and
physiological measurements. There could also be several surrogate markers
that provide an initial evaluation of treatment effect, particularly when the
study is conducted in patients affected with the disorder of interest. Many
variables are examined because the data will be used to determine whether
the drug is safe enough and worth investigating further. The timing of the
assessments (i.e. how often), especially blood samples, needs to be carefully
considered; assessments are usually fairly frequent early on.
Which doses?
The starting dose for many drug trials is based on animal experiments, and
is one that is associated with a specified mortality rate. Different countries
Design and analysis of phase I trials
33
Table 3.1 Fibonacci sequence of numbers and the possible doses for a hypothetical trial.

Fibonacci   Difference between    Ratio of successive   Example of a dose (mg) to      Possible modified
sequence    successive numbers    numbers               be used in a phase I trial     Fibonacci doses*
 1          –                     –                       3                              3
 1          0                     1                       3                              3
 2          1                     2                       6                              5
 3          1                     1.5                     9                             10
 5          2                     1.667                  15                             15
 8          3                     1.600                  24                             25
13          5                     1.625                  39                             40
21          8                     1.615                  63                             65
34         13                     1.619                 102                            100
55         21                     1.618                 165                            165

* observed Fibonacci dose rounded to the nearest 5 mg.
have different requirements, for example, the US Food and Drug Administration require evidence from at least two mammalian species, including a
non-rodent species.5,6 The starting dose may also be specified in the guidelines. For example, with anti-cancer drugs the initial dose is usually one-tenth
of the dose that is associated with 10% of rodents dying in laboratory studies.
If a non-rodent species indicates that this dose is too toxic then the starting
dose could be one-third or one-sixth of the lowest toxic dose in those species.
There are several methods for determining subsequent doses. One is based
on a Fibonacci sequence, a series of numbers found to occur naturally in many
biological and ecological systems, for example, the number of petals on flowers. The series starts off with a ‘0’ and ‘1’, then every successive number is the
sum of the preceding two numbers. The first 10 numbers in the series are: 0,
1, 1, 2, 3, 5, 8, 13, 21, 34.
While the numbers appear to increase quickly, the relative increase is
roughly constant (Table 3.1). After the third dose, each subsequent dose is
about two-thirds greater. In practice, the doses are rounded up or down
(Table 3.1). This could be referred to as a ‘modified Fibonacci’ sequence, but
the relative increases should still be about two-thirds.
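As a small illustrative sketch (not from the book), the dose column of Table 3.1 can be generated by multiplying a Fibonacci sequence by the 3 mg starting dose:

```python
def fibonacci_doses(start_dose, n):
    """Doses proportional to the Fibonacci sequence, as in Table 3.1."""
    doses, a, b = [], 1, 1
    for _ in range(n):
        doses.append(start_dose * a)
        a, b = b, a + b
    return doses

doses = fibonacci_doses(start_dose=3, n=10)
print(doses)                              # 3, 3, 6, 9, 15, 24, 39, 63, 102, 165
print([5 * round(d / 5) for d in doses])  # rounded to the nearest 5 mg; note that
                                          # Table 3.1 keeps the two lowest doses at 3 mg
```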
Doses in a trial do not need to follow a Fibonacci sequence. The range could
be based on evidence from other studies or previous experience, or they could
come from a logarithmic scale (e.g. if the starting dose is 5 mg, subsequent
doses could be 10, 20 and 40 mg). The researcher could decide the dose range,
and the increase could be greater earlier on. In the example below, the dose
increases by about 50% in the three doses after the starting dose of 100 mg,
but at higher doses the relative increases are lower:
Dose (mg)           100   150   225   350   450   550
Relative increase   –     50%   50%   55%   29%   22%
Dose given to 3 subjects → number of subjects with a DLT:
  0       → go to next higher dose
  1       → treat 3 more at the same dose, then count the number of subjects with a DLT in the 6:
              1      → go to next higher dose
              2 or 3 → dose = MTD
  2 or 3  → dose = MTD
Figure 3.1 Flowchart for a phase I trial using a '3+3' design. MTD (maximum tolerated dose); DLT (dose-limiting toxicity). Doses are increased until the maximum planned dose or MTD is reached.
Conducting the trial
Because the drug has not been previously tested in humans, the protocol
needs to be followed carefully to avoid unnecessary harm to the subjects. The
subjects who have agreed to participate could also be randomised to the different doses (possibly even a placebo), though subsequent doses should only
be given after the current cohort of subjects have been evaluated for safety,
after a sufficient time has elapsed. There is a range of designs, from simple to
complex. A simple dose-escalation design is a 3 + 3 design.
The ‘3+3’ dose escalation scheme is a classical approach. It is based on
observing how many subjects in each group have a DLT before deciding
whether to keep the current dose or move to a higher dose. It is called ‘3+3’
because subjects are recruited in groups of three or six, as shown in Figure 3.1.
In this design, the decision rules to stop or continue to a higher dose are
based on a conventional toxicity risk of 1 in 3. If a different risk were assumed,
such as 1 in 4, the decision rules would need to change. While this design is
simple, there are limitations. If the starting dose is too low, there may be no
DLTs until after several doses have been administered. Therefore several subjects would have been treated without providing much information about the
MTD of the new drug, and the trial would take longer. There is also a chance
that the true MTD could be higher than the one indicated in a particular trial,
i.e. the study stops too early. If the drug is not too toxic, the design can be
adapted to reduce the probability of stopping early.
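As an illustration of the decision rules in Figure 3.1, here is a rough simulation sketch (not from the book); the DLT risks per dose are hypothetical, and the figure's convention of labelling the stopping dose as the MTD is followed:

```python
import random

def three_plus_three(dlt_risks, seed=1):
    """Return the index of the dose declared MTD (or the top planned dose)."""
    random.seed(seed)
    for dose, risk in enumerate(dlt_risks):
        dlts = sum(random.random() < risk for _ in range(3))  # first 3 subjects
        if dlts == 1:                     # ambiguous: treat 3 more at this dose
            dlts += sum(random.random() < risk for _ in range(3))
        if dlts >= 2:                     # 2+ DLTs (of 3 or of 6): stop here
            return dose                   # the figure labels this dose the MTD
        # 0 DLTs in 3, or 1 DLT in 6: escalate to the next dose
    return len(dlt_risks) - 1             # maximum planned dose reached

print(three_plus_three([0.05, 0.10, 0.20, 0.35, 0.55]))  # result depends on the seed
```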
There are several other variations on these designs,3 e.g. accelerated titration, but whichever is used, the safety stopping rules should be clearly
specified before the trial begins, to minimise the possibility of researcher
bias towards higher (and possibly more unsafe) doses. While these types of
Dose given to 3 subjects → number of subjects with BA:
  0 or 1 → go to next higher dose
  2 or 3 → treat 3 more subjects at the same dose, then count the number of subjects with BA in the 6:
             2 to 4 → go to next higher dose
             5 or 6 → dose = MBAD
Figure 3.2 Flowchart for a phase I trial based on examining a biological endpoint. MBAD (minimum biologically active dose); BA (biological activity). Doses are increased until the maximum planned dose or MBAD is reached.
designs are simple to use and easy to interpret, they have been criticised for
being inefficient. Sometimes the starting and subsequent early doses are too
low, so many subjects are treated before any activity (safety or efficacy) is
observed.
There are more complex dose-escalation designs that are believed to be
more efficient. These include the continuous reassessment method and those
based on Bayesian methods. They are based on statistical modelling and
assume a mathematical relationship between dose and the chance of having a
DLT at each dose; often a sigmoid (flattened S-shaped) curve. At early doses,
a lack of toxicities indicates that subsequent doses could be made greater than
those based on, say, a Fibonacci sequence. After each cohort of subjects has
been evaluated, the actual shape of the dose-response curve is re-estimated,
in order to reach the MTD quicker. Sometimes, there may only be one subject
per dose, so that fewer patients are needed than with the simpler designs. However,
a limitation of these methods is that it may be difficult to get enough information about the pharmacological actions of the drug with only one subject
per dose.4
Once the MTD has been determined, it might be useful to test the dose on
a further group of, say, 10 subjects, to obtain a clearer view of the safety profile before proceeding to a larger study and perhaps also an examination of
efficacy.
3.2 Non-toxicity endpoints
The above designs are used to identify the maximum tolerated dose when
using drugs or exposures with expected toxicities. As new safer therapies are
36
Chapter 3
developed, biological endpoints or pharmacological measures, i.e. markers of
drug activity, may be as important. The objective of the trial could then be to
find the minimum dose that has a material effect on the biological endpoint.
This is sometimes called the minimum biologically active dose (MBAD).
Rather than identifying subjects who exhibit a DLT, an endpoint of biological
activity (BA) is specified. Toxicity must still be monitored closely, but there
may be other indicators that determine which dose is carried forward to a
phase II study.
A simple design associated with this type of endpoint is the ‘5/6 design’
(Figure 3.2).3 The MBAD is chosen when five out of six subjects exhibit a predefined biological activity. An example could be changes in Ki67, a marker of
tumour cell proliferation in cancer. If the Ki67 for a patient decreases from,
say, 50 to 25%, this could indicate biological activity of a new treatment.
3.3 Statistical analysis and reporting the trial results
There should be a summary of the characteristics of the subjects, details of the
side-effects observed (including severity), and a description of the following
pharmacological effects:
• Pharmacodynamics: physical or biological measures that show the effect of the new drug on the body (this could include efficacy)
• Pharmacokinetics: physical or biological measures that show how the body deals with the new drug.
Pharmacokinetics can be presented as a plasma concentration–time curve,
which plots blood levels of the new drug against time since administration,
showing how much of the drug gets into the blood and what happens to
these levels over time (Figure 3.3). The following measures can be obtained
from this type of curve, for each subject:2
• area under the curve (AUC), indicating total drug exposure
• Cmax, the highest concentration level
• Tmax, the time at which Cmax occurs
[Figure 3.3 Plasma concentration–time curves for two trial subjects, panels (a) and (b).]
• terminal half-life (t1/2), the time it takes for the plasma concentration to decrease by 50% in the final part of the curve, when the drug is being eliminated (here, the curve may appear as a straight line if using a log transformation of the plasma levels).
Other measures are clearance (CL), the rate at which the drug is removed
from the plasma as it is metabolised or excreted (CL = dose/AUC); volume of
distribution (V), the amount of drug in the body divided by the plasma concentration; and bioavailability (F), the percentage of administered dose that
gets into the systemic circulation (e.g. an intravenous drug should have F =
100%).2
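As a sketch of how these summaries are derived (the concentration–time values and dose below are hypothetical), the AUC can be approximated with the trapezoidal rule:

```python
times = [0, 0.5, 1, 2, 4, 8, 12]           # hours since administration (hypothetical)
conc  = [0, 4.2, 6.8, 5.1, 2.9, 1.1, 0.4]  # plasma concentration, mg/L (hypothetical)

# AUC by the trapezoidal rule: total drug exposure
auc = sum((t2 - t1) * (c1 + c2) / 2
          for t1, t2, c1, c2 in zip(times, times[1:], conc, conc[1:]))

cmax = max(conc)                  # highest concentration level
tmax = times[conc.index(cmax)]    # time at which Cmax occurs

dose = 50                         # administered dose in mg (hypothetical)
clearance = dose / auc            # CL = dose/AUC, as defined in the text

print(auc, cmax, tmax, clearance)
```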
Summary curves and statistics can be produced across all subjects (e.g.
the mean AUC). Showing that AUC increases proportionally with dose (i.e.
AUC doubles as the dose doubles), makes it easier to describe and model the
effect of the drug, and plan further early phase studies. There could also be
a description of how the body metabolises the drug (i.e. what molecules the
drug changes to), and the process and speed of excretion.
Table 3.2 Example of a phase I trial.7

Target disease: Parkinson's disease
Drug being investigated: BAY 63-9044, a new 5-HT1a-receptor agonist (has neuroprotective and symptomatic effects)
Aim: To determine the maximum tolerated dose
Design: First-in-man trial of male healthy volunteers, aged 18–45 years (randomised study)
Treatment doses investigated: 0.25, 0.50, 1.20, 2.50, 5.00 mg and placebo
Definition of dose-limiting toxicity, DLT (i.e. treatment-related side-effects): Any drug-related adverse event (graded mild, moderate, severe)
Number of subjects: N = 45
Main result: There were no serious adverse events. The number of mild or moderate events out of the number of subjects in each cohort was:
  Placebo   0/14
  0.25 mg   2/7
  0.50 mg   0/7
  1.20 mg   0/6
  2.50 mg   1/5
  5.00 mg   5/6
Conclusion: There were too many subjects with adverse events in the 5 mg dose group. A dose of 2.5 mg should be used in further studies.
An example of a phase I trial is given in Table 3.2.7 Five out of six subjects
suffered an adverse event at the highest dose of 5 mg, therefore the next lowest
dose (2.5 mg) would be recommended for further investigation.
3.4 Summary points
• Phase I studies are small and aim to provide a first assessment of safety in human subjects
• There are simple designs for determining the dose of a new drug that has an acceptable number of serious side-effects
• Trials of new, safer therapies may need to have different biological endpoints as well as toxicity
• Reports of phase I studies should provide clear information on the pharmacological properties of a new drug, including plasma concentration curves over time, and details of adverse events.
References
1. O’Grady J, Joubert PH (Eds). Handbook of Phase 1 and 2 Clinical Drug Trials. CRC Press Inc.,
1997.
2. Griffin JP, O’Grady J (Eds). The textbook of pharmaceutical medicine. BMJ Books, Blackwell
Publishing, 5th edn, 2006.
3. Eisenhauer EA, Twelves C, Buyse M. Phase I cancer clinical trials: a practical guide. Oxford
University Press, 2006.
4. Eisenhauer EA, O’Dwyer PJ, Christian M, Humphrey JS. Phase I clinical trial design in
cancer drug development. J Clin Oncol 2000; 18:684–692.
5. www.fda.gov/cder/guidance/pt1.pdf
6. http://www.fda.gov/cder/guidance/7086fnl.pdf
7. Wensing G, Haase C, Brendel E, Bottcher MF. Pupillography as a sensitive, non-invasive
biomarker in healthy volunteers: first-in-man study of BAY 63-9044, a new 5-HT1a receptor agonist with dopamine agonistic properties. Eur J Clin Pharmacol 2007; 63:1123–
1128.
CHAPTER 4
Design and analysis of phase II trials
Phase II trials are useful in examining the potential effectiveness of an intervention before embarking on a large, expensive phase III trial. They are common in oncology, and many of the designs and statistical issues have been
based on cancer studies.
4.1 Purpose of phase II studies
The aim is to obtain preliminary evidence on whether a new treatment might be
effective, i.e. whether it can influence a clinically important outcome measure,
such as mortality, or reduce the severity of a disease. Safety should still be
monitored closely. The results of a phase II study often help design a phase III
trial.
Phase II studies may also be pilot (or feasibility) studies, used to assess
whether a phase III trial is likely to be successful. The study is designed and
conducted in a similar manner to a phase III trial (Chapter 5), but the protocol
specifies that an early assessment is made after a proportion of subjects have
been recruited first (e.g. 25%), or the trial has run for a fixed length of time.
A formal sample size calculation for this part of the study is not normally
necessary.
Pilot studies often raise issues that require investigation, for example, examining the proportion of eligible subjects approached who agree to participate
(i.e. the acceptance or uptake rate), and if accrual is low, what might be the
likely reasons for this. Consider a phase III trial requiring 600 subjects to be
recruited over four years. The pilot phase could be conducted to see whether
a recruitment rate of 15 subjects per month is likely. The endpoint is ‘monthly
accrual rate’ assessed, say, 12 months after recruitment started, ignoring the
expected low initial accrual rates during trial set up (say 60 in the first year).
If the uptake rate is low, ways could be found to encourage participation, perhaps by changing the wording of the patient information sheet (see page 161).
In the remainder of this chapter only phase II studies examining efficacy and
safety are discussed.
4.2 Design
There are several phase II designs, and a discussion is found in various sources
(though some are aimed at statisticians).1–8 Most methods are intended for
Box 4.1 Example of a two-stage phase II design
The response rate for a new treatment should not be lower than 20%, the
rate associated with standard therapy. The new intervention should have a
response rate of at least 35%. Using these estimates, a 5% level of statistical significance and 80% power (page 43) produce the following design:
Stage 1: Recruit and treat 22 subjects
  If ≥6 respond, continue trial to Stage 2 (treatment might be effective enough)
  If ≤5 respond, stop trial early (treatment unlikely to be effective enough)
Stage 2: Recruit a further 50 patients, to make 72 in total
  If ≥20 respond, consider further investigation
The method is described in reference 11.
studies examining whether a new intervention is likely to be better than current treatments, based on an improvement in disease status, or fewer side-effects.
Single-arm study
The simplest design has only one arm: all subjects are given the new intervention. The advantage is that all resources, i.e. subjects and financial costs,
are concentrated on one group. Some designs also specify how many subjects
should respond to the new treatment in order to justify further investigation.
For example, if a new intervention has an expected treatment response rate of
35% and the percentage of subjects who currently respond is 20%, the sample
size would be 56 subjects, of which ≥17 need to respond to indicate that the
true response rate is greater than 20%.# If, however, there are only five responders it is unlikely that the treatment is effective. (The definition of ‘response’
will depend on the trial endpoint used.)
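Here is a sketch of the exact binomial reasoning behind such designs (assuming a simple one-sided binomial calculation rather than the specific published method): for the 56-patient example, 17 or more responders should be unlikely (below 5%) if the true response rate is only 20%, and likely (roughly 80%) if it is really 35%:

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

print(prob_at_least(17, 56, 0.20))  # ~0.04: small enough to 'exclude' a true 20% rate
print(prob_at_least(17, 56, 0.35))  # ~0.8: roughly the design's power
```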
Single-arm two-stage study
Although single-arm phase II studies usually have about 30–70 subjects, it
may be preferable to stop the trial early. In a two-stage design, the intervention
is first tested on a small number of subjects, and the subjects are assessed at
the end of this stage (Box 4.1). If a certain number respond, the trial continues
and a second group of subjects is recruited, otherwise the trial stops: this is
# 17 out of 56 is 30%, but the calculated one-sided 95% confidence interval (discussed later
in this chapter), has a lower limit of 20.4%. The lower limit excludes the possibility of a true
underlying rate of 20% with sufficient certainty – i.e. the new treatment response rate is
likely to be greater than 20%.
referred to as a stopping rule. This design is used when the outcome is based
on ‘counting people’ (i.e. binary data). Two-stage designs are useful when the
new intervention might have serious side-effects or is expensive, because only
a few subjects are given such a therapy, which may have no true benefit. A
practical limitation is that after the first stage is reached, centres probably need
to stop recruiting further patients until the initial assessment is made. They
then need to re-start recruiting if enough subjects respond. There are logistical
issues associated with temporarily halting a study.
However, the decision to continue to Stage 2 may hinge on the response of
only one or two subjects. In Box 4.1, suppose there are five responders during
Stage 1, but there really is a beneficial effect of the treatment. If the stopping
rule is strictly adhered to, an effective treatment would not be studied further
and future patients would not benefit. The alternative is also possible. A truly
ineffective treatment is investigated further because a sufficient number of
subjects happened to show a response in Stage 1, though this is probably of
less importance.
Randomised phase II trial with control arm
There are two trial groups; the new intervention and a control (standard treatment or placebo). The control arm is often used when it is not well known how
subjects respond generally. The results found in each arm are used to design
the corresponding arms in a phase III trial, in particular determining sample
size (see Chapter 5). By randomising subjects to the trial arms, some comparison could be made at the end of the study, although this will not determine
whether the new intervention is better. This design also provides information on recruitment rates, subjects’ willingness to participate in a randomised
study, and possible logistical problems, all of which could help future studies.
Randomised phase II trial with several intervention arms
Two or more new treatments could be examined simultaneously. Each arm
is designed as a single-arm study, and subjects are randomised to the different groups, with the same advantages as above. One or more of the new
treatments are identified that could be investigated further. This design is
sometimes called ‘pick the winner’, though there is not necessarily a single
‘winner’. The primary intention is not to directly compare the results between
the new treatment arms. Deciding which treatment should be taken further is
determined in the same way as with a single-arm phase II study, i.e. whether
the treatment response rate in each arm exceeds the expected response associated with standard treatments. This design could also include a control arm
using standard treatment or placebo.
Randomised phase II trial with several intervention arms:
two-stage design
This is an extension of the single-arm two-stage design. At the first stage, a few
subjects (specified by the sample size calculation) are randomised to each of
the new treatments. An assessment of efficacy is made, and those treatments
that seem effective enough proceed to Stage 2, though not all will pass the first
stage (another form of ‘pick the winner’).
Types of phase II trials
• Single arm
• Single arm, two-stage design
• Randomised phase II with control arm
• Randomised phase II with several new treatment arms*
• Randomised phase II with several new treatment arms, two-stage design.*
* could include a control arm (standard treatment or placebo)
4.3 Choosing outcome measures
Phase II studies should be conducted in a relatively short space of time, and
the main endpoint should be compatible with this, as well as being clinically relevant. Therefore, several surrogate endpoints can be used (page 17).
Observed changes in a validated surrogate endpoint may indicate an effect on
a true endpoint. Similarly, if a new treatment appears to have no effect on a
surrogate marker, it is unlikely that it would have an effect on a true endpoint.
There may be several endpoints because the aim is to have a preliminary evaluation of the new intervention, and sufficient information is needed to decide
whether a larger phase III trial is justified.
4.4 Sample size
There are various methods for estimating how many subjects should be
recruited. This depends on the study design employed (single or two-stage),
and the type of outcome measure. Two treatment effects are specified (i.e. two
proportions or two mean values):
• One that is thought to be associated with the new intervention. This may
come from prior evidence, or it may be the minimum effect that would be
considered clinically important
• One that is considered to be the lowest acceptable level, usually the same
as that for current treatments or standard of care. The new treatment needs
to be more effective than this. The sample size method assumes that this
effect is known with certainty.
A fundamental difference between randomised phase II and III trials is that
the sample-size calculation for phase III studies assumes that the treatment
effect in each arm is not known with certainty, even though there is some
knowledge of this in the standard treatment group (see page 10). In phase
II studies, the sample-size calculation assumes there is only one area of
uncertainty, i.e. the new intervention arm. This is why the sample size is
always larger in a phase III trial: the treatment effect in each trial arm will
have some imprecision when trying to estimate the true effects.
Information is required on two other factors:
Statistical significance level. This is often set at 5%. If a new treatment is
believed to have a response rate of 35%, but it really has a response rate which
is no greater than the standard treatment (e.g. 20% response rate), there would
be a 5% probability of finding a difference as large as 15 percentage points just
by chance.# (This is called a Type I error.) It is assumed that a mistake would
be made by concluding that the new intervention is better than standard treatments, when in fact it is not, so a one-sided significance level is used. In many
phase III trials, a two-sided significance level is used, because a mistake is
made by concluding that the new intervention is better or worse than the control group, when there really is no difference between them (see Chapter 5).
Power. This is the chance of finding an effect in the study if a true effect really
exists. This is set at a high level, 80 or 90%. (The converse, 20 or 10% is called a
Type II error, the chance of missing an effect if it exists.) In the example above,
there could be an 80% chance of finding a difference ≥ 15 percentage points,
if the true response rate is 35%.
Power
At the end of the trial we want to say:
‘A comparison of the observed response rate of 35% (new intervention)
compared with the known∗ response of 20% (control) is statistically significant at the 5% level.’
We want an 80% probability of being able to make this statement (80% power)
if there really is a difference of this magnitude.
∗ assumed to be known with certainty
There are various statistical formulae to calculate sample size,7–11 some of
which come with free software,7 and commercial software is available.12,13
Calculating the sample size
When the outcome measure involves counting people, the specified percentage (or proportion) associated with the new treatment and standard therapy
# ‘15 percentage points’ is a better way of describing the effect than ‘15%’ when comparing
two percentages. It avoids the possible confusion over whether the rate for the new
treatment is 20% + 15% = 35%, rather than 15% greater than 20%, which would be
20% × 1.15 = 23%.
Table 4.1 Sample sizes for a phase II study, where the endpoint is based on 'counting people'.
The table shows the number of subjects that need to be given the new treatment (based on a 5%
one-sided level of statistical significance), from A'Hern.9

% standard treatment      % expected in new     80% power     90% power
(assumed to be known)     intervention
10                        20                    78 (13)       109 (17)
10                        25                    40 (8)         55 (10)
10                        30                    25 (6)         33 (7)
30                        40                    141 (52)      193 (69)
30                        45                    67 (27)        93 (36)
30                        50                    39 (17)        53 (22)
50                        60                    158 (90)      213 (119)
50                        65                    69 (42)        93 (55)
50                        70                    37 (24)        53 (33)

If a randomised phase II trial with a control arm is used, the total study size is usually double the
number of subjects in the above table.
The numbers in brackets are the number of observed responses needed at the end of the trial to
help justify further investigation of the new treatment (it ensures that the lower limit of a
one-sided confidence interval exceeds the response rate in the standard treatment arm). Another
method is by Fleming10, though both approaches give similar sample sizes as they get larger.
are used in the sample-size calculation. Examples are shown in Table 4.1.
When taking measurements on people, the specified means and the standard
deviation are required, which are converted into a standardised difference.
Standardised difference = [(expected mean in intervention group) − (known mean using standard treatment)] / standard deviation of the measurement
For example, suppose a new diet aims to reduce body weight to 83 kg. If
the usual average weight is 85 kg, with standard deviation of 5 kg, the standardised difference is (83 − 85)/5 = −0.4. The simplest sample size method
assumes that the endpoint has a Normal distribution. If it clearly does not,
then non-parametric methods should be used, which are more complex.
Sample size based on ‘taking measurements on people’ endpoints could be
calculated using a one-sample t-test. For example, the number of subjects that
need to be given a new therapy are 101, 40 and 27 for standardised differences
of 0.25, 0.40 and 0.50 respectively (80% power and one-sided 5% level).
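As a sketch, the familiar Normal-approximation formula n = (z_alpha + z_beta)² / (standardised difference)² reproduces these numbers to within a couple of subjects; the exact t-based calculations quoted above are slightly larger:

```python
from statistics import NormalDist

def one_sample_n(std_diff, power=0.80, alpha=0.05):
    """Normal-approximation sample size for a one-sided, one-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided 5% level
    z_beta = NormalDist().inv_cdf(power)        # 80% power
    return ((z_alpha + z_beta) / std_diff) ** 2

for d in (0.25, 0.40, 0.50):
    print(d, round(one_sample_n(d)))  # ~99, ~39, ~25 (quoted t-based values: 101, 40, 27)
```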
When estimating sample size for a ‘time-to-event’ endpoint, a simple
approach is to use the ‘counting people’ category, allowing the use of standard
methods. Consider a new therapy for Alzheimer’s disease, where the trial
endpoint is time to progression. The percentage of patients who have progressed (or not progressed) at a certain time point, say six months, is used
though all patients need to be followed for six months.
If the median times are known, they can be converted to an event rate at a
specific time point. Suppose the expected median time to progression using
the new treatment is eight months, but the six-month progression-free rate is
required, then:
• progression-free rate at y months = exp[(loge 0.5 × y)/median time to progression]#
• progression-free rate at six months = exp[(loge 0.5 × 6)/8] = 0.59, or 59%
(the progression rate at six months = 100% − 59% = 41%).
If the median using standard therapy is five months (assumed to be known
with certainty), the six-month progression-free rate is 44%.
The sample size can be estimated using 59 vs 44% (about 70 patients).9
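A minimal sketch of this conversion, assuming (as the footnote notes) an exponential distribution for time to progression:

```python
from math import exp, log

def progression_free_rate(months, median_months):
    """Event-free rate at a time point, assuming exponential survival."""
    return exp(log(0.5) * months / median_months)

print(progression_free_rate(6, 8))  # ~0.59: new treatment (median 8 months)
print(progression_free_rate(6, 5))  # ~0.44: standard therapy (median 5 months)
```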
Table 4.2 shows examples of sample size descriptions for different trial
designs. In the last example, the researcher chooses the number of patients
in the control arm, and it often just happens to be the same as the intervention arm. There is no scientific justification for this. They are made the same
because this makes it easier to describe and conduct the trial, particularly if
a placebo group is used, and the trial will look more similar to the possible
subsequent phase III trial. The ratio of subjects in the new intervention and
control arm could be 2:1. In phase III trials the number of subjects in both
arms come from the statistical method used to estimate sample size.
Information needed to calculate sample size
• Expected effect in the new intervention group
• The effect in patients given standard treatments (assumed to be known with certainty)
• Significance level (usually 5%, at the one-sided level)
• Power (usually 80% or 90%).
Sample-size estimation is not an exact science. It is dependent on the input
parameters, for example, the estimated effect in the test arm, and the known
effect using standard treatment. If either of these is far from the true values,
the sample size will be too small or too large.
# Assumes that progression (or any other time-to-event measure) has an exponential
distribution.
Table 4.2 Hypothetical examples of sample size descriptions that can be used in a grant application or trial protocol. (Each entry gives the type of phase II study, the outcome measure category and the trial endpoint, followed by a description; the key numbers are those needed for the sample size calculation, using formulae or software, to produce the number of subjects required, and sometimes the number of required treatment responders.)

Single arm; counting people; progression rate (%):
The percentage of Alzheimer's patients who are expected to progress after one year with a new Drug A is 15%. The percentage who usually progress is 25%. Drug A should not have a progression rate as high as this. A single-arm study would require 103 patients to show a decrease from 25 to 15% as statistically significant at the 5% level (one-sided) with 80% power. If at most 15 patients progress, then a larger trial might be justified.#

Single arm; taking measurements on people; body weight (kg):
Diet B is expected to reduce body weight by 2 kg in women aged 20–40 years. Body weight is Normally distributed and the mean weight in women is generally about 70 kg, with a standard deviation of 5 kg. The aim is to reduce body weight to 68 kg. A single-arm study would require 40 subjects to show a standardised difference of 0.4 [(68 − 70)/5], with 80% power and a one-sided 5% level of statistical significance.

Single arm two-stage; counting people; tumour response rate (%):
The percentage of patients with an advanced sub-type of ovarian cancer who are expected to have a partial/complete tumour response after standard treatment is about 20%. A new Therapy F is expected to increase this to 35%. Using a two-stage design (80% power and one-sided 5% significance level), the following design is employed: 22 patients are recruited in Stage 1. If 6 or more patients respond, then a further 50 patients are recruited (Stage 2), to make 72 in total. If 20 or more respond out of 72, then a larger trial would be worthwhile.

Randomised phase II with control arm; counting people; progression-free survival rate (%):
The percentage of patients with pancreatic cancer who are alive and progression-free is normally 20% after 1 year. Therapy G is expected to increase this to 35%. Fifty-six patients need to be given Therapy G in a phase II study with 80% power and a one-sided test of statistical significance at the 5% level. If at least 17 patients remain alive and progression-free, then a larger trial might be justified. Because this type of cancer is relatively uncommon, the progression-free survival rate using standard treatments is not known with sufficient reliability. A control arm that has the same number of patients as the new treatment arm, i.e. 56 patients, will be used. Therefore, the trial will have 112 patients in total. To allow an unbiased comparison at the end of the study, patients will be randomised to both arms, acknowledging that the study is not powered for such a comparison.

# Sample size methods for 'counting people' endpoints are often based on those that are 'positive' (e.g. respond to treatment or are alive). In this example, the endpoint is 'negative', so 85% and 75% are used in the calculation, instead of 15 and 25%.
It is worthwhile providing references to the effect using standard treatments from the literature or unpublished work, where possible.
4.5 Stopping early for toxicity
When testing a new drug or medical device, a stopping rule for toxicity could
be incorporated. The trial stops early if the number of subjects who suffer a
severe treatment-related adverse event exceeds a pre-specified level. The rule
can be estimated using the sample size for efficacy. Suppose the sample size
is 56 patients (to compare a response rate of 35% with 20%, new vs control
treatments respectively). It is then necessary to specify what is considered to
be an unacceptable toxicity rate – for example, more than 30% – and use a
calculation based on the ‘binomial distribution’. This gives the probability of
seeing ‘x’ or more people with an adverse event, by chance, assuming that the
underlying (true) toxicity rate is ‘p’ (e.g. 30%). In the example, the probabilities
of seeing at least 20 people with an adverse event, out of 56, are as follows:
Number of subjects with an adverse event:        20     21     22     23     24     25
Probability of seeing this number or greater:    0.21   0.14   0.09   0.051  0.028  0.01
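These tail probabilities can be reproduced directly from the binomial distribution. Below is a minimal sketch in Python, assuming the scipy package is available; the inputs (n = 56 and an unacceptable true toxicity rate of 30%) come from the example above.

```python
# Tail probabilities for the toxicity stopping rule: the number of events X
# is assumed to follow a Binomial(n = 56, p = 0.30) distribution.
from scipy.stats import binom

n, p = 56, 0.30
for x in range(20, 26):
    # binom.sf(x - 1, n, p) is P(X > x - 1), i.e. P(X >= x)
    print(f"{x} or more events: probability {binom.sf(x - 1, n, p):.3f}")
```

Running this should reproduce the probabilities in the table above, up to rounding.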
Observing 20 or 21 affected subjects is consistent with a true rate of 30% because this could occur by chance. The trial would not stop early. However, as soon as 24 or 25 patients have an event, this is evidence that the true rate is probably greater than 30%. For example, the likelihood of seeing ≥24 events by chance, among 56 patients, is 0.028 if the underlying rate were 30%. Because this is a small probability (less than 5%#), it can be concluded that the true rate is likely to exceed 30%. Consideration should then be given to stopping early.
# 5% is the accepted cut-off; see also statistical significance on page 112.
4.6 Statistical analysis
A description of the subject population should be provided, usually as a table
summarising baseline characteristics, such as the age and gender distribution and other factors relevant to the disease of interest (for example, disease
stage). The following focuses on how to analyse and interpret the results for a
single trial arm. Statistical analyses for comparing two arms are discussed in
Chapter 7.
The data is often analysed on an intention-to-treat basis, i.e. trial subjects are included in the analysis whether or not they actually took the new treatment. It may also be useful to look at efficacy and toxicity in subjects who did take the trial treatment (a per-protocol analysis). Both approaches are discussed on page 116.
In research, ‘population’ refers to the set of all people of interest, and to
whom a new intervention could be given. When conducting a trial, a sample
of subjects is taken from the population. Data from the sample is used to make
inferences, not just about the individuals in the sample, but about the whole
population of interest. For example, in examining a new drug to alleviate pain
in adults with arthritis, a sample of patients is selected for the trial, but the aim
is to determine the effect of the drug in all patients, now and in the future.
It is not possible to study every adult with arthritis, so there will always
be some uncertainty in what can be inferred about the population from the
sample in the trial:
• Would the same result emerge in another group of subjects given the new intervention?
• Can the true treatment effect be estimated?
Natural variation between people and how they respond to the same treatment matters a great deal when interpreting research data (see page 4). Two statistical measures, the standard error and the confidence interval, allow this variability to be taken into account.
Analysing outcome measures based on counting people
The summary statistic is a simple percentage or proportion. In a group of 50
subjects given the new intervention, if 28 responded (however defined), the
observed response rate is 56% (28/50). An estimate of the true or population
proportion is needed. The true value is unlikely to be 56% exactly, but it is
hoped that it would be close.
The standard error of the true proportion quantifies how far the observed
value is expected to be from the true value, given the results of the trial with
a certain sample size. (This is done using assumptions about the data and
established mathematical properties.) A standard error is used to produce a
confidence interval (CI). A trial based on every relevant subject ever would
yield the true proportion. There would be no uncertainty and the standard
error would be zero.
What are the implications of conducting a trial on a sample
of people?
A CI for the true proportion is a range within which the true proportion is
expected to lie:
95% CI = observed proportion ± 1.96 × standard error
If the response rate is 56% (28/50), the 95% CI is 42% to 70% (see page 205
for the calculation). From this particular trial, the best estimate of the true
proportion of responders is 56%, but there is 95% certainty that the true value
lies somewhere between 42 and 70%.# This also means that the range could get the wrong answer 5% of the time. A conservative estimate is that 42% of all subjects are expected to respond, but as many as 70% could respond.
# The strict definition is that 95% of such intervals will contain the true proportion, but it is often easier to interpret confidence intervals using the definition in the main text; little is lost by this.
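As a sketch of the arithmetic behind this interval (and the one-sided limit discussed next), here is the Normal-approximation calculation in Python; the exact method on page 205 will give slightly different values.

```python
# 95% confidence interval for a single observed proportion (Normal approximation).
import math

responders, n = 28, 50
p = responders / n                        # observed proportion: 0.56
se = math.sqrt(p * (1 - p) / n)           # standard error: ~0.07

low, high = p - 1.96 * se, p + 1.96 * se  # two-sided 95% CI: ~42% to 70%
one_sided_low = p - 1.645 * se            # one-sided 95% lower limit: ~44%

print(f"{p:.0%} (95% CI {low:.0%} to {high:.0%}); "
      f"one-sided lower limit {one_sided_low:.0%}")
```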
This is a two-sided CI, and one that is commonly reported. The new intervention could be better or worse than the standard therapy. In phase II studies the main interest is in whether the new intervention is likely to be better, so researchers may also examine a one-sided CI. For this, only the upper or lower limit is needed, depending on which direction indicates benefit in relation to standard therapy. In the example, the objective is for the proportion responding to be greater than that using the standard treatment, so the lower limit is required. It is 44%, which should be higher than the response rate for standard treatments to justify further investigation.

Figure 4.1 The percentage of subjects who respond to a new treatment in 20 hypothetical phase II trials, each based on 50 subjects. Each dot represents the observed percentage, and the ends of the line are the lower and upper limits of the 95% confidence interval (two-sided). It is assumed that the true effect of the new treatment is known to be 55%, indicated by the vertical dashed line. [Figure: the horizontal axis shows the percentage of subjects who respond, from 0 to 100.]
Figure 4.1 illustrates the concept of confidence intervals using the one given
above (shown at the top of the diagram) and 19 hypothetical studies. For illustrative purposes, the true response rate is assumed to be known: 55%. Each of
the 20 trials is trying to estimate this. Some will have an estimate above 55%,
others below, and occasionally 55% exactly, but all have CIs that include 55%,
except one trial (fourth from the bottom). Because 95% CIs are used, 5% of
them (1 in every 20) are expected to exclude the true effect, just by chance.
A 95% CI is commonly used because a 5% error rate is considered sufficiently low. There is nothing special about ‘95%’; sometimes 90% or 99% CIs
are used. For moderate to large studies, the multiplier ‘1.96’ is associated with
using a two-sided 95% range. Different multipliers are needed for different
levels of confidence.
95% Confidence interval for a proportion or percentage
A range of plausible values for the true value based on the observed data. It is
a range within which the true proportion is expected to lie with a high degree
of certainty. If confidence intervals were calculated from many different studies of the same size, 95% of them should contain the true proportion.
The standard error and, therefore, the width of the CI depend on the number
of subjects in the trial. Figure 4.2 shows 95% CIs for studies based on 10 to 500 subjects. If the true effect were known there would be no confidence interval.

Figure 4.2 Counting people: estimates of the proportion who respond to a new treatment in hypothetical phase II trials of different sizes (10, 25, 50, 75, 150 and 500 subjects, and 'every subject ever'). Each dot represents the estimate of the treatment effect, and the ends of the line are the lower and upper limits of the 95% confidence interval (two-sided). It is assumed that the true effect of the new treatment is known to be 55%, indicated by the vertical dashed line. [Figure: the horizontal axis shows the percentage of subjects who respond, from 0 to 100.]
The larger the study, the greater the confidence that the observed estimate
is closer to the true value, so the range becomes narrower. A trial with few
subjects produces a wide CI, which reflects the lack of sufficient certainty over
the true value. Conclusions based on wide CIs (e.g. 95% CI 5% to 60%) are
difficult to interpret because the possible true proportion could be very low
or high. Such small studies can justly be described as uninformative because
they do not provide reliable information on the likely true value.
Large study: small standard error, so a narrow confidence interval.
Small study: large standard error, so a wide confidence interval.
Once the 95% CI is estimated, it is examined to see if it contains the response
rate for standard treatment. An example is given in Box 4.2. When there are
two or more new treatments, each 95% CI is examined to observe which
exclude and which include the expected effect for the standard treatment.
Box 4.2 Example of a phase II trial14
Objective: To examine the effect of using thalidomide in treating small cell
lung cancer, when added to standard chemotherapy
Trial design: Single-arm phase II study
Outcome measure: Tumour response rate (complete or partial remission)
Sample size: Thalidomide should have a response rate greater than 45% (standard treatments). A value as large as 70% would indicate that it would be
worthwhile investigating further in a large phase III trial. A sample size of 24
patients is required to detect this difference (with 80% power and 5% level of
statistical significance, one-sided test).
Results: 25 patients were recruited of whom 17 had a tumour response
Response rate = 68% (17/25)
One-sided 95% confidence interval (CI): lower limit is 50%
Two-sided 95% CI = 46 to 85%
Interpretation: The observed response rate was high (68%). The one-sided
lower limit is 50%, which means that enough patients had a tumour response
(17) to suggest that thalidomide could be associated with a true rate that is
greater than 45%. The observed rate is also close to the target rate of 70%. The
two-sided CI indicates that the true rate could be as high as 85%.
Recommendation: Thalidomide is worth further investigation
(95% CIs were calculated using an exact method)
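The exact intervals in Box 4.2 can be reproduced with a Clopper-Pearson calculation. The sketch below assumes scipy version 1.7 or later is available; the one-sided 95% lower limit is read off a two-sided 90% interval (5% in each tail).

```python
# Exact (Clopper-Pearson) confidence intervals for 17 responders out of 25.
from scipy.stats import binomtest

result = binomtest(k=17, n=25)
two_sided = result.proportion_ci(confidence_level=0.95, method='exact')
ninety = result.proportion_ci(confidence_level=0.90, method='exact')

print(f"response rate: {17 / 25:.0%}")
print(f"two-sided 95% CI: {two_sided.low:.0%} to {two_sided.high:.0%}")
print(f"one-sided 95% lower limit: {ninety.low:.0%}")
```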
Sometimes a greater level of confidence is used, such as 97.5%, to partly allow for having multiple comparisons, which increases the chance of finding an effect when there really is none. In deciding which treatments merit further study, the one(s) with the largest observed effect might be selected. If they all appear to be better than the standard treatment, the side-effects of each treatment, and the feasibility of conducting a larger trial with several groups, may be considered in choosing which to take forward.
Analysing outcome measures based on taking measurements
on people
When the endpoint involves taking measurements on people, the data can
be summarised by the mean and standard deviation (Chapter 2). In the same
way that a single proportion observed in a trial will be an estimate of the true
proportion, an observed mean value from a trial will be an estimate of the true
mean. The standard error of the mean quantifies how far the observed mean
is expected to be from the true value, and is used to estimate a CI for the true
mean:
95% CI = observed mean ± 1.96 × standard error
What are the implications of conducting a trial on a sample
of patients?
Suppose a new pain killer in adults with chronic back pain is evaluated using
a phase II study of 40 patients, and the endpoint is pain score (using a visual
analogue scale, 0 to 100 mm; 0 represents no pain and 100 is maximum pain).
At the end of the trial the observed mean pain score is 34 mm, with a standard
deviation of 18 mm (see page 205 for calculation).
Using the results from this particular study, the true mean VAS score associated with the new drug could be 34 mm, but whatever it is, there is 95% certainty that the true value is somewhere between 28 and 40 mm. For a one-sided CI the VAS score should be lower than with standard treatment, so the upper limit is required: 39 mm (34 + 1.645 × 2.8). If the mean VAS associated with standard treatment is 50 mm, the trial results indicate that the new treatment could be better, because the value is lower.
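A minimal sketch of these calculations, assuming the Normal-approximation formulae quoted in the text (page 205 gives the underlying method):

```python
# Confidence intervals for a single mean (pain score on a 0-100 mm VAS).
import math

n, mean, sd = 40, 34, 18
se = sd / math.sqrt(n)                          # standard error: ~2.8 mm

low, high = mean - 1.96 * se, mean + 1.96 * se  # two-sided 95% CI: ~28 to 40 mm
one_sided_upper = mean + 1.645 * se             # one-sided upper limit: ~39 mm

print(f"mean {mean} mm (95% CI {low:.0f} to {high:.0f} mm); "
      f"one-sided upper limit {one_sided_upper:.0f} mm")
```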
If there were 20 phase II trials examining the new pain killer, each based on
40 patients, they would look similar to those in Figure 4.1, in that 19 would
contain the true mean value. But by chance, 1 in 20 studies (5%) could get the
wrong answer; i.e. it will miss the true mean value.
Standard deviation and standard error are sometimes confused, but they
have very different meanings:
• Standard deviation indicates how much the data is naturally spread out about the mean value (i.e. natural variability between people)
• Standard error relates not to the spread of the data, but to the accuracy with which it is possible to estimate the true mean, given a trial of a certain sample size.
Analysing outcome measures based on time-to-event data
When the endpoint involves measuring the time until an event occurs, a
Kaplan–Meier plot, median survival or survival rate at a specific time point
are used.# Figure 4.3 shows the survival curve for 50 subjects. The median
survival is 107 days, and the 95% CI is 70 to 115 days (calculated using statistical software, because the formula is not simple). Using the results from
this particular trial, the true median is estimated as 107 days, but there is 95%
certainty that the true value is somewhere between 70 and 115 days.

Figure 4.3 Kaplan–Meier plot of the survival times for 50 patients (each death is shown by a vertical drop).
The median time is useful when there are many events, and they occur continuously throughout the follow-up period, such as in studies of patients with
advanced disease. Otherwise, it can be skewed by only one or two events, and
therefore be unreliable.
The survival rate at 50 days is 83%, and the 95% CI for the true survival
rate is 72 to 94%. It should be noted that a rate only applies to a single time
point, so could be affected by chance variation. When the median survival is
not reached, or it is too dependent on one or two events, the survival rate at
a critical time point is more appropriate. The specified time point should be
one that is clinically relevant and chosen before the trial starts. The event rate
(here, death rate) could also be reported. It is 100 minus the survival rate: 17%,
95% CI: 6 to 28%.
# The word ‘survival’ used here refers to any event of interest occurring, not just death (see
page 27).
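In practice these summaries come from statistical software. The sketch below uses the lifelines package (an assumption on my part; any survival-analysis library would do) with made-up illustrative data, not the data behind Figure 4.3.

```python
# Kaplan-Meier estimates: median survival and the survival rate at 50 days.
from lifelines import KaplanMeierFitter

# Illustrative data: follow-up time in days, and 1 if the event (death)
# occurred or 0 if the subject was censored at that time.
days = [20, 45, 50, 60, 70, 90, 107, 110, 115, 120]
died = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]

kmf = KaplanMeierFitter()
kmf.fit(durations=days, event_observed=died)

print("median survival (days):", kmf.median_survival_time_)
print("survival rate at 50 days:",
      float(kmf.survival_function_at_times(50).iloc[0]))
```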
4.7 Interpreting and reporting phase II studies
Phase II trial reports should include a description of the characteristics of
the trial subjects, and summary tables on efficacy and side-effects. Confidence intervals should be reported for the main endpoints. The results of
a phase II trial are used to guide researchers on whether a phase III trial is
needed, which will eventually confirm or refute the early evidence that the
new treatment might be effective. A phase II study can also help to design a
subsequent larger trial, in terms of outcome measures, sample size and trial
conduct.
Many phase II studies are conducted in a few specialist centres, and by
experienced health professionals. Therefore, an observed beneficial treatment
effect, especially if the trial is not blind, may not be found in routine practice,
or the size of the effect may be over-estimated. Natural variation in how patients
respond to the same treatment, and the possible effect of bias, mean that phase
II data, which are based on a relatively small number of patients, should be
interpreted carefully.
When the outcome measure involves counting people, the statistical methods used to estimate sample size can also indicate how many events need to
be observed to justify further studies. From Table 4.1, 78 subjects are required
if the expected response of the new treatment is 20%, and the response using
standard therapy is 10%. Seeing at least 13 responders should provide sufficient evidence to warrant further investigation. However, if 12 or even 11
respond, further study should not automatically be ruled out, particularly if
the subjects had a poorer prognosis than originally anticipated. Similarly, 13
or 14 responders may not necessarily lead to further studies. The decision
to proceed to a phase III trial should be based on other endpoints, such as
side-effects, recruitment and patient acceptability, in addition to the response
rate.
When phase II trials involve randomising patients to the new and standard treatments, researchers almost always directly compare the outcome
measure between the trial groups, and report effect sizes and p-values (see
Chapter 7). Although this can be informative, there is sometimes a temptation to conclude that the new treatment is effective. Phase II studies are
not designed to provide this kind of definitive evidence. The results could
be a chance finding or, more likely, the treatment effect is over-estimated.
Care should be taken not to report a randomised phase II study that shows
a statistically significant effect as if it were a confirmatory phase III trial,
and make undue claims about efficacy. This could prevent further study, and
some health professionals may wrongly choose to change practice on the basis
of insufficient evidence, and consider conducting a larger phase III study
unethical. However, because the number of subjects in the study is relatively
small, other professionals will remain unconvinced, and the clinical community as a whole could be left in a state of uncertainty – an unsatisfactory
position.
Box 4.3 Example of comparing evidence from phase II and III
trials14,15
Thalidomide and advanced small cell lung cancer
Two small single-arm phase II trials and a small randomised placebo-controlled trial consistently suggested that thalidomide could greatly increase
survival time when used with standard chemotherapy; patients were living
noticeably longer than expected.
The percentages of patients surviving to one year in these three studies were
46% (n = 25), 52% (n = 30) and 49% (n = 49); all higher than the expected
value of 20–30%.
In the small randomised trial (based on giving thalidomide to patients who
had already responded to standard chemotherapy), the median survival was
11.7 (n = 49) and 8.7 (n = 43) months in the thalidomide and placebo arms
respectively; a substantial difference for this disorder.
However, a large double-blind placebo-controlled phase III trial (724 patients)
of thalidomide vs placebo was conducted. The results showed no evidence of
an effect. The median survival was 10.1 and 10.5 months in the thalidomide
and placebo arms respectively. The one-year survival rates were 37% and 41%
respectively.
Some treatments that appear effective in phase II studies are shown to be
ineffective when tested in a phase III trial. An example is shown in Box 4.3.
Conversely, there are likely to be some effective treatments that are not investigated further because phase II data were not supportive of an effect.
Phase II studies provide valuable initial information about a new treatment. ‘Positive’ results are used to underpin the justification for a larger trial,
thereby making such a trial more likely to be funded, and for it to obtain
approval from a regulatory authority and ethics committee. Even if the data
were negative, indicating that there is unlikely to be a beneficial effect, it is useful to have this information because it means valuable subjects and resources
have not been wasted by having a larger study.
4.8 Summary points
• Phase II studies are a useful way of obtaining preliminary information about a new intervention in a relatively small number of subjects.
• There are several different designs, including those that have a comparison arm. The design should be specified before the trial commences.
• Subjects should be monitored closely, especially for side-effects.
• The results of phase II studies are generally descriptive, focusing on the size of the effect of the new intervention and the 95% confidence interval.
• The characteristics of patients entered into the trial should be described in sufficient detail.
• Careful consideration should be given to interpreting the data from randomised phase II studies that contain a control arm, particularly if they produce positive results.
• The decision to conduct a larger, confirmatory trial should depend on several factors: efficacy, safety and feasibility.
References
1. Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer Treat Rep
1985; 69:1375–1381.
2. Scher HI, Heller G. Picking the winners in a sea of plenty. Clinical Cancer Research 2002;
8:400–404.
3. Steinberg SM, Venzon DJ. Early selection in a randomised phase II clinical trial. Statist
Med 2002; 21:1711–1726.
4. Lee JJ, Feng L. Randomized phase II designs in cancer clinical trials: current status and
future directions. J Clin Oncol 2005; 23:4450–4457.
5. Rubinstein LV, Korn EL, Friedlin B et al. Design issues of randomised phase II trials and
a proposal for phase II screening trials. J Clin Oncol 2005; 23:7199–7206.
6. Wieand HS. Randomized phase II trials: what does randomisation gain? J Clin Oncol
2005; 23:1794–1795.
7. Machin D, Campbell M, Fayers P, Pinol A. Sample size tables for clinical studies. 2nd
edn. Blackwell Science, 1997.
8. Tan SB, Machin D. Bayesian two-stage designs for phase II clinical trials. Stat Med 2002; 21:1991–2012.
9. A’Hern R. Sample size tables for exact single-stage phase II designs. Stat Med 2001;
20:859–866.
10. Fleming TR. One-sample multiple testing procedure for phase II clinical trials. Biometrics
1982; 38(1): 143–151.
11. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials
1989; 10:1–10.
12. PASS (Power Analysis and Sample Size software): http://www.ncss.com/pass.html
13. nQuery: http://www.statsol.ie/html/nquery/nquery_home.html
14. Lee S-M, Buchler T, James L et al. Phase II trial of carboplatin and etoposide with thalidomide in patients with poor prognosis small-cell lung cancer. Lung Cancer 2008; 59(3):364–8.
15. Lee SM, Rudd RM, Woll PJ et al. Two randomised phase III, double blind, placebo
controlled trials of thalidomide in patients with advanced non-small cell lung cancer
(NSCLC) and small cell lung cancer (SCLC). J Clin Oncol 2008; 26 suppl; abstr 8045.
CHAPTER 5
Design of phase III trials
A randomised controlled trial (phase III trial) should provide enough evidence to warrant a change in practice. There are various types of randomised
trials, and the design depends on the objectives. The principles of minimising bias and confounding, and the advantages of blinding are presented in
Chapter 1.
5.1 Objectives of phase III trials
The main objective of a phase III study is usually based on efficacy or safety,
or both. Box 5.1 summarises common trial objectives in relation to two interventions. The method of sample size estimation depends on the appropriate
objective. Defining what is ‘better’ or ‘worse’ depends on the outcome measure used. Common efficacy endpoints are mortality, occurrence of the disease
of interest, further advancement (progression) of a disease being treated, cure
or relief of chronic symptoms, or change in lifestyle or behaviour. In conducting equivalence or non-inferiority trials, the aim is usually to show that two
interventions have a similar efficacy, but one is safer, more cost-effective or
easier to administer.
There are also bioequivalence drug trials, in which two forms of the same
drug, for example, produced using a new method or a different formulation,
are compared, rather than two different drugs. All that is required is to determine that a similar amount of drug is taken into the body (i.e. similar bioavailability), and this can be done using a biochemical marker or other surrogate.
A completely new trial with one of the common true efficacy endpoints such
as mortality or disease cure is unnecessary. If bioequivalence is demonstrated,
it is assumed that there would be the same effect on a true endpoint.
5.2 Types of phase III trials
Common trial designs are illustrated in Figure 5.1. There are several key considerations:
• What are the interventions?
• What is the main objective and corresponding outcome measure?
• Do the researchers or subjects know which intervention has been allocated (single or double blinding)?
• Are there independent groups of subjects, where each subject receives only one treatment (parallel groups or unpaired data), or does each subject receive all the trial treatments (crossover trial or paired data)?
Box 5.1 Trial objectives
Comparing two interventions, A and B
(B could be the standard treatment, placebo or no intervention)
Superiority: A is more effective than B
Equivalence: A has a similar effect to B
Non-inferiority: A is not less effective than B (i.e. it could have a similar effect or be better)
‘Effect’ is associated with any primary trial endpoint, such as death, or occurrence or recurrence of a disorder
Equivalence and non-inferiority trials are usually conducted when the new
intervention is expected to have fewer side-effects, be more cost-effective or
be more convenient to administer.
Most trials have parallel groups: each group of subjects receives only one
intervention. They are used when treatments have long-lasting effects, such
as life-threatening disorders, or for disease prevention or cure. For chronic
disorders, where the desired outcome is relief of symptoms rather than disease cure, it is possible to allocate both the new and standard treatment to the
same subject in sequence in a crossover trial. This design is also used for bioequivalence trials (page 57). Instead of randomly allocating subjects to treatment arms, the ordering of the treatments is random, so that similar numbers of people receive treatment A first and treatment B first. If at least three treatments are evaluated, a Latin square design could be used. The strength of
a crossover study is that there are essentially identical treatment groups: each
subject is his/her own comparison.
Occasionally, it is possible to administer the two interventions at the same
time (split-person design). For example, in dentistry, in comparing the effect
of two types of fissure sealants on future caries risk, one sealant method could
be applied to the left side of the mouth and the other sealant to the right side
(called a split-mouth design). In medicine, a new topical cream for psoriasis
could be evaluated by being applied to one arm and a standard cream applied
to the other arm.
Crossover designs have limitations. There should be no residual (carryover)
effect from the first treatment that influences the response to the second treatment, because this could make it difficult to compare them and distinguish
their effects reliably. To minimise this problem, a sufficiently long washout
period is required – a length of time between the two trial treatments when
neither are given. Deciding how long the washout period needs to be depends
on the aetiology of the disorder of interest and the pharmacological properties of the trial treatments. Also, the extent of the disorder should reverse back to what it was at baseline after the washout period, i.e. the subject is not cured after the first treatment. In crossover studies, there may also be a period effect, in that the ordering of the treatments matters: people who have A then B respond differently to those who have B then A. This can be allowed for in the statistical analysis. If there is uncertainty over the strength of the carryover effect, or the period effect, it may be preferable to use a standard two-arm trial.

Figure 5.1 Illustration of phase III trial designs using unpaired (parallel) or paired data (crossover or split-person). The solid arrows indicate where the randomisation process takes place. [Figure: in the parallel-group design, 500 subjects are randomised, 250 to the new treatment and 250 to the standard treatment. In the crossover design, 250 subjects receive the new treatment and then the standard treatment six months later, and 250 receive them in the reverse order. In the split-person design, 250 subjects receive the new treatment on the left side and the standard treatment on the right side, and 250 the reverse.]
Several different treatment combinations or several doses of the same drug
can be evaluated in three or more arms (Figure 5.2). A special case of a
multi-arm study is a factorial trial. There are two new interventions, and each
is to be compared with a control arm. This is an efficient design because it
avoids having two separate two-arm trials, which would mean many more
trial subjects in total. There are two distinct contexts: one in which the treatments should not interact with each other, and the other in which an interaction is expected. An interaction occurs if the combined effect of A and B
differs from what would be expected by combining the effects seen when A
and B are given separately (see page 107). To examine an interaction effect
the trial would have to be larger than if no interaction were assumed. Figure
5.2 shows an example of a trial that evaluated folic acid and a multi-vitamin
combination for preventing neural tube defects among pregnant women. The
following comparisons could be made:
• B + D vs A + C (is folic acid effective?)
• C + D vs A + B (are multivitamins effective?)
• D vs B (is folic acid plus multivitamins better than folic acid alone?)
• D vs C (is folic acid plus multivitamins better than multivitamins alone?).
A factorial trial can only be conducted if both interventions can be given to a subject, and safety should be monitored closely, especially in the combined arm.

Figure 5.2 Examples of multi-arm trials. [Figure: three examples. Comparing different interventions: 900 patients with early breast cancer are randomised to chemotherapy + surgery (300), radiotherapy + surgery (300) or surgery alone (300). Comparing different doses: 10 000 people at high risk of stroke are randomised to placebo or 100 mg, 200 mg or 300 mg of aspirin (2500 each). Comparing two new interventions with a control (factorial): 1800 women planning a pregnancy are randomised to four groups of 450 – Group A: no folic acid, no multivitamins; Group B: folic acid only; Group C: multivitamins only; Group D: both.]
Allocating individuals or groups of individuals to the trial groups
Most trials involve randomising individual subjects to different arms, and
this is the preferable approach. However, there are occasions when this is
not practical, and groups of subjects are randomised instead. An example
would be a trial to determine whether a new educational programme aimed
at teenagers could reduce their prevalence of smoking and alcohol drinking.
The trial could compare the programme with no intervention (control). If
children were randomised within the same school to either the programme
or control, the effect of the programme could be diluted because children will
mix with each other and share their experiences of the programme. Also, the children would have to be separated out to deliver the programme to some and not others, and this may present practical difficulties. An alternative is to
randomise schools. All children in one school receive the new programme
and all those in another school become the control group. Several schools
would be randomised in this way, and it is often a more practical way of
delivering both interventions. This is called cluster randomisation or a
cluster randomised trial. Both the sample-size calculation and statistical
analysis at the end of the trial should allow for this type of design (see page
109). Data is still obtained from each trial subject.
5.3 Choosing outcome measures
There is some flexibility in the choice of outcome measures in phase II studies,
including surrogate endpoints (see page 17). This is done on the understanding that a subsequent, larger trial will use a true endpoint. In phase III trials
the main outcome measure needs to be chosen carefully and well defined so
that the trial objectives are met, and the results persuade health professionals
to change practice. For many trials, the choice of endpoint will be easy, for
example death, the occurrence or recurrence of a specific disease, or a change
in habits. The main endpoint should be clinically relevant to both the trial
subject and the researcher.
When a trial is not blind, the endpoint should be chosen such that the lack
of blinding has a minimal effect, because knowing which treatment is received
could affect the value of the outcome measures (page 13). The main endpoint
should therefore ideally be an objective, rather than a subjective, measure.
Possible examples of objective measures are some blood values, radiological measurements (such as from an X-ray or CT scan) and physiological
measurements (such as motor function). If a trial is double-blind, neither
objective nor subjective measures should be affected, so either could be used.
Many subjective measures are those reported by the subject, such as pain level
or health-related quality of life.
Below are possible endpoints in a randomised trial of a flu vaccine in the
elderly to prevent flu, presented on page 92:1
• Self-reported flu-like symptoms, using a questionnaire completed by patients
• Serological evidence of the flu virus, i.e. an increase in antibodies against influenza detected in a blood sample
• Diagnosis by a clinician after the patient presented with flu-like symptoms
• Hospital admission for respiratory disease (not used in the trial1).
All are valid outcome measures, though they have their strengths and limitations.
‘Self-reported’ flu is easy to measure because there is no need for an assessment by a clinician or a blood test. Patients complete a questionnaire at home and send it to the co-ordinating centre, where the responses are examined and subjects are classified as having flu or not according to a set of criteria. However, it is a subjective measure that may have wide variability, which
could mask a moderate treatment effect. If patients are not blind to treatment they could easily affect this type of outcome measure. Those given the vaccine may be less likely to report fever and headaches, because they think that these symptoms are unrelated to flu, while those given placebo may be more likely to report such symptoms even when they are unrelated to flu. This bias would make the vaccine seem effective when it is not, or would over-estimate the effect.
‘Serological evidence’ is an objective measure, which should be unaffected
by a lack of blinding. However, there will be some individuals who have evidence of the flu virus in their blood but may not feel unwell enough to go to
their doctor, or are unaffected, so the clinical importance of this endpoint is
uncertain. This measure also involves taking a blood sample, which needs to
be stored appropriately and analysed in a laboratory, both of which have a
cost implication.
‘Diagnosis by clinician’ is perhaps in between the previously mentioned
subjective and objective measures. It might be considered clinically important
because these are the people who have felt so unwell that they decided to go
to their doctor. They are more likely to seek medication to relieve their symptoms, go to hospital for respiratory problems, or die from flu. The clinician
uses standardised criteria to help classify patients as having flu or not, but
this still requires some judgement. Again, knowing whether the patient had
the vaccine or placebo could affect the clinical diagnosis of flu, in a similar
way to the self-reported outcome.
‘Hospital admission for respiratory disease’ would be associated with the
more severely affected patients. It might be less affected by a lack of blinding.
However, it relies on the trial researchers being notified of all admissions of
the trial participants. This endpoint can also be used to evaluate the financial
costs to the health service provider.
Determining which is the best outcome measure needs careful thought. One
of the reasons for having a public health vaccination programme in the elderly
is to reduce the morbidity and mortality caused by acquiring the flu virus, and
examining hospital admission would address this directly. However, because
the proportion of elderly people who are admitted to hospital for flu-related
respiratory disorders is low, a large trial is needed in order to see a sufficient
number of admissions to be able to conclude reliably that the vaccine had an
effect. Serology and diagnosis by clinician are perhaps the most appropriate
and complementary endpoints. One is objective, while the other indicates the
impact on part of the health service. There are situations when no single endpoint is ideal. The choice of outcome measure will also depend on the aim of
the trial, the disorder of interest, the interventions being tested, and whether
it would change practice.
5.4 Composite outcome measures
While some trial endpoints are associated with the occurrence of a single
event, others consist of several events combined into one; called a composite
endpoint. An example comes from trials of primary or secondary prevention
of cardiovascular disease that have evaluated statin therapy using an endpoint with four components: fatal or non-fatal coronary heart disease, or fatal
or non-fatal stroke. Composite endpoints avoid having to deal with several
separate outcome measures at the same time, and they increase the number of
events, making it easier to detect a treatment effect, if it exists, and to achieve
statistical significance.# Figure 5.3 is a hypothetical example, in which it is
assumed that Treatment A has a similar effect for each of the events (the point
estimates for the relative risk are similar), but on their own the events are
not statistically significant (the 95% confidence interval line includes the no
effect value of 1.0). The composite endpoint is statistically significant, because
it is based on more events. The events have equal ‘weight’, for example it is
assumed that subjects consider a non-fatal myocardial infarction as important
as non-fatal stroke. Where this is unlikely to be true, it is possible to give different numerical weights to each constituent event.
# Statistical significance and confidence intervals are presented in Chapter 7.

Figure 5.3 Hypothetical results of a trial of treatment A versus placebo. The number of events is shown in brackets for each endpoint. The relative risk is the proportion of patients who had an event with Treatment A divided by the proportion on placebo. A relative risk of 1.0 (the no-effect value) means that Treatment A and placebo had the same effect. MI: myocardial infarction. [Figure: relative risks and 95% CIs are plotted on a scale from 0.3 to 3.0, with 1.0 as the no-effect line (‘Treatment A better’ to the left, ‘placebo better’ to the right), for death (25 events), non-fatal stroke (50), non-fatal MI (115) and the composite (190).]
A limitation of composite endpoints is that a new intervention could work
for some but not all of the constituent events. Figure 5.4 shows the results
from a randomised trial in patients with angina, comparing invasive with
medical therapy.2 The composite endpoint result is clearly driven by hospital
admission for acute coronary syndrome. There is no clear evidence of a benefit
associated with death or non-fatal myocardial infarction: the 95% confidence
intervals contain 1.0, i.e. there is a possibility of no real difference between the
interventions. It may then be difficult to make claims of effectiveness. A solution is to present analyses in the final report based on the composite and each of the constituent events, and discuss the implications of the results.

Figure 5.4 Randomised trial comparing invasive with medical therapy for patients with angina. The hazard ratio is the risk of having an event with ‘invasive therapy’ divided by the risk with ‘medical therapy’, at any time point (discussed in reference 2). MI: myocardial infarction; ACS: acute coronary syndrome. [Figure: hazard ratios and 95% CIs are plotted on a scale from 0.1 to 4, with 1.0 as the no-effect line (‘invasive better’ to the left, ‘medical better’ to the right), for death, non-fatal MI, admission for ACS and the composite.]
When using composite endpoints the first occurrence of any of the events is
used. This is because the clinical management of the patient may affect the risk
of any subsequent event that occurred after the first event, making it difficult
to distinguish the effects of the interventions.
Defining a composite endpoint should be done at the start of the trial, i.e.
in the protocol, with clear justification for the constituent events. If the trial
results are to be used for licensing, the validity of the composite endpoint
should first be verified with the regulatory body (e.g. the US Food and Drug Administration) to ensure it will provide the evidence needed for a successful application. However, difficulties may still arise if differences in
the composite endpoint appear to be the result of differences in only one of
the constituent endpoints. See Montori et al. for a concise discussion.2
5.5 Having several outcome measures (multiple endpoints)
Having one primary endpoint is often easier with trials of life-threatening disorders such as coronary heart disease, stroke or cancer, in which a common
endpoint is mortality, or occurrence or recurrence of a disorder. However, for
many chronic diseases there could be a range of possible endpoints and the
temptation exists to include most or all of them in a trial. A new intervention
may appear to work for some endpoints but not others, making it difficult to
interpret the results of the trial. It may also be viewed as a ‘fishing expedition’, i.e. deliberately choosing many endpoints in the hope that at least one
will show an effect. Having multiple endpoints increases the chance of finding
a spurious effect, unless the sample size is increased. Given these considerations, it is preferable to focus on one or two primary endpoints, and stipulate
at the start of the trial that these will be used to determine whether practice
should change. The other endpoints should be treated as secondary outcome
measures, used to provide further information about the effect of the intervention on the disorder. If there are two or more primary endpoints the sample-size calculation and statistical analysis may need to allow for this (see page
115).
5.6 Fundamental information needed to estimate
sample size
The sample size for a phase III trial is based on directly comparing two or
more groups of subjects.
Information needed to calculate sample size
• Expected effect in the new intervention group
• Expected effect in the control (comparison) group
• Significance level (usually 5%, at the two-sided level*)
• Power (usually 80% or 90%).
* One-sided if the trial objective is to examine whether the new treatment is not worse than the control (i.e. non-inferiority), or that it can only be better.
Deciding how many subjects to recruit is an important aspect of design for a
phase III trial, because these studies aim to change practice. If there are too
few subjects, a clinically important difference may be missed. If there are too many subjects, resources could be wasted and a delay may occur in offering a superior treatment to future patients. There are three elements that determine sample size (Box 5.2):

Box 5.2 Steps in estimating sample size
The effect size to be detected*, the power (usually 80% or 90%) and the significance level (usually p = 0.05) are fed into the calculation, which produces the sample size.
* The effect size depends on the type of outcome measure used (see Chapter 7):
• ‘Counting people’ – two proportions (for relative risk or risk difference)
• ‘Taking measurements on people’ – standardised difference (for the difference between two means)
• ‘Time-to-event data’ – two survival rates at a specific time point, or two median survival times (for hazard ratio)
• Any other statistic that is associated with making comparisons
Expected effect size
The term ‘effect size’ is used to compare an endpoint between two trial groups.
It could be summarised by a relative risk, risk difference, hazard ratio or difference between two means. These are presented in Chapter 7 (Sections 7.1 to
7.4). The magnitude of the effect size could be based on previous knowledge,
for example, from a phase II trial, or one that is judged to be associated with
a minimum clinically important effect. For equivalence and non-inferiority
trials a range is specified within which it is possible to say that a new intervention has a similar effect (maximum allowable difference, see page 69).
Level of statistical significance
Statistical significance is often set at 5% (0.05): the results will be determined
to be statistically significant at this level.# This is the chance of finding an
effect when in reality one does not exist, so the conclusion of the trial would
be wrong. If there are multiple primary endpoints (e.g. three) a lower level can
be specified (e.g. 0.017, calculated as 0.05/3). In conducting superiority trials
a two-sided level is typically used, usually at the 5% level, to allow the new
intervention to be better or worse than the control group. Sometimes, a more
stringent 1% level is specified. One-sided tests should only be used when the new intervention can only be better. For equivalence trials, it is important to exclude a possible difference that is more extreme than the maximum allowable difference in either direction (Figure 5.5). The total significance level can be set at 2.5% or 5%. For non-inferiority studies, the aim is to reliably exclude the possibility that the new treatment is worse than the control group, and therefore a one-sided level of 2.5 or 5% is usually specified.
# Some textbooks refer to statistical significance as α, or Type I error.

Figure 5.5 Illustration of three possible comparisons between Treatment A and a control group. The effect could be the difference between two percentages, for example, the percentage of patients who recover from the disorder. ‘D’ is the maximum allowable difference. The true difference could be any point on the horizontal bars. [Figure: the true effect is shown on an axis running from −D through 0 to +D, with ‘control is better’ to the left of 0 and ‘A is better’ to the right; one horizontal bar is shown for each of superiority, equivalence and non-inferiority.]
Power
Power can be interpreted as the chance of finding an effect of the magnitude specified, if it really exists. A high power, such as 80 or 90%, is required and is used by most trials.# However, if the trial will be unique or is expected to have a significant impact on health practice, researchers may choose 95% power, though this greatly increases the sample size, and the feasibility of this needs to be considered.
Power
At the end of the trial the following statement needs to be true:
‘The observed difference of 50% vs 65% is statistically significant at the 5%
level’
There needs to be an 80% chance of being able to make this statement (80%
power), if there really is a difference of at least 15 percentage points.
Changing any of these three elements affects sample size (Box 5.3).
# Some textbooks refer to 100 minus power as β, or Type II error.
Box 5.3 Why sample size would get larger
Sample size goes up when:
• The effect size gets smaller – it is harder to detect small differences than large ones
• The power goes up – this increases the chance of picking up the effect if it really exists
• The significance level goes down – this decreases the chance of saying there is an effect when there is no true effect
What is also important is the number of events when the endpoint is ‘counting people’ or ‘time-to-event’ data. While it is expected that large studies should have many events, a large study with few events can have low power to detect small or moderate treatment effects.
5.7 Method of sample-size calculation
To determine the method of sample-size calculation, one option from each of
the three following features should be chosen:
• The type of outcome measure used:
  – Counting people
  – Taking measurements on people
  – Time-to-event data
• What is being sought when comparing the two interventions:
  – Superiority
  – Equivalence
  – Non-inferiority
  – Factorial (if looking for an interaction)
• Having separate patient groups, or one group that receives all treatments:
  – Parallel group
  – Crossover (split-person).
There are several methods.3–6 Free and commercially available software is also available,3,7–9 and statistical software packages have sample-size facilities.10–13 It is, however, worth working with a statistician when designing the trial and estimating sample size.
Type of outcome measure
When the outcome measure is based on counting people the two expected
percentages (or risks) are specified for each trial arm. The sample size for a
difference depends on the actual value of the percentages. For example, the
sample size comparing 10 vs 15% is 1372 subjects, but for 50 vs 55% it is 3130,
even though the difference is 5 percentage points in both cases.
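These figures can be approximated with the usual Normal-approximation formula for comparing two proportions. The sketch below is my own implementation, not the author’s; published tables and software use slightly different corrections, so expect totals within a few subjects of those quoted.

```python
# Total subjects for a two-arm parallel-group superiority trial comparing two
# proportions, with a two-sided 5% significance level (z = 1.96) and 80% power
# (z = 0.8416).
import math

def total_subjects(p1, p2, z_alpha=1.96, z_beta=0.8416):
    per_arm = ((z_alpha + z_beta) ** 2
               * (p1 * (1 - p1) + p2 * (1 - p2))
               / (p1 - p2) ** 2)
    return 2 * math.ceil(per_arm)

print(total_subjects(0.10, 0.15))  # 1366 (the text quotes 1372)
print(total_subjects(0.50, 0.55))  # 3124 (the text quotes 3130)
```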
When taking measurements on people, the expected mean value of the outcome measure in each group and the standard deviation must be specified, and the standardised difference calculated. The standard deviation is assumed to be similar between the groups.
Standardised difference = (Mean value in Group 1 − Mean value in Group 2) / Standard deviation of the measurement
The effect size is therefore defined in terms of the number of standard deviation units. An advantage of working with the standardised difference is that it
has no specific unit of measurement, and it does not depend on the actual
values of the two individual means (unlike ‘counting people’ measures). This
means that the same standardised difference associated with any comparison
yields the same sample size.
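For ‘taking measurements’ endpoints the same idea reduces to a compact formula: subjects per arm is approximately 2(z_alpha + z_beta)^2 / (standardised difference)^2. A hedged sketch (the 0.4 value reuses the diet illustration from Table 4.2; the resulting two-arm number is my own calculation, not one quoted in the text):

```python
# Subjects per arm for a two-arm parallel-group trial, given a standardised
# difference (two-sided 5% significance level and 80% power by default).
import math

def per_arm(delta, z_alpha=1.96, z_beta=0.8416):
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# e.g. means of 68 vs 70 kg with a standard deviation of 5 kg
print(per_arm(abs(68 - 70) / 5))  # standardised difference 0.4 -> ~99 per arm
```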
For time-to-event data, there are various methods to estimate sample size
depending on how the effect size is specified. For example:
• The expected survival rate at a specific time point in each trial arm
• The expected survival rate in the control arm, and the expected hazard ratio (a sketch for this case follows the list)
• The expected median survival time in each arm, with additional information on the length of the recruitment and follow-up periods.
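For the hazard-ratio route, a common approximation (Schoenfeld’s formula, not one spelled out in the text; references 3 to 6 cover fuller methods) gives the required number of events, rather than subjects, for a 1:1 randomisation:

```python
# Approximate number of events needed to detect a given hazard ratio, using
# Schoenfeld's formula with a two-sided 5% level and 90% power (z = 1.2816).
import math

def events_needed(hazard_ratio, z_alpha=1.96, z_beta=1.2816):
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)

print(events_needed(0.75))  # ~508 events
```

The number of subjects is then whatever recruitment and follow-up are needed to observe that many events.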
What is being sought when comparing the two interventions?
When examining two interventions, it is necessary to determine whether they
are likely to have a different or a similar effect. To show that one treatment is better than another is relatively easy: only the two expected percentages in each trial arm, or the standardised difference, are required in the sample-size calculation. To show that two interventions have a similar effect, an acceptable
degree of difference needs to be specified, i.e. the maximum allowable difference (or equivalence range or equivalence limit). Figure 5.5 illustrates three
possible scenarios. There is no rule for determining the maximum allowable
difference (D). It could be a third or half of what is considered a clinically
important effect. As long as the effect size is within this equivalence range, it
can be concluded that the two interventions have a similar effect.
Equivalence studies aim to show that the observed effect size and its confidence interval are within a relatively narrow range around the no-effect value
(e.g. page 108), but not more extreme than D at either end. For example, if
the trial endpoint is the percentage of patients alive at one year, two treatments could be compared by taking the difference between the two percentages (new treatment minus standard). Specifying a value of D = 10% means
that the corresponding sample size should produce a confidence interval for
the difference that is completely within ±10% to conclude equivalence (if they
really have a similar effect).
For non-inferiority studies, only one end of the confidence interval for the
effect size should not exceed D. It allows for the possibility that one treatment
is actually better than the other, or that they are similar. Using the above example, this simply means that the sample size estimated for this study design
should not produce a confidence interval that has a lower limit that exceeds
–10%.
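The two decision rules can be written down directly: equivalence requires the whole confidence interval to lie inside (−D, +D), whereas non-inferiority only requires the lower limit to stay above −D. A minimal sketch with illustrative numbers of my own (not taken from the text):

```python
# Equivalence vs non-inferiority decisions for a difference in proportions.
import math

def decisions(p_new, p_std, n_per_arm, d=0.10):
    diff = p_new - p_std
    se = math.sqrt(p_new * (1 - p_new) / n_per_arm
                   + p_std * (1 - p_std) / n_per_arm)
    low, high = diff - 1.96 * se, diff + 1.96 * se
    equivalent = -d < low and high < d   # whole 95% CI within the +/-D range
    non_inferior = low > -d              # lower limit does not cross -D
    return round(low, 3), round(high, 3), equivalent, non_inferior

print(decisions(0.42, 0.40, n_per_arm=520))  # CI ~(-0.04, 0.08): both hold
```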
Specifying a value for D is sometimes difficult. For example, if the main
endpoint is the cure rate for a disorder at one year and it is expected that
a new intervention has the same rate as standard therapy (say 40%), what
maximum allowable difference could be taken to conclude equivalence? If the
trial is designed to detect a true cure rate within ±1 percentage points, it is
possible to be very confident that the new treatment has an equivalent effect.
However, obtaining such a narrow confidence interval requires a very large
study (75 000 subjects, 80% power and two-sided level of 5%). Specifying that
the cure rate must be within a wide range of ±15 percentage points, requires a
much smaller trial (330 subjects). However, with a possible cure rate of 25% (40
minus 15%) it will probably not be considered by many health professionals
that the new treatment has a similar effect. The value for D is therefore a balance: large enough to keep the trial feasible, but small enough that the result would persuade the health community to change practice. In the example, perhaps a D of between 5 and 10 percentage points would be acceptable. A similar principle applies to non-inferiority trials.
Factorial trials are often used to examine superiority, and the sample size for the comparison of each main effect can be treated as if it were from a two-arm trial. However, if the researcher wishes to have enough statistical power to look at the interaction, the sample size needs to be increased.
Having separate subject groups (parallel) or one group receiving all
treatments (crossover)
Generally, crossover trials need approximately half the number of subjects
required for parallel group trials, because there is not as much natural variation as having two separate groups of people; each subject acts as his/her
own control. It becomes easier to detect treatment effects by greatly reducing
variability, so a smaller number of subjects is needed.
5.8 Examples of sample-size calculations
Table 5.1 shows sample sizes based on ‘counting people’ endpoints in a superiority trial. It gives an indication of how large a phase III trial needs to be, and
how sample size depends on the effect size and power. A similar observation
is seen for ‘taking measurements on people’, or time-to-event data. Choosing
a sample size that seems feasible in a certain timeframe and then specifying
the effect size is not a good approach, because the effect size is probably quite
different from reality. Trials should set out to detect the minimum clinically
important difference. The sample-size estimate only reflects the contributing assumptions. If the assumptions are unrealistic, the size of the trial will be too small or too large.
Table 5.1 Examples of sample sizes when the outcome measure is based on ‘counting people’. The table shows the total number of trial subjects required for a two-arm parallel-group study.*

% expected with the   % expected with the      Effect size %   Total subjects required*
outcome in the        outcome in the new       (difference)    Power 80%    Power 90%
control arm           intervention arm
5                     10                        5                 870         1160
5                     15                       10                 280          380
5                     20                       15                 150          200
5                     25                       20                 100          130
10                    15                        5                1370         1840
10                    20                       10                 400          530
10                    25                       15                 200          270
10                    30                       20                 120          160
50                    55                        5                3130         4190
50                    60                       10                 780         1040
50                    65                       15                 340          450
50                    70                       20                 190          250

* rounded to the nearest 10
If the assumptions are unrealistic, the size of the trial will be too
small or too large.
Sample sizes for non-inferiority and equivalence studies are larger than those for
superiority trials, because the effect size associated with the maximum allowable difference is usually smaller than what is considered to be a clinically
important effect, or the significance level is smaller than the 5% level (Box
5.3). Specifying a large maximum allowable difference to minimise the number of trial subjects should be avoided because it could produce results that
are too imprecise, making it difficult to draw reliable conclusions.
Table 5.2 provides examples of sample-size descriptions that could be used
in a grant application or trial protocol. It is useful to justify the specified effects
in each trial arm, or the effect size, using previous studies or unpublished
evidence. Box 5.4 shows two quick formulae for estimating sample size for
superiority trials.
5.9 The importance of having large enough trials, and
specifying realistic effect sizes
There is nothing precise about a sample-size estimate. It provides an approximate size of the trial. It does not matter if one set of assumptions yields 500 subjects but another gives 520, because this represents only an extra 10 subjects
per trial group. What is more important is whether 500 or 1000 subjects are
needed. There is always some guesswork involved in specifying the assumptions for sample size, particularly when determining the effect size, which is
often quite different from what is observed at the end of the trial.
Table 5.2 Hypothetical sample size descriptions that could be used in a grant application or protocol. (The numbers in bold are the ones used in the calculation to produce the sample size.)

Counting people – Superiority:
The proportion who develop flu by 5 months is 10%. It is expected that the flu vaccine would decrease the incidence to 5%. To detect a difference of 10 vs 5% requires a trial of 580 subjects in each arm (vaccine and placebo), with 90% power and two-sided test of statistical significance at the 5% level. Total trial size is 1160 subjects.

Counting people – Equivalence:
The proportion of patients who normally respond to standard treatment is 55%. Drug A is expected to have an equivalent effect. A maximum allowable difference of ±10 percentage points will be used to conclude Drug A and standard treatment are equivalent. To show this requires a trial of 520 subjects per arm, with 80% power and 2.5% level of statistical significance, two-sided test. Total trial size is 1040 patients.

Counting people – Non-inferiority:
The proportion of patients who usually respond to treatment is 50%. Therapy B should not have a response rate that is much worse than this. A maximum allowable difference of up to −5 percentage points (i.e. a response rate not below 45%) would indicate that Therapy B is not inferior. To show this requires a trial with 1570 patients in each arm, with 80% power and one-sided test of statistical significance at the 2.5% level. Total trial size is 3140 patients.

Taking measurements on people – Superiority:
The mean loss in body weight using conventional diets is 5 kg. It is expected that Diet K would be associated with a mean weight loss of 8 kg. The standard deviation of weight change is 4 kg. To detect a standardised difference of 0.75 [(8 − 5)/4] requires a trial of 40 patients in each arm with 90% power and two-sided test of statistical significance at the 5% level. Total trial size is 80 subjects.

Taking measurements on people – Non-inferiority:
A pain killer, with fewer side-effects, is expected to be no worse than standard treatments. The usual mean pain score on a visual analogue scale (VAS) is 75 mm, with a standard deviation of 40 mm. The new drug should not be worse than 85 mm (i.e. the mean VAS needs to be lower than this), corresponding to a maximum allowable difference of 10 mm. To show this requires a trial of 340 patients in each arm, with 90% power and one-sided test of statistical significance at the 2.5% level. Total trial size is 680 patients.

Time-to-event data – Superiority:
The median survival associated with the standard treatment is 18 months. Therapy A is expected to have a median survival of 24 months. It is expected that the recruitment of patients would take 36 months, after which there will be 12 months of follow-up (i.e. the total length of the trial is 4 years). To detect a difference of 18 vs 24 months requires 315 patients in each treatment arm, with 80% power and two-sided test of statistical significance at the 5% level. Total trial size is 630 patients.
Box 5.4 Quick formulae for estimating sample size for superiority trials (two-sided 5% level of statistical significance)

F = 8 for 80% power; F = 10.5 for 90% power

‘Counting people’:
Expected percentage on new treatment = P2
Expected percentage on standard treatment = P1
Number of subjects in each arm = F × [P1 × (1 − P1) + P2 × (1 − P2)] / (P2 − P1)²
Example: P1 = 0.20 (20%), P2 = 0.30 (30%): 80% power, N = 296; 90% power, N = 390

‘Taking measurements on people’:
Expected mean value on new treatment = M2
Expected mean value on standard treatment = M1
Standard deviation = SD
Standardised difference Δ = (M2 − M1)/SD
Number of subjects in each arm = F × 2/Δ²
Example: M2 = 7.0, M1 = 5.0, SD = 3.5, so Δ = 0.57: 80% power, N = 50; 90% power, N = 65
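The quick formulae can be coded directly. A minimal Python sketch, rounding up to whole subjects, reproduces the Box’s two examples (any small discrepancies with the Box arise only from rounding the standardised difference):

    import math

    F80, F90 = 8, 10.5  # multipliers for 80% and 90% power (Box 5.4)

    def n_per_arm_counting(p1, p2, f):
        # 'Counting people': F x [P1(1 - P1) + P2(1 - P2)] / (P2 - P1)^2
        return math.ceil(f * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

    def n_per_arm_measuring(m1, m2, sd, f):
        # 'Taking measurements': 2F / delta^2, where delta = (M2 - M1) / SD
        delta = (m2 - m1) / sd
        return math.ceil(2 * f / delta ** 2)

    print(n_per_arm_counting(0.20, 0.30, F80))      # 296, as in the Box
    print(n_per_arm_measuring(5.0, 7.0, 3.5, F90))  # 65, as in the Box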
The smaller the true effect size, the larger the study needs to be, because
it is more difficult to distinguish between a real difference and random variation. Consider mortality as the main endpoint in a trial comparing Drug A
and placebo, with 100 subjects per group. If the one-year death rate is 15%
for Drug A and 20% for placebo, the effect size, expressed as a risk difference, is five percentage points# – this represents only five fewer deaths among
100 subjects given Drug A. It is not easy to tell whether this difference is
real, i.e. a true treatment effect, or simply due to chance. There could just
happen to be five fewer deaths in this trial arm. However, if the death rates
were 5% versus 40%, this would be a difference of 35 percentage points, or 35
fewer deaths among 100 subjects on Drug A, and this is unlikely to be due to
chance.
In a trial 10 times as big as the one above (i.e. 1000 subjects per arm), a
comparison of one-year death rates of 15% and 20% is still five percentage
points, but it is based on 150 versus 200 deaths; a difference of 50 deaths.
# ‘Five percentage points’ is a better way of describing the effect than ‘5%’ when comparing two percentages. It avoids the possible confusion over whether the death rate for Drug A is 5% + 15% = 20% rather than 5% greater than 15%, which would be 1.05 × 15% = 16%.
Again a difference as large as this is unlikely to be due to chance, and likely to
be due to a real treatment effect of Drug A.
In the past, large treatment effects were often sought for many disorders
including cardiovascular disease and cancer, because new treatments at the
time were being compared with placebo or minimal treatment. Significant
improvements in treatments and prevention have since occurred, and these
are the current standard of care against which new treatments now need to
be compared. This means that moderate or even small effects are often now
expected, requiring larger study sizes.
A very large study should give a clear answer to the research question.
Resources may be saved by conducting a small trial, but a clinically important difference between two treatments may be missed because the result is
not statistically significant (see page 113), when in fact there is a real effect,
but the study is too small to detect it. This can occur when the true effect
is smaller than that specified in the sample size calculation. When this happens, it is difficult to make reliable conclusions. This is why it is important to
try to detect the smallest clinically worthwhile effect, with 80 or 90% power,
in order to have an appropriately sized study. For example, if the aim is to
only recruit about 130 subjects, an effect size (standardised difference) of 0.5
could be specified (Box 5.4). But if the real effect size were 0.2, 790 subjects are
needed; a smaller trial is likely to miss this.
When the trial objective is to examine equivalence or non-inferiority, the
new treatment is sometimes expected to be associated with fewer side-effects.
It is sometimes useful to ensure that the target sample size is also large enough
to reliably detect a difference in adverse events.
5.10 Reasons for increasing the sample size
The sample-size estimate assumes that there will be a measure of the trial
endpoint on every subject at the end of the trial. This is not possible in many
trials because some subjects will withdraw from the trial (patient drop-outs
or patient withdrawal, see page 118). However, a certain proportion of dropouts can be allowed for. If the estimated sample size were 500 subjects and 10%
were expected to withdraw, the trial would aim to recruit about 556 patients,
because 556 less 10% is 500 (556 = 500/(1 − 0.10)).
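A minimal sketch of this adjustment in Python, reproducing the figures above:

    import math

    def inflate_for_dropouts(target_n, dropout_rate):
        # Recruit enough subjects so the target remains after expected withdrawals
        return math.ceil(target_n / (1 - dropout_rate))

    print(inflate_for_dropouts(500, 0.10))  # 556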
Some trials have one or more interim analyses, i.e. early looks at the data,
with the aim of possibly stopping the trial early if a large treatment effect is
found, or the effect observed is so small that there is unlikely to be a clinically
important effect if the trial continued (see page 122). When this is planned, the
sample size can be increased to allow for having several analyses.
Other reasons for increasing sample size could be to allow for having two
or more primary endpoints (page 115), unequal randomisation (page 83), subgroup analysis (page 119), or to examine an interaction between two new treatments in a factorial trial (page 107).
5.11 Other considerations in designing phase III trials
Once the main trial endpoints and estimated sample size are determined it is
useful to assess the feasibility of recruitment, the number of recruiting centres
that might be needed and the duration of the trial. This will provide an idea
of the financial costs and how the study could be conducted.
An issue that may arise is whether subjects could simultaneously enter
more than one clinical trial. Alternatively, subjects finishing one trial may be
asked to enter a subsequent trial soon after. Neither is encouraged because it
might be difficult to separate out the different treatment effects. If there are
situations when this might occur it is necessary to ensure that it will still be
possible to address the research question of each trial, and that no bias or confounding has been introduced. Researchers of both studies should be aware
of this. The worst scenario is where there is a serious imbalance in the arms
of the second trial (Figure 5.6). This can be avoided by ensuring that subjects
who enter Trial 2 are stratified at the time of randomisation (see Chapter 6)
for the allocated treatment arm from Trial 1.
In some disease areas, usually uncommon disorders, it is possible to conduct a phase II/III trial. Here, a phase III randomised trial is designed and
conducted, but an assessment of efficacy is made early on (e.g. after a quarter of subjects have been recruited), similar to an analysis in a phase II trial.
Sometimes, the study is temporarily halted so that further patients are not
recruited and treated until the phase II assessment is complete. The purpose
is to judge whether the new treatment is unlikely to be effective, so the trial
could stop early. It could also be used to investigate several new treatments
[Figure 5.6 shows a flow diagram: 200 subjects are randomised in Trial 1, 100 to Drug A and 100 to Drug B, and later enter Trial 2. Of the 100 Drug A patients, 70 are allocated Drug C and 30 Drug D; of the 100 Drug B patients, 25 are allocated Drug C and 75 Drug D, giving 95 patients on Drug C and 105 on Drug D.]

Figure 5.6 Hypothetical situation where subjects can enter one trial after another and confounding of treatments has occurred. The Drug C group is dominated by patients who previously had Drug A, and the Drug D group is dominated by patients who previously had Drug B.
simultaneously to decide which merited continued investigation. The results
based on the interim data are not published, and should only be seen by the
trial statistician and an independent Data Monitoring Committee (see page
179). As well as being an efficient use of subjects (those in the phase II part can
be included in the full phase III trial), a practical advantage of not having completely separate phase II and III trials is that the ‘seamless’ approach does not
need two separate clinical trial applications, approval from two ethics committees and two set-up procedures at centres. This reduces the time taken to
evaluate a new intervention.
5.12 Summary
• Phase III trials are considered the ‘gold standard’ for evaluating a new intervention
• They should be designed to be sufficiently large to provide reliable evidence
• There are several types of objectives: superiority, equivalence and non-inferiority
• The main outcome measure should be relevant to the trial subjects, researchers and those who may benefit in the future
• The methods for estimating sample size depend on the type of outcome measure, the trial objective and whether there are separate groups of subjects, or subjects get all treatments.
References
1. Govaert TME, Thijs CTMCN, Masurel N et al. The efficacy of influenza vaccination in elderly individuals. JAMA 1994; 272(21):1661–1665.
2. Montori VM, Permanyer-Miralda G, Ferreira-Gonzalez I et al. Validity of composite end points in clinical trials. BMJ 2005; 330:594–596.
3. Machin D, Campbell M, Fayers P, Pinol A. Sample Size Tables for Clinical Studies, 2nd Edn. Blackwell Science, 1997.
4. Pocock S. Clinical Trials: A Practical Approach. John Wiley & Sons, Ltd, 1983.
5. Julious SA. Tutorial in biostatistics. Sample sizes for clinical trials with Normal data. Stat Med 2004; 23:1921–1986.
6. Jones B, Jarvis P, Lewis JA, Ebbutt AF. Trials to assess equivalence: the importance of rigorous methods. BMJ 1996; 313:36–39.
7. Power and Sample Size calculation. DuPont WD, Plummer WD. http://medipe.psu.ac.th/episoft/pssamplesize/
8. PASS (Power Analysis and Sample Size software): http://www.ncss.com/pass.html
9. nQuery: http://www.statsol.ie/html/nquery/nquery_home.html
10. STATA: http://www.stata.com
11. MINITAB: http://www.minitab.com/
12. SAS (Statistical Analysis Software): http://www.sas.com/
13. SPSS (Statistical Package for the Social Sciences): http://www.spss.com
CHAPTER 6
Randomisation
Randomly allocating individuals, or groups of individuals, to two or more
interventions is the key design feature of all phase III clinical trials. It should
ensure that the characteristics of individuals are similar between the trial
groups (i.e. minimises confounding), and minimises bias (Chapter 1, page
12). This is achieved by ensuring that trial staff entering each subject, or the
subjects themselves, cannot predict the treatment allocation (Box 6.1). It is also
expected that the intervention arms have similar numbers of subjects, unless
otherwise specified.
For relatively small trials, say less than 100 subjects, there are simple randomisation methods that can be carried out by hand, but for large multi-centre
trials it is preferable to use a computer. By removing all human influence from
the random allocation process, possible biases are minimised. There are different methods of randomisation, and their strengths and limitations should be
considered.
6.1 Simple randomisation
In its most basic form, randomisation can be done by simply throwing a coin:
if Heads, give Treatment A; if Tails, give Treatment B. However, the coin could
be thrown until a preferred treatment allocation is obtained for a particular
subject. Using a random number list is better (the numbers 0 to 9 in a random order). This can be obtained from statistical tables or random number
generator functions within software such as Microsoft Excel.
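As a minimal illustration (not a validated randomisation system), the allocation rule used in this chapter’s examples, digits 0–4 give Treatment A and 5–9 give Treatment B, can be reproduced with Python’s built-in random number generator:

    import random

    # Digits 0-4 allocate Treatment A; digits 5-9 allocate Treatment B
    random_numbers = [random.randint(0, 9) for _ in range(12)]
    allocations = ['A' if digit <= 4 else 'B' for digit in random_numbers]
    for subject, (digit, arm) in enumerate(zip(random_numbers, allocations), start=1):
        print(subject, digit, arm)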
Table 6.1 provides an example of a random list of 12 numbers for allocating
subjects to two interventions. The first and second subjects recruited receive A,
the third receives B, and so on. In the table, the number of subjects in each arm
is identical (six in each). With large trials (several hundred or several thousand
subjects) simple randomisation should produce similarly sized groups. However, it is possible to get a noticeable imbalance when the trial size is small
(say <30 subjects), just by chance. Among the first six subjects in Table 6.1,
two received Treatment A and four received Treatment B. Extending this to a
trial of 20 subjects, there could be 13 on one arm and 7 on the other, simply
because of the ordering of the random numbers. Although the allocation process has been truly random, having unequal treatment groups could affect the
Box 6.1 Randomisation
• Randomisation produces treatment groups that have similar characteristics other than the trial intervention
• The only systematic difference between the trial arms is the treatments given
• Any observed difference in the trial endpoints should be due to the effect of the treatment and not to any other factors.
Randomness = unpredictability
What is important is that the next treatment allocation cannot be predicted by
the person entering the subject.
statistical analysis when making comparisons between groups – it can reduce
statistical power. Furthermore, it would be unfortunate if there were fewer
subjects on the new intervention arm, the arm of most interest.
To ensure that treatment groups have a similar size, random permuted blocks can be used. A block size, which must be divisible by the number of interventions, is specified. For two treatments the block size is often four or six, sometimes eight or greater. With a block size of four, every consecutive group of four subjects contains equal numbers in each trial arm, with each treatment appearing twice in each block. Table 6.2 illustrates one way of using random permuted blocks. Each random number determines the allocation for the next four subjects, not one subject as before.
For three treatments, the block size needs to be divisible by three, such as six or nine. With a block size of nine, the numbers 1 to 9 could be randomly ordered:
• 1–3 give Treatment A
• 4–6 give Treatment B
• 7–9 give Treatment C.
For four treatments, as in 2 × 2 factorial trials, the block size could be 8 or 12. Using a block size of 12, and a random ordering of the numbers 1 to 12:
• 1–3 give Treatment A
• 4–6 give Treatment B
• 7–9 give Treatment C
• 10–12 give Treatment D.
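The block method is easy to program. A minimal Python sketch for two treatments and a block size of four, using the six admissible blocks listed under Table 6.2:

    import random

    BLOCKS = ['AABB', 'BBAA', 'ABAB', 'BABA', 'BAAB', 'ABBA']  # block size four

    def permuted_block_allocations(n_blocks):
        # Each block contains two As and two Bs, so after every four subjects
        # the two arms are exactly balanced
        allocations = []
        for _ in range(n_blocks):
            allocations.extend(random.choice(BLOCKS))
        return allocations

    print(permuted_block_allocations(3))  # 12 allocations: six As and six Bs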
A limitation of random permuted blocks is that the allocation for the last subject in the block can be predicted if the previous allocations are known. This
Table 6.1 Random number list used to allocate 12 subjects to Treatment A or B.

Random number list      4  1  6  5  8  5  2  3  0  7  9  2
Subject identifier∗     1  2  3  4  5  6  7  8  9  10 11 12
Treatment allocation    A  A  B  B  B  B  A  A  A  B  B  A

If random number is: 0–4: give Treatment A; 5–9: give Treatment B
∗ This is also the ordering in which subjects are recruited
Table 6.2 Random number list used to allocate 12 subjects to Treatment A or B using a block size of four. The first three random numbers are 4, 1 and 6 (from Table 6.1).

Random number list      4           1           6
Subject identifier∗     1  2  3  4  5  6  7  8  9  10 11 12
Treatment allocation    B  A  B  A  A  A  B  B  A  B  B  A

If random number is: 1 – AABB; 2 – BBAA; 3 – ABAB; 4 – BABA; 5 – BAAB; 6 – ABBA (ignore random numbers 0, 7, 8 and 9)
∗ This is also the ordering in which subjects are recruited
can be avoided by having a mixture of block sizes, so that the person randomising the subjects is unaware of whether the next subject is in a block size
of, say, four or six. For double-blind trials, knowing the block size should not
matter.
For crossover or split-person trials, where each subject receives all interventions in sequence or at the same time, the ordering of the treatments needs to
be randomised, so that a similar proportion of subjects receive ‘A’ or ‘B’ first.
This is achieved by randomly allocating subjects to receive either ‘A followed
by B’ or ‘B followed by A’ (Table 6.3). In a split-person design, patients are
randomised to receive ‘A to the left side and B to the right side’ or ‘B to the
left side and A to the right side’.
Simple randomisation is easy to implement by hand or by computer, but it
ignores important prognostic baseline factors that can affect the value of the
trial endpoint. This should not matter for large trials, because getting large
chance imbalances in these factors will be rare. However, for small or moderately sized trials, these differences could affect the trial results. If, for example,
the percentage of patients with severe disease happened to be 20% on Treatment A and 35% on Treatment B, this difference may partly explain a difference in the trial endpoints, by making B appear worse. Although imbalances
can be allowed for in the statistical analysis, the adjusted results could still
be viewed with caution. It is better to avoid large imbalances during recruitment. The two following methods of randomisation achieve this (stratified
randomisation and minimisation).
Table 6.3 Random number list used to allocate 10 subjects to Treatment A and B in sequence in a crossover trial.

Random number list      4   1   6   5   8   5   2   3   0   7
Subject identifier∗     1   2   3   4   5   6   7   8   9   10
Treatment allocation    AB  AB  BA  BA  BA  BA  AB  AB  AB  BA

If random number is: 0–4: give Treatment A then B; 5–9: give Treatment B then A
∗ This is also the ordering in which subjects are recruited
6.2 Stratified randomisation
Stratified randomisation attempts to guarantee balance for some important
stratification factors. These could include age, gender and severity of disease,
but the number and type of factors will vary between trials, and should be
selected carefully. Recruiting centre may also be included if subjects are spread
over a wide geographical region, and there are clear differences in local practice. In surgical trials, a stratification factor could be surgeon, in which case
it may not be necessary to also include centre. A stratification factor should not
have levels with few expected subjects. For example, if disease severity (moderate and severe) were included, but <1% of patients are expected to fall in
the ‘severe’ category, it would not be worth including this factor in the randomisation process.
Stratification involves using simple randomisation within each level of the factor. For categorical variables, such as gender, centre or disease severity (mild,
moderate, severe), the factor levels are already defined. For continuous measurements, such as age, weight and many blood values, the range must be
converted into categories.
Table 6.4 illustrates stratified randomisation using age alone. A random
number list is generated for each age group. For example, the first subject,
who is aged 52 years, is randomised using the random number list under ‘50–
59’ years, and the fourth patient, aged 68 years, is randomised under the ‘60–
69 years’ list. Block sizes of four or more can be used to ensure that the number of subjects is similar between the treatment arms. With two stratification
Table 6.4 Illustration of stratified randomisation using one factor (age). Twelve patients have been randomised using the appropriate strata for their age.

Age group 40–49 years
Random number list      4   1   6   5
Treatment allocated     A   A   B   B
Subject identifier∗     3   8   9   11
Age (years)             48  40  43  41

Age group 50–59 years
Random number list      0   7   4   1
Treatment allocated     A   B   A   A
Subject identifier∗     1   2   6   12
Age (years)             52  55  53  58

Age group 60–69 years
Random number list      9   6   3   1
Treatment allocated     B   B   A   A
Subject identifier∗     4   5   7   10
Age (years)             68  65  65  61

If random number is: 0–4: give Treatment A; 5–9: give Treatment B
∗ This is also the ordering by which subjects are recruited
Table 6.5 Illustration of stratified randomisation using two factors, age and gender.

                        Age group (years)
Gender                  40–49            50–59            60–69
Male
Random number list      4  6  2  5       7  5  3  1       4  1  5  9
Treatment allocated     A  B  A  B       B  B  A  A       A  A  B  B
Female
Random number list      0  7  4  1       8  3  7  4       4  8  9  1
Treatment allocated     B  A  A  B       B  A  B  A       A  B  B  A

If random number is: 0–4: give Treatment A; 5–9: give Treatment B
factors, for example age and gender, simple randomisation is performed
within each combination of the factor levels (Table 6.5). There is a random
number list in each cell of the table.
Stratified randomisation is generally a good way of balancing important
prognostic factors, and can be relatively easy to do by hand. Problems arise
when one or more stratification factors have many levels, or several factors
are specified. While two to four factors may be necessary, sometimes far too
many are used.
In a real example of a double-blind treatment trial of lung cancer, comparing
placebo with a drug called Tarceva, the following stratification factors were
included:
• Quality of life performance status (three groups)
• Smoking status (two groups)
• Tumour stage (two groups)
• Recruiting centre (50 centres).
A randomisation list is needed for every combination of these four factors, i.e.
3 × 2 × 2 × 50 = 600 lists. Randomising patients would be cumbersome to do
by hand, and there are likely to be many cells with only one or two patients,
particularly if the trial size is not large. There could be more cells than patients,
leading to a chance imbalance in the number of patients in the trial arms, or
in one or more of the stratification factors.
After the first 301 patients were recruited out of a target sample size of
664, by chance alone the first patient randomised in each of several centres
was allocated to Tarceva, and many centres had only recruited one patient.
There were 168 patients in the Tarceva arm, 26% greater than that in the
placebo arm (n = 133). Although a difference of this magnitude would have
a minimal effect on the statistical analysis, it can give the false impression
that the randomisation process did not work properly. As expected, the size
of the difference diminished as more patients were randomised within each
of the stratification factors. The limitations of stratified randomisation can be
largely overcome by careful selection and justification of the stratification factors, or by using a method called minimisation.
6.3 Minimisation
Minimisation also aims to ensure balance between the treatment groups for
pre-specified prognostic factors. The treatment allocation for the first few
subjects (e.g. 20) can be made using a single random number list (as in
simple randomisation). However, the allocation for each subsequent subject
depends on the distribution of the stratification factors among those who have
already been randomised, and not using random numbers. Minimisation is
also referred to as dynamic allocation.
Table 6.6 illustrates one method of minimisation. It is based on a hypothetical trial in which 20 patients have already been recruited, and a random
number list used. The distribution of each factor is obtained. The next (21st)
subject, who is 45 years old, female and with moderate disease, needs to be
randomised. The total for Arm A is less than for Arm B so the 21st patient is
allocated to Arm A. If the total for B was lower, the subject is allocated to Arm
B. If the totals are identical, allocation can be made using a random number
list. This method of minimisation only considers the balance in the categories
which apply to the patient being randomised, but there are more sophisticated
methods that consider the overall balance across all categories.
An advantage of minimisation is that it can cope easily with any number of stratification factors, including those with many levels, but it is best implemented using a computer program. This should not, however, encourage researchers to use as many factors as they can.
It is sometimes argued that minimisation is not truly random because the next allocation is predictable and could be susceptible to bias. This can be partly overcome by allocating the favoured treatment with a high probability rather than with certainty. In the example, this would mean that the 21st patient has an 80 or 90% chance of being allocated Arm A, rather than a 100% chance; allocation to Arm A is not completely certain.
Table 6.6 Illustration of a simple method of minimisation using three stratification factors.

                                     Number of subjects
Factor           Level               Arm A    Arm B    21st subject
Age              40–49 years         4        3        Age 45
                 50–59               4        2
                 60–69               2        5
Gender           Male                5        5
                 Female              4        6        Female
Disease status   Mild                3        3
                 Moderate            2        4        Moderate
                 Severe              4        4
Sum (over the 21st subject’s levels) 10       13

Sum for A is less than sum for B, so 21st subject receives Treatment A
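The method in Table 6.6 can be sketched in a few lines of Python (a simplified illustration only, not a production randomisation system; the p_best argument implements the high-probability variant discussed above, and defaults to certain allocation):

    import random

    def minimisation_allocation(counts, new_subject, p_best=1.0):
        # Sum the existing arm totals over the new subject's factor levels,
        # then allocate to the arm with the smaller sum (random if tied)
        sums = {'A': 0, 'B': 0}
        for factor, level in new_subject.items():
            for arm in sums:
                sums[arm] += counts[factor][level][arm]
        if sums['A'] == sums['B']:
            return random.choice(['A', 'B'])
        best = 'A' if sums['A'] < sums['B'] else 'B'
        other = 'B' if best == 'A' else 'A'
        return best if random.random() < p_best else other

    # The 21st subject from Table 6.6: aged 45 (40-49), female, moderate disease
    counts = {
        'age': {'40-49': {'A': 4, 'B': 3}, '50-59': {'A': 4, 'B': 2},
                '60-69': {'A': 2, 'B': 5}},
        'gender': {'male': {'A': 5, 'B': 5}, 'female': {'A': 4, 'B': 6}},
        'disease': {'mild': {'A': 3, 'B': 3}, 'moderate': {'A': 2, 'B': 4},
                    'severe': {'A': 4, 'B': 4}},
    }
    new_subject = {'age': '40-49', 'gender': 'female', 'disease': 'moderate'}
    print(minimisation_allocation(counts, new_subject))  # 'A' (sums 10 vs 13)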
For single centre studies, minimisation should not be used in case the person allocating the treatments is aware of the allocations. However, for multicentre trials it is difficult for trial staff in one centre to know of the treatment
allocations from all centres and all stratification factors, and therefore correctly
guess the next allocation. Given that the randomisation process means unpredictability (Box 6.1), minimisation is an acceptable method for trials with at
least two centres.
Methods of randomisation
• Simple randomisation (with or without a specified block size)
• Stratified randomisation (with or without a specified block size)
• Minimisation
– A block size of k means that after every k subjects have been recruited, the number of subjects in each treatment arm is the same
– Stratified randomisation and minimisation ensure the trial arms are well balanced for specified important prognostic factors
– In simple and stratified randomisation, the allocation of one subject to the trial groups is independent of the allocation of all other subjects
– In minimisation, the allocation of a subject depends on the previous allocations.
6.4 Unequal randomisation
Most trials aim to have a similar number of subjects in each treatment arm
(equal or 1:1 randomisation). Sometimes, more subjects are required in one
arm (unequal randomisation), usually the new intervention, such as in the
ratio 2:1. This may be because more reliable data is needed on the effects of
the new treatment (e.g. side-effects). Also, subjects may be more likely to participate in a trial if they have a 2 in 3 chance of getting a potentially more
effective treatment, rather than a 50% chance.
For the same sample size, the statistical power associated with comparing
the results of two trial arms decreases as the number of subjects in each arm
becomes more unequal. However, the loss in power is only considered unacceptable if the ratio exceeds 3:1. In the example of the lung cancer trial on
page 81, the imbalance is noticeable to the eye (168:133), but the ratio (1.3:1)
would not mean a great loss of power in the statistical analysis. To avoid loss
of power, unequal randomisation can be allowed for in the sample size calculation by having a larger study size.
6.5 Which method of randomisation to use?
The choice of randomisation method depends on the size of the trial, the number of stratification factors, and availability of a computerised randomisation
Table 6.7 Crude guide to choosing the randomisation method in a two-arm trial.

                                       Number of stratification factors
Size of trial
(total number of subjects)   None      Few (all with a few levels)          Many (some with many levels)∗
Small (<50)                  Simple    Stratified or minimisation           Minimisation
Moderate (50–199)            Simple    Stratified or minimisation           Minimisation
Medium (200–999)             Simple    Stratified or minimisation           Minimisation
Large (>1000)                Simple    Simple, stratified or minimisation   Stratified or minimisation
Very large (>10 000)         Simple    Simple, stratified or minimisation   Simple, stratified or minimisation

∗ or a few factors, of which some have many levels
program. The randomisation process is often performed by hand in small trials, because the development and maintenance costs associated with a computer program are not worthwhile. Table 6.7 is a crude guide to how trial
size and number of stratification factors might influence the randomisation
method used.
The aim of achieving balance in important prognostic factors justifies using
methods such as stratified randomisation or minimisation. However, having
trial arms with equal numbers is not a necessary outcome of randomisation;
chance variation should produce arms with slightly different sizes.
For crossover or split-person studies, subject characteristics and prognostic
factors are, by design, identical between treatment arms because each subject
acts as his/her own control. Simple randomisation should therefore be acceptable. Stratified randomisation might be considered to ensure that, for example,
a similar proportion of males and females have Treatment A followed by B, and
vice versa.
When stratification factors are used, it is recommended that adjustment is
made for the factors in the statistical analysis (using the multi-variate methods
in Box 7.11, page 114). This might seem counter-intuitive because these factors
were used specifically to ensure balance. The reason is that the randomisation process has been ‘restricted’ by incorporating these stratification factors,
compared with simple randomisation. Adjusting for them in the analysis can
increase the precision of the results (narrower confidence interval), though the
effect size does not usually change much. Both the unadjusted and adjusted
effect sizes (and 95% confidence intervals) could be presented for the main
endpoint.
6.6 Eligibility
Inclusion and exclusion criteria (eligibility list) are always defined (see page
11). When potential subjects are identified, this list is examined to ensure that
the subject is suitable for the trial. The subject can then be randomised after
giving consent. The trial interventions should be administered soon after the
allocation has been made. The eligibility list for each subject can be filed in
the recruiting centre so that it can be examined during a monitoring visit (see
page 179).
Some trials have a long eligibility list, which may make it difficult to recruit
the target sample size in a timely fashion. A cut-off for each criterion must
be specified, but there should be some degree of flexibility. When a subject’s
value is just outside of the range, a judgement could be made whether to randomise or not. For example, if the required age range is 50 to 80 years, but
a potential subject is aged 49 years and was eligible in relation to all other
factors, it would be reasonable to randomise them; the age is very close to
the cut-off. A subject aged 35 years should probably not be randomised. The
decision to randomise will depend on the importance of the criteria used, and
how close to the limit for inclusion the subject falls.
It is useful to have a screening log in each participating centre, which
records each eligible subject approached, and whether they declined to participate in the trial and why. This could be used to identify problems with
recruitment.
6.7 Randomising in practice
The logistical aspects of randomly allocating subjects vary according to the size of the trial and the resources available. Trials co-ordinated within dedicated clinical trials units or established research departments should already have the computing expertise to implement any type of randomisation method requested. Outside such settings, methods such as simple or
stratified randomisation (with one or two stratification factors) can be done
by hand. A statistician or computer programmer sometimes produces the
randomisation list, which can be created using a random number generator
available in many software packages. When allocating subjects, it is often
best for a computer program to read off the list rather than a person (e.g. trial
co-ordinator), because it avoids human error, which could occur when using
stratified randomisation or minimisation. This is especially so for large trials.
Tossing a coin for all trial subjects should be avoided because bias cannot be
detected, and there is no formal record of a randomisation list until after the
subject has been randomised. A randomisation list or minimisation process
can show the regulatory authorities, or the sponsor’s auditor, that treatment
allocation has been properly conducted. However, if a computer randomisation program is used and there is a rare occasion when the system is not
functioning, tossing a coin is a simple solution at the time. Other methods of
Table 6.8 Randomisation lists using simple randomisation in two trials; an unblinded trial (surgery vs chemotherapy) and a blinded trial (aspirin vs placebo).

Unblinded trial                             Blinded trial
Random    Treatment       Patient           Treatment    Pack     Subject
number    allocation      number            allocation   code     number
3         Surgery         1                 Aspirin      M1001    1
7         Chemotherapy    2                 Placebo      M1002    2
8         Chemotherapy    3                 Placebo      M1003    3
0         Surgery         4                 Aspirin      M1004    4
1         Surgery         5                 Aspirin      M1005    5
3         Surgery         6                 Aspirin      M1006    6
8         Chemotherapy    7                 Placebo      M1007    7
2         Surgery         8                 Aspirin      M1008    8
4         Surgery         9                 Aspirin      M1009    9
5         Chemotherapy    10                Placebo      M1010    10

Allocation: Treatment A if random number is 0–4 and Treatment B if 5–9
randomising subjects include sealed envelopes, each one containing the next
allocation, based on a random number list (e.g. surgical trials).
For blind trials, the randomisation list will not show the interventions. The
list of subjects and their actual treatment is not revealed until the end of the
trial. Only the randomisation programmer and trial statistician should have
access to this list during the trial because neither has direct contact with subjects or staff who recruit and manage subjects. The programmer and statistician cannot, therefore, influence the treatment allocation, or the trial outcome
measures. The randomisation list visible to other trial staff only contains a
treatment code, sometimes called medication or ‘med’ number, or ‘pack code’.
This code could be created by the trial co-ordinating centre, or the drug supplier, but it is essential that the supplier labels the drugs correctly. Named trial
staff can obtain the actual allocation for a particular subject when, for example, there is a serious adverse event, and knowing what trial treatment was
given will help (see page 182).
Table 6.8 is an example of a randomisation list using simple randomisation.
Working down the list, the first patient (ID code 1) is allocated to surgery,
the second patient receives chemotherapy and so on. In the blinded study,
trial staff or anyone else involved in randomisation would only see the pack
code and patient identifier. The first patient randomised would be sent drugs
labelled M1001 by the supplier. The supplier would need to ensure that the
aspirin packets are labelled M1001, M1004, M1005 etc., and the placebo packets are labelled M1002, M1003, M1007.
Multi-centre trials usually have a trial co-ordination centre with dedicated
staff. Subjects are randomised after recruiting centres contact the trial centre
(usually by telephone or fax), which, after checking eligibility, uses a computer
randomisation program to inform the centre of the treatment allocation. For
international trials, where subjects could be recruited at any time of the day
in relation to the co-ordinating centre, it is often impractical to have 24-hour
trial staff. Instead, an internet randomisation system, or automated telephone
service with voice recognition, can be used, neither of which require direct
contact with trial staff. These systems can be expensive to set up and require
expert IT staff to develop and maintain. Central randomisation has the advantage that the treatment allocation is performed by someone who has no direct
contact with the subject, thus minimising the potential for bias.
6.8 Checking that the randomisation process worked:
examining baseline characteristics
All trial reports should have a table comparing the baseline characteristics
between the interventions. The aim is to show that randomisation produced
similar arms, indicating that the results are valid and unlikely to be explained
by any factor other than the treatments being tested. If the characteristic is
based on ‘taking measurements on people’, the mean (or median) values
should be similar between the groups. If based on ‘counting people’, the proportions should be similar.
P-values (discussed in Chapter 7) are sometimes provided for the baseline
comparisons. They indicate whether an observed difference could have arisen
by chance, assuming that the distribution of the factor in one group is identical
to the distribution in another group. However, it is inappropriate to examine
baseline differences in this way.1 P-values test whether a baseline factor for ≥2
groups came from the same distribution, but this is known to be true because
the randomisation was made from the same group of subjects in the first place.
Reporting and interpreting p-values for baseline characteristics should therefore be avoided. Randomisation is expected to produce some small imbalances by chance. What matters is whether the size of the difference is likely to distort the comparison of
the treatments.
Table 6.9 shows the baseline characteristics in a trial comparing different
methods of inhaled sedation during oral surgery among anxious children, in
Table 6.9 Baseline characteristics and main outcome measure of a trial comparing methods of inhaled sedation during oral surgery among anxious children.2

                                   Air        Nitrous oxide   Nitrous oxide     P-value for the
                                   N = 174    N = 256         + sevoflurane     difference between
                                                              N = 267           the three groups
Baseline characteristic
Males (%)                          47%        50%             39%               0.03
Mean age (years)                   9.1        9.5             9.6               0.11
Mean body weight (kg)              36.3       37.8            37.7              0.50
Level of anxiety (mean score)      5.6        6.1             6.0               0.01
Main trial endpoint
Percentage of children who
completed surgery                  54%        80%             93%               <0.001

All children received intravenous midazolam
which p-values for baseline factors were reported.2 The main endpoint was
whether the dentist was able to complete the surgery. While most factors
were similar between the groups, there appeared to be a difference in gender
and anxiety level, indicated by a p-value <0.05. There was a lower percentage of males in the ‘nitrous oxide plus sevoflurane’ group, and children who
received air tended to have lower anxiety levels. Instead of focusing on the
p-value, the extent to which the outcome measure of the trial could be affected
by these imbalances needs to be considered, as well as plausibility.
The trial results could be affected if gender or anxiety levels were associated with the chance of completing treatment. For example, the percentage
of males in the ‘nitrous oxide plus sevoflurane’ group was eight percentage points lower than in the group who received air (39 vs 47%). Gender
could have an effect if males are less likely than females to complete surgery,
because the lower completion rate in the ‘air’ group (54%) could be due to
the higher proportion of males. This might not be plausible. Furthermore, it is
unlikely that such a large treatment effect (93 vs 54%, a difference of 39 percentage points) could be explained by a difference of only eight percentage
points. Similarly, the average anxiety level in the ‘nitrous oxide plus sevoflurane’ group was 0.4 units higher than that in the ‘air’ group. This could affect
the endpoint if children with higher anxiety levels are more likely to complete surgery, but this again is questionable, and a difference of 39 percentage
points is unlikely to be due to a difference of only 0.4 units. Despite an apparent statistically significant difference in these two factors, the treatment effect
is unlikely to be materially affected. The trial results are valid.
When the number of subjects in a trial is very large, even small and unimportant baseline differences could be highly statistically significant. It would
be incorrect to conclude that the randomisation process failed. When observed
differences appear large enough to matter, checks can be done. First, it must be
established that subjects were correctly allocated from the randomisation list,
by looking for human error or error in the programming code. Second, selection or allocation bias needs to be eliminated as a possible cause (probably not
as necessary for double-blind trials). For example, screening logs within centres could be examined to determine whether certain eligible subjects were
not randomised to one of the trial arms, or were withdrawn soon after randomisation, but not included in the trial. Whatever these checks show, there
are statistical methods that can allow for differences in baseline characteristics when analysing the main trial endpoints (the multi-variate methods on
page 114). However, it is best to ensure similarity during recruitment.
6.9 Summary
• The three commonly used methods of randomisation are: simple and stratified randomisation, and minimisation.
• There is no perfect method of randomisation; one may be more appropriate than another for a particular trial.
• The choice of method can depend on the trial size, the number of important prognostic factors that need to be allowed for, and logistical and resource issues.
• Randomisation does not need to produce trial arms with equal numbers, and the distribution of baseline characteristics needs to be only similar, not identical, between the groups.
References
1. Senn SJ. Testing for baseline balance in clinical trials. Stat Med 1994; 13:1715–1726.
2. Averley PA, Girdler NM, Bond S, Steen N, Steele J. A randomised controlled trial of paediatric conscious sedation for dental treatment using intravenous midazolam combined with inhaled nitrous oxide or nitrous oxide/sevoflurane. Anaesthesia 2004; 59:844–852.
CHAPTER 7
Analysis and interpretation of
phase III trials
Randomised controlled trials aim to change practice, so their data needs careful analysis and interpretation. Phase III trials always compare at least two
intervention groups. The data can be interpreted using the following fundamental questions:
• Is there a difference? Examine the effect size.
• How big is it?
– What are the implications of conducting a trial on a sample of people (confidence interval)?
• Is the effect real?
– Could the observed effect size be a chance finding in this particular trial (p-value or statistical significance)?
• How good is the evidence?
– Are the results clinically important?
An effect size is a single quantitative summary measure used to interpret clinical trial data, and to communicate the results. It is obtained by comparing a
trial endpoint between two intervention arms. Types of effect sizes depend on
the outcome measure used: ‘counting people’, ‘taking measurements on people’ or ‘time-to-event’ data (see Chapter 2), which also determines the method
of statistical analysis. This chapter presents the commonly used analyses, but
there are more complex ones that are appropriate when necessary.
7.1 Outcome measures based on counting people
Consider a trial that evaluated the effect of an influenza vaccine in the elderly
(Box 7.1).1
What are the main results?
Each percentage (or proportion) indicates the risk of developing flu. For
example, the risk of being diagnosed with flu by the family doctor after five
months in the placebo arm is 3.4%. The effect size is either the ratio (relative
risk or risk ratio), or the difference (absolute risk difference) between the
two risks. Using the results in Box 7.1, these are interpreted as follows:
Box 7.1 Example: Phase III trial of the flu vaccine in the elderly1
Location: 15 family health practices in the Netherlands
Subjects: 1838 men and women aged ≥60 years
Design: Double-blind placebo-controlled randomised trial
Interventions: A single flu vaccine or placebo (saline) injection
Main outcome measures: The proportion who developed flu up to 5 months
after the injection; diagnosed by (i) serology or (ii) family doctor
1838 individuals aged ≥60 years: 927 received the vaccine, 911 received placebo

Flu diagnosis five months later     Vaccine      Placebo
By serology                         41 (4.4%)    80 (8.8%)
By family doctor                    17 (1.8%)    31 (3.4%)
• Relative risk = 4.4% ÷ 8.8% = 0.50: vaccinated people are half as likely to develop serological flu as those given placebo, after five months
• Risk difference = 4.4% − 8.8% = −4.4%: among vaccinated people, 4.4% fewer cases of serological flu are expected compared with those given placebo. Alternatively, in 100 vaccinated subjects there could be 4.4 fewer cases than in a group of 100 given placebo. (The minus sign indicates fewer cases.)
The comparison, or reference, group must always be made clear.# They usually receive standard treatment, placebo or no intervention. It is insufficient to
say ‘Vaccinated subjects are half as likely to develop serological flu’. What is
correct is: ‘Vaccinated subjects are half as likely to develop serological flu compared with placebo subjects’. If the reference group were vaccinated subjects, the
relative risk would be 2.0 (8.8% ÷ 4.4%): placebo subjects are twice as likely
to develop serological flu as vaccinated subjects. The risk difference would be
+4.4% (8.8% − 4.4%): in 100 subjects given placebo there could be 4.4 more
cases of serological flu than in a group of 100 vaccinated subjects.
The ‘no-effect’ value
If the vaccine had no effect, both groups would have the same risk of developing flu. The relative risk (the ratio of the two risks) is one, and the risk difference is zero. These are called the no-effect value. They help interpret confidence intervals and p-values.

# It might also be useful to know the actual risk in this group.
Is the new intervention better or worse?
The relative risk or risk difference indicates the magnitude of the effect. Determining whether the intervention is more beneficial or harmful depends on
what is measured. ‘Risk’ implies something bad, but in research it can be used
for any endpoint. If the outcome measure is ‘positive’, for example, the percentage of people who are alive, an increased relative risk (i.e. >1), or positive
risk difference (i.e. >0), indicates that the new intervention is more beneficial. However, if the outcome measure is ‘negative’, such as the percentage
of people who have died, an increased relative risk or positive risk difference
indicates that the intervention is more harmful.
Further interpretations of risk difference and relative risk
There are other ways to help explain the treatment effect. A risk difference
can be converted to Number Needed to Treat (NNT). The risk difference for
serological flu is −4.4%, meaning that in 100 vaccinated subjects there were
4.4 fewer flu cases. To avoid one case of serological flu, 23 people need to be
vaccinated (100 ÷ 4.4). The NNT is 23 (Box 7.2).
Relative risks of 0.5 or 2.0 are easy to interpret: the risk is half or twice as
large as in the reference group. However, values of 0.85 or 1.30 are less intuitive: the risks are 0.85 or 1.30 times as large as that in the reference group.
Converting relative risk to a percentage change in risk can be useful (Box 7.3):
called either a risk reduction, or an excess risk. From Box 7.1, the relative risk
of 0.53 (1.8% ÷ 3.4%) means that the risk of developing clinician-diagnosed flu
is reduced by 47% in those who were vaccinated, compared with the placebo
group. The no-effect value for a percentage change in risk is zero.
Generally, a relative risk below 2.0 is changed to a risk reduction or excess
risk. If above 2.0 it is better left alone, to avoid looking cumbersome. A
relative risk of 12 is an excess risk of 1100% ([12 − 1] × 100). It is acceptable to
say the risk is 12 times greater, or increased 12-fold.
Box 7.2 Calculating the number needed to treat (NNT) to avoid one affected individual

Risk of developing serological flu∗
Vaccine    Placebo    Risk difference    NNT
4.4%       8.8%       −4.4%              23 (100 ÷ 4.4)
0.044      0.088      −0.044             23 (1 ÷ 0.044)

∗ expressed as a percentage or proportion
Box 7.3 Converting a relative risk to percentage change in risk

Relative     Subtract the no-effect     Multiply by 100
risk (RR)    value of 1 (RR − 1)        (RR − 1) × 100
0.85         −0.15                      −15%: relative risk reduction (or risk reduction)
1.30         +0.30                      +30%: excess relative risk (or excess risk)

A positive sign indicates the risk is increased, compared to the reference group
A negative sign indicates the risk is decreased
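These conversions are simple arithmetic; a minimal Python sketch reproducing the serological flu figures from Boxes 7.2 and 7.3:

    def effect_sizes(risk_new, risk_ref):
        rr = risk_new / risk_ref         # relative risk
        rd = risk_new - risk_ref         # absolute risk difference
        pct_change = (rr - 1) * 100      # negative values = risk reduction
        nnt = 1 / abs(rd)                # number needed to treat
        return rr, rd, pct_change, nnt

    rr, rd, pct, nnt = effect_sizes(0.044, 0.088)
    print(round(rr, 2), round(rd, 3), round(pct, 1), round(nnt))  # 0.5 -0.044 -50.0 23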
Relative risk or risk difference?
Relative risks tend to be similar across different populations, indicating the
effect of a new intervention generally. They do not usually depend on the
underlying rate of disease. A relative risk of 0.5 associated with a flu vaccine means that the risk is halved from whatever it is in a particular population, whether the flu incidence is 1 per 1000, or 20 per 1000. However, risk
difference always reflects the underlying rate, and so will vary between populations. It indicates the treatment effect in a particular population. As the disease becomes more common, the relative risk is not expected to change much,
but the risk difference will increase and the NNT decreases (Table 7.1). An
intervention has a greater effect in a population when the disease is common.
Because risk difference can vary, relative risk is the most commonly reported
effect size. Risk difference could be given in addition.
What are the implications of conducting a trial on
a sample of people?
When using a sample of people in a trial to estimate the true effect size among
all subjects who could benefit from the intervention, there is uncertainty over
Table 7.1 Relative risk and risk difference according to different underlying disease rates (i.e. in the placebo group).

Risk of flu
Placebo      Vaccine      Relative    Risk difference    Number needed
(per 100)    (per 100)    risk        (per 1,000)        to treat
1            0.5          0.50        5                  200
2            1.0          0.50        10                 100
5            2.5          0.50        25                 40
10           5.0          0.50        50                 20
20           10.0         0.50        100                10
Table 7.2 Effect size and 95% confidence intervals (CI) associated with comparing two proportions (or percentages).

                     Risk of flu
Outcome              Vaccine    Placebo    Relative risk          Risk difference
measure              N = 927    N = 911    (95% CI)               (95% CI)
Flu diagnosis by:
Serology             4.4%       8.8%       0.50 (0.35 to 0.72)    −4.4% (−6.6 to −2.1%)
Family doctor        1.8%       3.4%       0.53 (0.30 to 0.97)    −1.6% (−3.0 to −0.1%)
how close the observed effect size will be to the true value. This is quantified
by a standard error, used to calculate a 95% confidence interval (CI) for a
relative risk or risk difference. The basic principle is the same as that for a
single proportion or mean (Chapter 4 and Figure 4.1).
95% Confidence interval for a relative risk or risk difference
A range within which the true relative risk, or true risk difference, is expected
to lie with a high degree of certainty. If confidence intervals were calculated
from many different studies of the same size, 95% of them would contain the
true value, and 5% would not.
Table 7.2 shows the 95% CIs from the flu vaccine trial. The relative risk for
serological flu is 0.50 with 95% CI 0.35 to 0.72. The true relative risk is thought
to be 0.50, but there is 95% certainty, given the results of this trial, that the
true value lies somewhere between 0.35 and 0.72. The interval excludes the
no-effect value (relative risk of one), so there is likely to be a real effect of
the vaccine. The corresponding percentage risk reduction is 50%, and the true
value is likely to lie between 28 and 65% (calculation in Box 7.3). The ends of
the CI provide a conservative and optimistic estimate of the effect size, and
in this example, even a 28% risk reduction may be considered worthwhile.
The CI for the risk difference indicates that the true effect could be anywhere
between 2.1 and 6.6 fewer cases of flu in every 100 vaccinated people.
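For readers who want to reproduce such intervals, here is a minimal Python sketch of one common method, using the standard error of log(RR) (statistical packages may use slightly different approximations, hence small differences in the last decimal place):

    import math

    def relative_risk_ci(events_1, n_1, events_0, n_0, z=1.96):
        # Relative risk with an approximate 95% CI via the log transform:
        # SE[log(RR)] = sqrt(1/a - 1/n1 + 1/c - 1/n0)
        rr = (events_1 / n_1) / (events_0 / n_0)
        se_log = math.sqrt(1/events_1 - 1/n_1 + 1/events_0 - 1/n_0)
        lower = math.exp(math.log(rr) - z * se_log)
        upper = math.exp(math.log(rr) + z * se_log)
        return rr, lower, upper

    rr, lo, hi = relative_risk_ci(41, 927, 80, 911)  # serological flu
    print(round(rr, 2), round(lo, 2), round(hi, 2))  # 0.5 0.35 0.73 (Table 7.2: 0.35 to 0.72)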
There is likely to be a real treatment effect if:
The 95% CI for the relative risk excludes the no-effect value of 1
The 95% CI for the excess risk or risk reduction excludes the no-effect value
of 0
The 95% CI for the risk difference excludes the no-effect value of 0
Describing a CI sometimes implies that the true effect lies anywhere within
the range with the same likelihood. However, it is more likely to be close to the
point estimate used to derive the interval (i.e. the middle) than at the extreme
ends. This is an important consideration when the interval just overlaps the
no-effect value. With a relative risk of 0.75, and 95% CI 0.55 to 1.03, most of
the range is below the no-effect value. The possibility of ‘no effect’ cannot be
reliably excluded because the interval just includes 1.0, but the true relative
risk is more likely to lie around 0.75 than 1.0. A treatment effect should not be
dismissed completely with this kind of result, because there is a suggestion of
an effect.
Could the observed effect size be a chance finding in this
particular study?
The observed risk difference associated with serological flu, in the 1838 trial
subjects, was −4.4%. But what is of interest is the effect in all subjects that
could benefit from the vaccine. If the trial included every elderly person ever,
might the difference be as large as −4.4%, or even greater, or could there in fact
be no difference at all?# Could the observed difference of −4.4% be a chance
finding in this particular trial, due to natural variation? To help determine
this, a p-value is calculated. The size of the p-value depends on the difference
between the two risks (effect size), the sample size on which each risk is based,
and the number of events (i.e. cases of flu).
The p-value for a risk difference of −4.4%, with each risk based on 41/927
and 80/911, is <0.001. Even if there were really no effect (i.e. the true difference
were zero), a value of −4.4% could be seen in some studies, just by chance. But
how often? The p-value of <0.001 indicates that a difference as large as 4.4% or
greater, in favour of either the vaccine or placebo, would occur in less than 1 in
1,000 studies of the same size by chance alone, assuming that there really were
no effect. The observed effect size (4.4% risk difference) is therefore unlikely
to have arisen by chance, and so reflects a real effect. The p-value for clinician-diagnosed flu is 0.035: a risk difference of 1.6% or greater, in either direction,
could be seen in 35 in 1,000 studies of the same size, if the true risk difference
were zero. The Appendix provides more details.
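A Normal-approximation sketch of this p-value calculation (the Appendix describes the details; an exact or chi-square test would give slightly different values):

```python
import math

def two_tailed_p(p1, n1, p2, n2):
    """Two-tailed p-value for a difference between two proportions,
    using a Normal approximation: z = risk difference / standard error."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))   # both tails of the Normal

# Serological flu (41/927 vs 80/911): p ~ 0.0002, i.e. <0.001
print(two_tailed_p(41/927, 927, 80/911, 911))
# Clinician-diagnosed flu, using the reported risks of 1.8% and 3.4%: p ~ 0.03
print(two_tailed_p(0.018, 927, 0.034, 911))
```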
These are two-tailed p-values. Effect sizes of −4.4% or lower (vaccine better than placebo), or +4.4% or greater (placebo better than vaccine), are both
assumed to be plausible. Any departure from the no-effect value is allowed
for. The p-value is twice as large as a one-tailed p-value, which is based on
looking only in a single direction. The more conservative two-tailed p-value
is reported for most trials, unless there is a clear justification for having a one-tailed value (i.e. the new treatment can only be better).
# Or if many other trials were conducted, each with 1838 subjects, would a difference as
large as −4.4%, or greater, be observed in many of them, if the vaccine really had no effect?
A new intervention is considered to have a real effect if the p-value is <0.05;
a cut-off judged to be sufficiently low (see page 112). The result is said to be
statistically significant (Section 7.6).
How good is the evidence?
The flu vaccine trial was double blind, so the clinician giving the vaccine or
placebo injection could not have influenced the allocation. Subjects were also
unaware of what they received, so neither group should have been more or less likely to report flu-like symptoms (see page 62). It is possible, therefore, to
be confident about the relative risk of 0.50; a large and clinically important
effect, which was similar whether flu was determined by serology or clinician
diagnosis. In fact, it is close to the effect seen in observational studies. Treatment compliance was not an issue, because the interventions involved a single
injection; no subject refused the injection after being randomised.
Relative risk or odds ratio?
A relative risk is easy to calculate. Sometimes, the odds ratio is reported. It has
some useful mathematical properties used by many statistical methods. ‘Risk’
and ‘odds’ are different ways of presenting the chance of having a disorder.
Risk is the number with disease out of all subjects, while an odds expresses
the number with disease to the number without. If there is one affected subject
among n people, there must be n − 1 unaffected subjects. So a risk of 1/n is the
same as an odds of 1/(n − 1), or 1 : n − 1.
When the disease is fairly uncommon (say <20%), relative risk and odds
ratio are similar, and so interpreted in the same way (Table 7.3A). However,
when the disease is common, they will be noticeably different (Table 7.3B).
Odds ratios need careful interpretation because a ratio of two odds is not the
same as a ratio of two risks. In Table 7.3B it would be incorrect to interpret the
odds ratio as a risk reduction of 75%. The risk has been reduced by 51%, indicated by the relative risk, but the odds in the vaccine group is 0.25 times the
odds in the placebo group. This is difficult to explain easily. When a disorder
is common, describing an odds ratio as if it were a relative risk could greatly
over-estimate the treatment effect; relative risk is preferable.
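The contrast between the two measures in Table 7.3 is easy to check. A minimal sketch:

```python
def risk_and_odds_ratios(a, b, c, d):
    """2x2 table: a with / b without the disease in the vaccine group,
    c with / d without in the placebo group."""
    relative_risk = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (c * b)
    return relative_risk, odds_ratio

# Table 7.3A: flu is uncommon -- the two measures are similar
print(risk_and_odds_ratios(41, 886, 80, 831))    # ~(0.50, 0.48)
# Table 7.3B: flu is common -- they diverge noticeably
print(risk_and_odds_ratios(300, 627, 600, 311))  # ~(0.49, 0.25)
```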
7.2 Outcome measures based on taking measurements
on people
Here, the trial endpoint in each arm is summarised by the mean value and the
standard deviation (Chapter 2). An appropriate effect size is the difference
between two means (or mean difference). It often has a Normal distribution,
so simple statistical analyses can be used. However, this is not usually the case
when taking the ratio of two means, so this tends not to be used.
What are the main results?
Box 7.4 shows an example, and the trial endpoint is body weight.2 The two
groups had a similar weight at baseline: the mean values were 99 and 98 kg in
the Atkins and Conventional diet groups respectively.

Table 7.3 Calculation of relative risk and odds ratio using the flu vaccine trial (serological diagnosis).

Table A Incidence of flu is uncommon (from the trial in Box 7.1).

                        Vaccine group    Placebo group
Developed flu           41 (a)           80 (c)
Did not develop flu     886 (b)          831 (d)
Total                   927 (n1)         911 (n2)

Relative risk = a/n1 ÷ c/n2 = (41/927) ÷ (80/911) = 0.50
Odds of developing flu in the vaccine group = 41/886 (a/b)
Odds of developing flu in the placebo group = 80/831 (c/d)
The ratio of the odds is (41/886) ÷ (80/831) = (a × d) ÷ (c × b) = 0.48

Table B Incidence of flu is common (hypothetical results).

                        Vaccine group    Placebo group
Developed flu           300              600
Did not develop flu     627              311
Total                   927              911

Relative risk = (300/927) ÷ (600/911) = 0.49
Odds ratio = (300 × 311) ÷ (627 × 600) = 0.25

Each subject's weight
at a specified time point is compared with that at the start of the trial (baseline
value). The analysis is based on ‘weight at time T minus weight at baseline’,
i.e. the change in weight.# The Atkins diet group lost an average of 6.8 kg
from baseline to three months, compared with 2.7 kg in the Conventional diet
group.
The effect size (mean difference) is −6.8 − (−2.7) = −4.1 kg: the Atkins diet
group lost an average of 4.1 kg more than Conventional diet subjects. The mean
differences in weight change at 6 and 12 months were −3.8 and −1.9 kg respectively. The effect of the Atkins diet seemed largest in the first few months.
The effect size is associated with the average change in weight. In the Atkins
diet group, some individuals lost more than 6.8 kg, some less, and some could
have gained weight or had no weight change. However, the aim is to summarise weight change for a group of people, and not to predict weight change
for an individual.
Using the change in endpoint is a simple approach. It is acceptable if the
baseline values are similar between the trial arms. If they are not, a multivariate linear regression (or analysis of covariance) is preferable. This statistical method uses weight at time T as the ‘outcome variable’, and the baseline
value and treatment groups are ‘covariates’. This analysis also produces the
# See footnote to Box 7.4. Results presented in this chapter are associated with people who
weighed 100 kg at baseline.
Box 7.4 Example: Phase III trial of the Atkins diet2
Location: 3 centres in the United States
Subjects: 63 obese men and women
Design: Randomised controlled trial
Interventions: Atkins diet (low-carbohydrate, high-protein, high-fat) or
Conventional diet (low-calorie, high-carbohydrate, low-fat) for up to 1 year
Main outcome measures: Change in body weight (from baseline) at 3, 6 and
12 months after starting the diet
63 obese individuals randomised: 33 to the Atkins diet, 30 to the Conventional diet

Change in body weight, kg       Atkins          Conventional
(from baseline)                 Mean (SD)       Mean (SD)
3 months later                  −6.8 (5.0)      −2.7 (3.7)
6 months later                  −7.0 (6.5)      −3.2 (5.6)
12 months later                 −4.4 (6.7)      −2.5 (6.3)
SD: standard deviation; change in body weight = weight at 3, 6 or 12 months
minus baseline weight
(The published paper was actually based on the percentage change in body weight from baseline for each subject. To avoid confusion with percentages used in Section 7.1, the effect of
the diets was expressed in kg for the purposes of this chapter, by assuming a baseline weight
of 100 kg, close to the mean value in the trial subjects. For example, at three months the
reported percentage weight loss in the Atkins group was 6.8%, which is the same as a loss of
6.8 kg in someone who initially weighed 100 kg.)
mean difference in the endpoint at time T between the trial arms, but after
allowing for each subject’s baseline value.3
In judging whether a particular change in weight is clinically worthwhile, a loss of about 7 kg is intuitively important, but a loss of 1 kg is probably not worthwhile in someone whose initial weight was 100 kg. When trial endpoints are
on a restricted scale, the effect size should be interpreted in relation to the
scale. For example, pain score is often measured on a visual analogue scale, 0
to 100. A difference in scores of +5 units is a small, perhaps clinically unimportant effect (5/100), but a difference of −30 is not (−30/100).
Table 7.4 Effect sizes, 95% confidence intervals (CI) and p-values from the randomised trial comparing the Atkins and Conventional diets.2

                Mean change in weight, kg*       Effect size, kg
Months after    Atkins        Conventional       Difference in means
baseline        N = 33 (A)    N = 30 (B)         A − B     95% CI           p-value
3               −6.8          −2.7               −4.1      −6.3 to −1.9     0.001
6               −7.0          −3.2               −3.8      −6.8 to −0.8     0.02
12              −4.4          −2.5               −1.9      −5.1 to +1.3     0.26

* The minus sign simply indicates a reduction in weight compared with baseline (weight at time T minus weight at baseline). [See footnote to Box 7.4]
What are the implications of conducting a trial on a sample
of people?
A standard error is estimated using the means and standard deviations (see
page 206). It is a measure of uncertainty over how far the observed mean difference is from the true value. The standard error is used to calculate a 95%
confidence interval (CI) – a range within which the true mean difference is
likely to lie. Three months after starting the diet, the true mean weight loss
in the Atkins group is likely to be greater than that in the Conventional diet,
by 1.9 to 6.3 kg (Table 7.4). The optimistic estimate of 6.3 kg is a large effect,
while the conservative estimate of 1.9 kg may or may not be considered worthwhile. At six months, the effect is less certain because of the wider CI. The
lower estimate of 0.8 kg is unlikely to be worthwhile. At 12 months, the 95%
CI includes the no-effect value of zero: it is possible that there is no real difference in weight change.
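The arithmetic behind these intervals can be sketched with the usual large-sample standard error for a difference between two independent means, sqrt(sd1²/n1 + sd2²/n2). The published analysis used a repeated-measures model, so this is only an approximation, but it reproduces the three-month row of Table 7.4:

```python
import math

def mean_difference_ci(m1, sd1, n1, m2, sd2, n2, z=1.96):
    """Difference between two means with a large-sample 95% CI."""
    diff = m1 - m2
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return diff, diff - z * se, diff + z * se

# Weight change at 3 months (Box 7.4): Atkins -6.8 kg (SD 5.0, n=33),
# Conventional -2.7 kg (SD 3.7, n=30)
diff, lo, hi = mean_difference_ci(-6.8, 5.0, 33, -2.7, 3.7, 30)
print(f"mean difference {diff:.1f} kg, 95% CI {lo:.1f} to {hi:.1f} kg")
# mean difference -4.1 kg, 95% CI -6.3 to -1.9 kg
```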
Could the observed effect size be a chance finding in this
particular study?
Changes in body weight will vary naturally between people. Some on the
Atkins diet will gain weight and some on the Conventional diet will lose
weight, and vice versa. A p-value helps determine whether the observed effect
size (e.g. −4.1 kg) could be a chance finding that is consistent with this natural
variation. At three months the p-value is 0.001 (Table 7.4). An observed effect
size as large as −4.1 kg, or greater, could be seen in 1 in 1,000 trials of the same
size if there were really no difference between the two diets. The effect is therefore unlikely to be due to chance. The benefit of the Atkins diet is likely to be
real. However, at 12 months the p-value of 0.26 indicates that a difference of
−1.9 kg, or greater, could be seen in 26 in 100 studies of the same size, just due
to chance. This is insufficient evidence of a real effect at 12 months.
The p-values in Table 7.4 are two-tailed, because the average weight loss
could plausibly be greater on either diet. A one-tailed p-value should only be
used if the Atkins diet can cause weight loss, but not weight gain, which is not
true. The p-value of 0.001 at three months is therefore based on a difference
in the average weight loss as extreme as 4.1 kg (or greater), in favour of either
the Atkins or Conventional diets.
How good is the evidence?
Although an initial weight loss of 4.1 kg might be considered worthwhile, the effect diminished over time, probably because more subjects came off the diet (or it took longer for the effect of the Conventional diet to be seen). The
trial subjects knew which diet they were on, which could affect the results.
Atkins diet subjects may have started to exercise more, which led to some
of the weight loss. It is difficult to determine what confounders and biases
may be present, but a judgement could be made on whether the magnitude
of the observed effect could be largely explained by these factors. The mean
difference in weight loss at three months was 4.1 kg. Bias or confounding are
unlikely to account for all of this. Changes in behaviour, which may influence
the trial endpoint, could have been monitored and used to examine whether
they could affect the results. Patients could have been asked to record their
exercise levels during the trial.
41% of subjects were unavailable for assessment of body weight at the end
of the trial (43% Atkins and 39% Conventional). The authors suggest this is
largely due to the lack of direct dietary supervision during the trial (subjects
were just given written instructions on what to do after an initial meeting
with a dietician). This could explain why the effect was greatest in the first
few months. The initial large effect might have been maintained if there was
more contact with a dietician.
Effect sizes with a skewed distribution
When using the mean difference, calculating CIs and p-values is simple
because it is assumed that the difference follows a Normal distribution. If
many trials provided a mean difference, the distribution of these differences
would appear symmetrical, or bell-shaped (see Figure 2.1, page 20). In practice, the distribution of the endpoint in each trial arm could be examined using
a probability plot (Figure 2.3, page 24) to check the Normality assumption.
If the endpoint has a skewed (asymmetric) distribution, applying a transformation, such as the logarithm or square root, may produce a Normal distribution. If the distribution remains very skewed, the difference between two
means may not be a good measure of treatment effect. Instead, the difference
between two medians is better. The calculation of the p-value is based on
the ranks of the data, not the actual values, and calculating a 95% confidence
interval is more complex.
7.3 Outcome measures based on time-to-event data
In trials with time-to-event data, for example time to death, or time to first
stroke, the approach described in Section 7.2 can be used if everyone has had
the event of interest. Otherwise, specific methods are available.
Box 7.5 Example: Phase III trial of Herceptin in treating breast
cancer4
Location: Several centres in the United States
Subjects: 3351 women with early breast cancer and HER2 positive tumours
Design: Randomised controlled trial
Interventions: Standard chemotherapy with or without 1 year of Herceptin
Main events of interest: The number who had a breast cancer recurrence, a
new tumour or died.
3351 women randomised: 1672 to Herceptin, 1679 to control

About 2 years later          Herceptin     Control
Number of:
  Recurrences                117           235
  New tumours*               5             20
  Deaths                     62            92

* Not a new breast cancer in the opposite breast
What are the main results?
Box 7.5 is an example of a trial with several time-to-event endpoints.4 The
events of interest were breast cancer recurrence, a new tumour or death. Some
women could have more than one of these, for example, a recurrence before
dying. The main endpoint is therefore the time to whichever event occurred
first, called disease-free survival (DFS) in the published paper. Another endpoint was time to death (overall survival); see Chapter 2.
Time-to-event data is presented graphically, using a Kaplan–Meier plot
(Figure 7.1). DFS rates at specific time points can be read off the graph. For
example, at 3 years 87.1% of women in the Herceptin group were alive and
free from disease, compared with 75.4% in the control group. Alternatively,
12.9% of women had an event in the Herceptin group compared with 24.6%
in the control group; the event rate was approximately halved.
The effect size is the hazard ratio: the risk of an event in one trial arm compared with the risk in the other arm, at the same time point. It can be interpreted in a way similar to relative risk, but is more difficult to calculate by
hand because the time to each event needs to be allowed for (statistical software should therefore be used). For a ‘negative’ event such as death, or disease
occurrence, a hazard ratio <1 means that the new intervention is better than
the control, because the risk of having an event is lower (i.e. subjects have
taken longer to develop the event). For a 'positive' event, such as time until hospital discharge, a hazard ratio >1 indicates benefit, because patients on the new treatment have spent less time in hospital.

[Figure 7.1: Kaplan–Meier plots. Panel (a), disease-free survival: Trastuzumab 133 events vs control 261 events; 3-year rates 87.1% vs 75.4%; hazard ratio 0.48, p<0.0001. Panel (b), overall survival: Trastuzumab 62 deaths vs control 92 deaths; hazard ratio 0.67, p=0.015. Numbers at risk are shown beneath each panel.]

Figure 7.1 Survival curves for disease-free and overall survival for the trial of Trastuzumab (Herceptin) and breast cancer4 (note that the vertical axes have been truncated below 50%, so the curves appear more separated than if the full axis had been shown). Reproduced with kind permission from the New England Journal of Medicine.
In Table 7.5, the DFS hazard ratio is 0.48; the risk of having an event (recurrence, new tumour or dying) in the Herceptin group was about half that in
the control group (risk was reduced by 52%). There were also large effects on
overall survival (hazard ratio 0.67, or 33% reduction in the risk of dying) and
new tumour risk (hazard ratio 0.24, or 76% reduction in risk).
An alternative effect size is the risk difference at a single time point (see also
page 105). The absolute risk difference at three years for DFS is 11.7%: 87.1
minus 75.4% (Figure 7.1). Among 100 women given Herceptin, about 12 more are expected to be alive and disease-free three years after randomisation, compared with 100 women in the control group. The NNT is eight, i.e. to avoid one patient dying, or having a recurrence or new tumour at three years, eight patients need to be given Herceptin (same calculation as in Box 7.2). The time point should be pre-specified in the protocol to avoid selecting one that appears to show the greatest benefit for the intervention.

Table 7.5 Summary results of the trial of Herceptin and breast cancer.4

                          Number of events             Effect size
Endpoint                  Herceptin     Control        Hazard    95% CI          p-value
                          N = 1672      N = 1679       ratio
Disease-free survival1    133           261            0.48      0.39 to 0.59    <0.0001
Overall survival2         62            92             0.67      0.48 to 0.93    0.015
New tumour3               5             20             0.24      0.09 to 0.64    0.002

1. Time to breast cancer recurrence, new tumour or death, whichever occurred first
2. Time to death from any cause
3. Time to diagnosis of a new cancer, excluding new breast cancers in the opposite breast
If patients did not have an event, they were censored at the date last seen.
A risk difference, while useful in a trial report, has limitations because it is specific to a single time point, and so can be affected by chance variation.
A hazard ratio is preferable because it compares the whole survival curve
between the trial arms. However, it assumes the treatment effect is similar
over time: if there is a 25% reduction at three years, there should be a similar
reduction at six years.# When this is clearly not true, the risk difference
at pre-specified time points might be more appropriate. Sometimes, the
median survival time (and 95% CI) in each group is reported (if available).
Median survival is reliable when many events have occurred continuously
throughout the trial, otherwise it can be skewed by the timing of only one or
two events. If the distribution of the time-to-event endpoint is ‘exponential’
(i.e. the event rate is constant over time), the hazard ratio could be estimated
by the ratio of the two median survival times. If the median survival times are
M1=9 months in the new treatment group and M2=6 months in the control
group, the hazard ratio for new vs control is 0.67 (M2/M1).
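A one-line check of this shortcut, assuming a constant event rate (so the hazard in each arm is logₑ2 divided by the median survival time):

```python
import math

median_new, median_control = 9.0, 6.0            # median survival times, months
hazard_new = math.log(2) / median_new            # constant hazard under the
hazard_control = math.log(2) / median_control    # exponential assumption
print(round(hazard_new / hazard_control, 2))     # 0.67, i.e. M2/M1
```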
What are the implications of conducting a trial on a
sample of people?
The 95% confidence interval (CI) for the true hazard ratio (HR) is a range within which the true hazard ratio is likely to lie:

95% CI = antilog of [observed logₑHR ± 1.96 × standard error of the logₑHR]

(The formula for the standard error is not simple, so statistical software should be used to provide the 95% CI; see also page 207.)
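The anti-logging step looks like this in Python. The standard error here (~0.106) is back-calculated from the published disease-free survival interval purely for illustration; in practice it would come from the software.

```python
import math

log_hr = math.log(0.48)   # observed hazard ratio, on the log scale
se_log_hr = 0.106         # illustrative SE, back-derived from Table 7.5
lower = math.exp(log_hr - 1.96 * se_log_hr)
upper = math.exp(log_hr + 1.96 * se_log_hr)
print(f"HR 0.48, 95% CI {lower:.2f} to {upper:.2f}")   # ~0.39 to 0.59
```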
The 95% CI for DFS is narrow, so the estimate of treatment effect is precise;
i.e. it is likely to lie between 0.39 and 0.59, or a risk reduction between 41
and 61% (Table 7.5). Even the most conservative estimate (41% reduction) is a
large effect, so it is possible to be confident that Herceptin is highly beneficial.
There were fewer events (i.e. deaths) associated with overall survival, so the
# Referred to as an assumption of proportional hazards, which appears to hold for most
situations.
standard error of the hazard ratio will be larger, contributing to a wider CI:
7 to 52% reduction in risk. The unexpected effect on new tumours seems large
(76% reduction in risk), but there is a very wide CI (36 to 91%).
The difference between two survival rates at a pre-specified time point,
and 95% CI, can be calculated using the survival rate in the control arm, and
the hazard ratio (this is more reliable than using two rates, each of which is
affected by chance variability):
Three-year DFS rate in control arm (P) = 75.4% (0.754)
Hazard ratio (HR) = 0.48, 95% CI 0.39 to 0.59
Difference in three-year DFS rate (Herceptin − control) = e^(HR × logₑP) − P = e^(0.48 × logₑ0.754) − 0.754 = +0.119 (11.9%)
95% CI for the difference = 9.2 to 14.2% (by substituting the ends of the hazard ratio CI into the above equation)
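A short sketch of the same calculation, which also confirms the interval obtained by substituting the ends of the hazard ratio CI:

```python
import math

def rate_difference(control_rate, hazard_ratio):
    """Difference in survival rates at a time point, from the control-arm
    rate P and the hazard ratio: exp(HR * log(P)) - P."""
    return math.exp(hazard_ratio * math.log(control_rate)) - control_rate

p = 0.754                           # three-year DFS rate, control arm
print(rate_difference(p, 0.48))     # +0.119 -> 11.9%
print(rate_difference(p, 0.39))     # ~+0.142 -> 14.2% (optimistic end)
print(rate_difference(p, 0.59))     # ~+0.092 -> 9.2% (conservative end)
```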
Could the observed effect size be a chance finding in this
particular study?
The p-values in Table 7.5 are all small: the three effect sizes are unlikely to have arisen by chance, if there really were no effect. The p-value for disease-free survival is particularly small (<0.0001), providing strong evidence that Herceptin is effective.
How good is the evidence?
This large trial has clear results on DFS and overall survival, which are clinically important. Although the trial was not blind, it is highly unlikely that
these considerable effects could be explained by bias or confounding. The
95% CI for the main endpoint is narrow with a very small p-value. It provides sound evidence that Herceptin is beneficial in women with early breast
cancer with HER2 positive tumours. The effect on new tumours is less certain.
Although the 95% CI was far from the no-effect value of one, there were only 5 and 20 new tumours in the Herceptin and control arms respectively. Longer
follow-up data, or confirmatory results from other trials, are needed before
making firm conclusions on this endpoint.
Disease- or cause-specific survival curves
A new intervention is sometimes expected to only affect the disease of interest. For example, in trials of mammography screening, the aim is to detect
breast cancers when they are small, which should only affect breast cancer
mortality. Cause-specific survival curves can then be used. An event is ‘death
from breast cancer’, and all other deaths are grouped with people who are
still alive or lost to follow up (i.e. censored). Such curves may show a beneficial treatment effect that would otherwise be masked by using all causes of
death. In other trials this may not be appropriate. Many oncology trials evaluate toxic anti-cancer drugs, which could cause deaths other than from the
cancer of interest. Curves based on overall survival can provide a clearer picture of the treatment effect (see page 28). If cause-specific curves are presented,
the curves based on all other causes of death should also be shown, to confirm that the new intervention has not affected these (Figure 7.2).

[Figure 7.2: risk of dying (%) over 3.5 years of follow-up in 3803 patients given Candesartan and 3796 given placebo; the CV-death curves separate in favour of Candesartan, while the non-CV-death curves overlap.]

Figure 7.2 Cause-specific survival curves from a trial comparing Candesartan with placebo, and the effect on cardiovascular (CV) death, and other causes of death.5 The data indicates that Candesartan had a beneficial effect on CV death, but no effect on other causes. Reproduced with kind permission from the American Heart Journal.
7.4 Interpreting different types of phase III trials
The examples in Sections 7.1 to 7.3 were based on two-arm trials, which aimed
to show whether one intervention was more effective than another (i.e. superiority). These approaches can be applied to other trial designs.
Crossover trials
Here, all subjects receive all the interventions. When the endpoint is ‘taking
measurements on people’, each subject has two values (new intervention and
control), and the difference is taken. The effect size is the mean of these differences over all subjects. Interpretation is the same as a mean difference from
a two-arm trial (Section 7.2), and a 95% CI and p-value are calculated. If the
trial endpoint is based on counting people, a 2 × 2 table can be constructed
(Table 7.6).

Table 7.6 Hypothetical results from a crossover trial comparing Treatment A with placebo in 100 patients with asthma. The outcome measure is the occurrence of an exacerbation or not.

                       Placebo
Treatment A            Exacerbation    No exacerbation
Exacerbation           8               6
No exacerbation        12              74

Table 7.7 Randomised double-blind factorial trial comparing folic acid and other multivitamins in preventing neural tube defect (NTD) pregnancies in 1195 women.6

Trial treatment                  Number with an NTD pregnancy/
Folic acid    Other vitamins     number in trial arm (%)
Yes           No                 2/298 (0.7)
Yes           Yes                4/295 (1.4)
No            No                 13/300 (4.3)
No            Yes                8/302 (2.6)

Relative risk (RR) calculations:
Folic acid vs no folic acid: RR = [(2 + 4)/(298 + 295)] ÷ [(13 + 8)/(300 + 302)] = 0.29, 95% CI 0.12 to 0.71, p-value <0.0001
Other vitamins vs no other vitamins: RR = [(4 + 8)/(295 + 302)] ÷ [(2 + 13)/(298 + 300)] = 0.80, 95% CI 0.38 to 1.70, p-value = 0.70

Patients who had an exacerbation on both interventions (n = 8) or
had no exacerbations at all (n = 74) reveal nothing about whether Treatment
A is better than placebo or not. However, the numbers on the other diagonal (the discordant pairs) are informative (6 vs 12); the odds ratio is 0.5 (6/12).# The odds of suffering an
exacerbation on Treatment A is half that on placebo. Statistical methods are
available to calculate a 95% CI and p-value for the odds ratio. Time-to-event
endpoints are rarely, if ever, used in crossover trials.
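One standard large-sample method for the paired odds ratio (not necessarily the one the book has in mind) uses only the discordant pairs; a sketch with the hypothetical data of Table 7.6:

```python
import math

exac_on_a_only = 6         # exacerbation on Treatment A but not on placebo
exac_on_placebo_only = 12  # exacerbation on placebo but not on Treatment A
odds_ratio = exac_on_a_only / exac_on_placebo_only           # 0.5

# Large-sample SE of log(OR) for paired data = sqrt(1/6 + 1/12)
se = math.sqrt(1 / exac_on_a_only + 1 / exac_on_placebo_only)
lo = odds_ratio * math.exp(-1.96 * se)
hi = odds_ratio * math.exp(+1.96 * se)
print(f"OR {odds_ratio:.2f}, 95% CI {lo:.2f} to {hi:.2f}")   # ~0.19 to 1.33
```

With only 18 discordant pairs the interval is wide and includes 1, so these hypothetical data alone would not establish a treatment effect.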
The analysis of crossover trials can allow for a period effect; i.e. whether the
effect size for Treatment B when preceded by A is different from Treatment A
when preceded by B. Although the time interval between treatments should
be long enough to minimise a carryover effect from one treatment to the next
(see page 58), there are statistical methods that can allow for this.
Factorial trials
A factorial trial can efficiently compare two or more new interventions.
Table 7.7 shows the results from a trial evaluating folic acid and other multivitamins in preventing neural-tube defect pregnancies.6 A large (71%) risk
reduction was associated with folic acid (relative risk 0.29), with a small p-value, but there was no evidence of an effect with other vitamins. The conclusion was to recommend folic acid only.
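The two marginal relative risks in Table 7.7 come straight from pooling across the other factor; a quick check:

```python
# Folic acid vs no folic acid, pooling over the other-vitamins factor
rr_folic = ((2 + 4) / (298 + 295)) / ((13 + 8) / (300 + 302))
# Other vitamins vs no other vitamins, pooling over the folic acid factor
rr_vitamins = ((4 + 8) / (295 + 302)) / ((2 + 13) / (298 + 300))
print(round(rr_folic, 2), round(rr_vitamins, 2))   # 0.29 0.8
```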
Factorial trials can also be used to detect an interaction between two interventions, i.e. if the effect size for one treatment depends on whether the subject
has received the other treatment or not. In Figure 7.3, Treatment A increases
the response rate by two percentage points (from 1 to 3% or 2 to 4%). Whether
subjects also received Treatment B or not does not matter. There is no interaction. However, the effect of Treatment C depends on whether D was given,
and vice versa: there is an interaction. Statistical methods can be used to investigate interactions (see Section 7.6), and provide p-values for them. When a
clear interaction exists, the effect size for each treatment combination should be
reported, as well as the main effect for each treatment.
# This has a different calculation to the odds ratio from a two-arm trial (Table 7.3).
[Figure 7.3 consists of two panels plotting the percentage who respond (0 to 5%). In the first ('No interaction between A and B'), Treatment A increases the response rate by the same amount whether or not Treatment B is given. In the second ('Interaction between C and D'), the effect of Treatment C depends on whether Treatment D is given.]
Figure 7.3 Illustration of an interaction between two treatments.
Equivalence and non-inferiority trials
For superiority trials one treatment is considered better than another when
the CI for the effect size excludes the no-effect value, and the p-value is small
(Box 7.6). For equivalence and non-inferiority trials the maximum allowable difference (MAD) is considered. This is a clinically important effect size, above which it is concluded that one intervention is better than the other (see page 67). When comparing two interventions which are expected to be similar, a p-value alone is of limited use. Although it needs to be ≥0.05, any trial with few subjects can produce large p-values, even when there is a real treatment effect. If the p-value is <0.05, it is likely that the two interventions have a different effect, and one might be chosen over the other.

Box 7.6 Interpreting different trials comparing interventions A and B

Objective                     Objective is met when:
Superiority                   The 95% confidence interval excludes the no-effect value
(A is better than B)
Equivalence                   The 95% confidence interval includes the no-effect value and the interval is completely within the MAD range
(A is similar to B)
Non-inferiority               The 95% confidence interval does not cross one end of the MAD range (i.e. the end that indicates 'A' is worse)
(A is not worse than B)

MAD: maximum allowable difference
Box 7.7 and Figure 7.4 show how to interpret data for equivalence or non-inferiority studies using 95% CIs. For equivalence trials, it is easiest to interpret results where the CI is completely within or completely outside of the
MAD range. When the CI overlaps the MAD limit, it is not possible to reliably conclude whether the interventions have an equivalent effect or not. For
non-inferiority studies, the new treatment is not considered worse than the control if the CI does not cross the end of the MAD range associated with
the new treatment being worse. Unless these trial types are large enough to
produce precise estimates of treatment effect, CIs may be difficult to interpret.
Cluster randomised trials
The analyses described above apply to trials in which individual subjects are
randomised to the trial interventions. In a cluster randomised trial, groups of
people are randomised to each intervention (see page 61). A trial comparing
two educational programmes could randomise schools to programme A or B.
All children in the same school receive the same intervention. Variability exists
between children in the same school, and between schools. Analysing the trial
as if the children themselves were randomised, assumes independence in their
responses. However, children within a school may be more similar than children between schools. Allowance should be made for this within-school variability (the intra-class correlation).
Suppose all children in a particular school have the same test score. Assessing more than one child from each school adds no information, and the number of independent observations would equal the number of schools. However, in reality, there would be variability within a school. By ignoring the
within-school (intra-class) correlation, the p-value for an effect size could be
smaller than it should be, producing a statistically significant result and an
incorrect conclusion.8,9 However, if the number of people within a cluster is
small, the within-cluster variation will have a minimal effect, and the results
of the trial should be similar to those obtained by assuming the data came
from a standard trial where subjects themselves were randomised.
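A standard way to quantify this, not spelled out in the text, is the 'design effect', 1 + (m − 1) × ICC, which inflates the required sample size for a cluster design. A sketch with hypothetical numbers:

```python
def design_effect(cluster_size, icc):
    """Inflation factor for a cluster randomised trial:
    1 + (m - 1) * intra-class correlation."""
    return 1 + (cluster_size - 1) * icc

n_individual = 400   # hypothetical size for an individually randomised trial
m, icc = 20, 0.05    # hypothetical cluster size and intra-class correlation
print(round(n_individual * design_effect(m, icc)))   # 780 subjects needed
print(design_effect(2, 0.05))                        # ~1.05: tiny clusters change little
```

When clusters are small or the intra-class correlation is near zero, the design effect is close to 1, matching the remark above.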
Repeated measures
When several measurements of the same endpoint are taken on each subject,
they are likely to be correlated, and the effect size and p-value need to allow
for this. A repeated measures analysis of variance or covariance can be performed. In the Atkins diet trial (Box 7.4), body weight was measured at three
time points, and the data were analysed using this approach. The analysis
Box 7.7 Example of a phase III non-inferiority trial comparing two
methods of delivering cognitive behavioural therapy to people with
obsessive compulsive disorder (OCD)7
Location: 2 psychology outpatient departments in the UK
Subjects: 72 individuals aged ≥16 years with obsessive compulsive disorder
Design: Randomised controlled trial
Interventions: Cognitive behaviour therapy (10 weekly sessions) delivered
either by telephone or face-to-face
Justification for trial: ‘Face-to-face’ therapy involves waiting lists and some
people are unable to attend clinic appointments. Delivering therapy by telephone should increase access to treatment
Trial objective: ‘Telephone’ is not worse than ‘face-to-face’
Main outcome measure: Score on the Yale Brown obsessive compulsive
checklist (range 0 to 40, high score indicates more severe symptoms)
Maximum allowable difference (MAD): 5 units on the checklist (if the true
mean difference is at least +5 units then ‘telephone’ is judged to be worse; if
it is less than +5 then ‘telephone’ is not inferior)
72 individuals with OCD randomised: 36 to 'telephone', 36 to 'face-to-face'

Yale Brown score six months later     Telephone      Face-to-face
Mean (standard deviation)             14.2 (7.8)     13.3 (8.6)

Effect size: adjusted mean difference +0.55, 95% CI −3.15 to +4.26
('Telephone' minus 'face-to-face'; adjusted for baseline score, hospital site and depression score)
• The 95% CI for the true mean difference is between −3.15 and +4.26 units
• This is below +5 units, so 'telephone' is considered not inferior
• However, the 95% CI is completely within the MAD range of ±5 units, so it can also be concluded that the two interventions have an equivalent effect
Figure 7.4 Illustration of interpreting effect sizes and confidence intervals from equivalence or
non-inferiority trials using the example described in Box 7.7.
Objective: Comparing ‘telephone’ with ‘face-to-face’ delivery
Maximum allowable difference (MAD): 5 units on the Yale Brown obsessive compulsive
checklist, indicated by the shaded region.
Mean difference = mean score using ‘telephone’ minus mean score using ‘face-to-face’ delivery.
could produce a single p-value for comparing the two diets, but also one for
each time point (Table 7.4). Mixed modelling is another statistical method for
this type of data. If multiple time points are analysed separately, the p-value
for each should be inflated to allow for this.
There is sometimes a view that having a large trial can be avoided by measuring the same endpoint many times on fewer subjects. However, a study of
10 subjects, each with 10 measurements of the endpoint, is not the same as one
measurement on 100 subjects. Although both produce 100 data values, there
are still only 10 subjects in one study.
7.5 More on confidence intervals
The CI width depends on the standard error, which is derived from the trial
size, and the number of events (when ‘counting people’ or using time-to-event
endpoints), or the standard deviation (when ‘taking measurements on people’). With few events, even a large trial could produce a large standard error.
Generally, there is a relationship between study size and the strength of the
conclusions that can be made (Figure 7.5).
Treatment effects are clearer, and the precision is higher, with large studies (Table 7.8). It is important, therefore, that the effect size used to calculate
sample size is realistic, so that the trial is large enough. Suppose the
expected relative risk were 0.75, but the observed result was 0.85, with 95% CI
0.69 to 1.05. The upper limit is just above the no-effect value. The true effect
size is probably closer to a 15% reduction in risk than a 25% reduction, but
it requires a larger sample size to reliably show this, i.e. for the result to be
statistically significant.
Figure 7.5 How study size affects conclusions.
7.6 More on p-values
The size of a p-value is influenced by the effect size and standard error. Large
effect sizes or small standard errors produce small p-values. A small trial, or
those with few events or large standard deviation, can each contribute to a
large p-value (>0.05).
By convention, if the p-value is <0.05, the observed effect size is considered
statistically significant; it is unlikely to have arisen by chance. If the p-value
is ≥0.05, the effect size is not statistically significant: there is insufficient evidence of a true effect (Box 7.8). There is nothing very scientific about the cut-off of 0.05. It is generally accepted to indicate that a real effect is likely to exist
because there is only a 1 in 20 likelihood that the results could have arisen by
chance, assuming no true effect (i.e. a treatment effect could be falsely concluded 5% of the time). There is always some possibility, however small, that
any observed effect size could be a chance finding, rather than reflect a real
difference, but the smaller the p-value, the less likely that this is the case.
It is incorrect to conclude ‘there is no effect’ when the effect size is not statistically significant. It only means that there is insufficient evidence to claim
an effect. P-values should not be reported as ‘<0.05’, or ‘≥0.05’ or ‘Not statistically significant’, because it is not possible to distinguish a p-value of
0.045 from <0.0001, or 0.06 from 0.57, yet they provide very different levels
of evidence. If an effect size is not statistically significant, there are several possible reasons (listed below, after Table 7.8 and Box 7.8):
Table 7.8 Confidence intervals for trials with the same estimate of relative risk as in Box 7.1, but with different sample sizes.

                         Risk in          Risk in          Relative    Confidence
                         vaccine group    placebo group    risk        interval
Trial 1/10 as big        4/90             8/90             0.50        0.16 to 1.60
Observed trial           41/927           80/911           0.50        0.35 to 0.72
Trial 10 times as big    410/9,270        800/9,100        0.50        0.44 to 0.56
Trial 100 times as big   4,100/92,700     8,000/91,000     0.50        0.48 to 0.52
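The narrowing of the intervals in Table 7.8 can be reproduced with the same log-scale standard error used earlier for a relative risk (a textbook approximation, matching the table to rounding):

```python
import math

def rr_ci(a, n1, c, n2, z=1.96):
    """Relative risk with a large-sample 95% CI (log scale)."""
    rr = (a / n1) / (c / n2)
    se = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)

# Same relative risk of 0.50, increasing trial sizes (Table 7.8)
for a, n1, c, n2 in [(4, 90, 8, 90), (41, 927, 80, 911),
                     (410, 9270, 800, 9100), (4100, 92700, 8000, 91000)]:
    rr, lo, hi = rr_ci(a, n1, c, n2)
    print(f"{a}/{n1} vs {c}/{n2}: RR {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```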
Box 7.8 P-values

Definition: the probability that an effect as large as that observed, or more extreme, is due to chance, assuming there really were no effect.

All p-values lie between 0 and 1, on a scale running from 1 ('definitely no effect') down to 0 ('definitely an effect'). P-values ≥0.05 are not statistically significant; p-values <0.05 are statistically significant, and the evidence of a real effect gets stronger as the p-value falls through 0.01 and 0.001 towards 0.
All p-values should be two-sided, except when one-sided tests are required
because of the study design, such as in non-inferiority trials. In general,
p-values larger than 0.01 should be reported to two decimal places, those
between 0.01 and 0.001 to three decimal places, and those smaller than 0.001
should be reported as p < 0.001.
1. There really is no difference
2. There is a real difference, but by chance the sample of subjects did not
show this
3. There is a real difference, but the trial had too few subjects, and therefore
insufficient power, to detect it.
Large effect sizes can have p-values just above 0.05, such as 0.06. Although
strictly not statistically significant, according to the 0.05 cut-off, a possible real
treatment effect should not be dismissed. The trial was probably too small.
Had it been larger, the p-value may have been smaller. Furthermore, a p-value
of 0.048, while considered statistically significant, does not provide strong evidence of a treatment effect.
Calculating p-values
P-values come from performing a statistical test. The choice of test depends
on the type of outcome measure. Some simple tests can be done by hand, but
using a statistical software package avoids error, and they can cope easily with
large datasets, providing all the information needed to interpret trial results,
i.e. the effect size, 95% confidence interval and p-value. Box 7.9 shows some
common statistical tests (details in the references on page 203). The multivariate (or multivariable) methods can be used to:
• Adjust for imbalances in baseline characteristics or other potential confounders (see page 88)
Box 7.9 Statistical methods that produce p-values according to type of endpoint (the multivariate and Cox's regression also provide effect sizes)

                      Counting people          Taking measurements on      Time-to-event
                      (binary/categorical      people (continuous data)    data4
                      data)

Two-arm trial         Chi-square test (or      Unpaired or two-sample      Log rank test
(unpaired data)       Fisher's exact test      t-test if the difference
                      if the trial is          between the means is
                      small, <30)              Normally distributed1;
                                               Mann-Whitney U test if
                                               the distribution of the
                                               difference is skewed2

Crossover or          McNemar's test           Paired t-test if the        Not applicable
split-person trial                             difference is Normally
(paired data)                                  distributed; Wilcoxon
                                               matched pairs test if the
                                               distribution of the
                                               difference is skewed

Allow for other       Multivariate logistic    Multivariate linear         Cox's regression
factors such as       regression               regression3
baseline imbalances

1. With more than two trial arms the test is analysis of variance (ANOVA)
2. With more than two trial arms the test is Kruskal–Wallis ANOVA
3. Outcome measure should be approximately Normally distributed
4. If all subjects have the event of interest, tests for 'taking measurements on people' can be used
• Adjust for the stratification factors used in randomising subjects (see page 84)
• Investigate an interaction between a treatment and a prognostic factor (sub-group analyses, see page 119), or between two treatments (factorial trial).
Other methods, such as Bayesian statistics, attempt to incorporate prior evidence, but they are not commonly used to analyse clinical trials because of
their complexity and the difficulty in determining how much importance
should be given to the previous evidence.
Multiple endpoints
Some trials have several primary endpoints. Using the 0.05 p-value cut-off,
an error rate of 5% is allowed, meaning that one spurious effect (false-positive
result) is expected in every 20 comparisons. The more comparisons performed
on the same data, the more likely that a spurious effect is found, i.e. an effect
size with a p-value <0.05, but the effect was due to chance. When there are
multiple primary outcome measures, p-values <0.05 might be adjusted using
methods such as a Bonferroni correction. A p-value of 0.02 becomes 0.06 if
there are three comparisons (0.02 × 3). However, this assumes the outcome
measures are uncorrelated, which may not be true. Adjusting p-values in this
way could inflate them too much, and a real treatment effect could be missed.
A very small p-value (e.g. <0.001) is unlikely to be affected by several comparisons. It may be preferable to present the unadjusted p-values, with a suitable note of caution if they are just below 0.05, together with 97.5% confidence intervals for, say, two or three comparisons, and 99% limits for more, because these provide more conservative estimates of the range of the true effect.
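The Bonferroni arithmetic itself is trivial; a sketch with hypothetical p-values, including the 0.02 × 3 example above:

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons (capped at 1).
    Conservative when the outcome measures are correlated."""
    k = len(p_values)
    return [round(min(1.0, p * k), 3) for p in p_values]

print(bonferroni([0.02, 0.04, 0.30]))   # [0.06, 0.12, 0.9]
```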
7.7 Relationship between confidence intervals, the
no-effect value and p-values
It can be inferred from the p-value whether the CI contains the no-effect value.
The CI also indicates whether the effect size is statistically significant (Box
7.10). If the CI excludes the no-effect value, the result is statistically significant,
otherwise, it is not statistically significant. This is because a 95% CI and a p-value cut-off of 5% both allow an error rate of 5%.
Box 7.10 Relationship between confidence intervals and statistical significance

Effect size                          No-effect   95% confidence
                                     value       interval       p-value
Relative risk, odds ratio,           1           includes 1     Effect size is not statistically significant (p-value ≥0.05)
hazard ratio                                     excludes 1     Effect size is statistically significant (p-value <0.05)

Risk difference, % excess risk,      0           includes 0     Effect size is not statistically significant (p-value ≥0.05)
% risk reduction, difference                     excludes 0     Effect size is statistically significant (p-value <0.05)
between two means (or medians)
Figure 7.6 Relationship between confidence intervals and p-values. Sometimes it helps to
consider how far the effect size is from the no-effect value in terms of number of standard errors.
As this distance increases, the p-value gets smaller.
When the results are statistically significant, the size of the p-value can indicate how far the 95% CI is from the no-effect value (Figure 7.6). The smaller
the p-value, the further away the CI. The results in Table 7.4 show this. The
effect size at three months (−4.1 kg) has a small p-value (0.001) so the CI is
far from the no-effect value. At six months, the p-value is larger (0.02) and the
interval is closer to the no-effect value. A p-value of 0.05 indicates one of the
limits is the no-effect value.
7.8 Intention-to-treat and per-protocol analyses
In Box 7.1, all randomised subjects received the allocated intervention (the
flu vaccine). Treatment compliance was complete, i.e. 100%. In other studies, especially those that involve taking drugs or using medical devices at
home, some people may not start treatment at all, and others will start, but
stop before the protocol specifies they should. Also, the drug dose could be
reduced, or subjects switch over to the other trial arm. These are all called non-compliers, and they often have different characteristics to compliers. There may be good reasons why subjects did not comply, such as intolerable side-effects. Any change from the protocol treatment schedule is a protocol violation or deviation ('violation' indicates one that could significantly affect
the study design or results). A non-complier may be different from a subject who withdraws (page 118). The trial endpoint might be measurable on a
non-complier, but not often for a withdrawal.
Figure 7.7 Hypothetical trial comparing two treatments in 100 patients with lung cancer,
illustrating possible ways to deal with non-compliers to the allocated treatment. The main
outcome could be survival after one year.
There are several ways of dealing with non-compliers in the analysis; see
Figure 7.7. The 10 inoperable patients have more advanced disease, and
therefore a poorer survival. Surgery would appear to have a better survival
under options A and B, when there could be no real difference, because these
patients are either ignored in the surgery group or, worse still, added to the
radiotherapy group. In Option C, it is difficult to identify and remove 10
equivalent patients in the radiotherapy group to the 10 inoperable patients
in the surgery group. Options A to C remove patients from the trial, or move
them between arms, negating the balance achieved by the randomisation process, and possibly creating bias or confounding.
Option D is the most reasonable – called an intention-to-treat analysis.
Subjects are analysed according to the arm to which they were randomised,
regardless of whether they took the allocated intervention or not. This maintains the balance in baseline patient characteristics. The effect size reflects
what could happen in practice, because not all people will take the intervention, or some may stop early because of side-effects. The analysis usually
produces a conservative effect size because some people could have benefited
from the new intervention had they taken it. However, the scientific advantage
of having two balanced trial arms that are unaffected by bias or confounding,
outweighs having an under-estimated effect size. All trials should be analysed
in this way.
A per-protocol analysis only includes subjects who took their allocated
treatment as specified in the trial protocol,# i.e. compliers (Option B in Figure 7.7). This is used for equivalence or non-inferiority trials. The endpoint is expected to be similar among non-compliers between the trial arms, so including them in an intention-to-treat analysis could make two interventions appear to have a more similar effect than they really do. A per-protocol analysis should be used in addition to an intention-to-treat analysis to confirm that the interventions have a comparable effect in compliers.

# A non-complier can be defined in several ways in a particular trial. For example, it could be only those subjects who stopped treatment completely; those whose dose was reduced might still be regarded as compliers.
Examining the effect size only among those who did comply may also be
useful when the proportion of compliers is clearly different between the trial
arms (acknowledging that some balance in subject characteristics may be lost).
Alternatively, a multi-variate method can produce an estimate of the effect
size after allowing for level of compliance (Box 7.9). Again, these analyses
could confirm consistency with the intention-to-treat analysis.
7.9 Randomised subjects who are ineligible and
subject withdrawals
A randomised subject could later be found to be ineligible according to the inclusion and exclusion criteria; this is called a protocol violation or deviation.
Small deviations should not matter, but large ones may. Suppose in a trial
of newly diagnosed asthma patients, someone who has started the trial treatment is later found not to have asthma. He is, therefore, not expected to benefit from the new treatment, which would then be stopped.
There are two options: include or exclude the subject from the analysis. Neither is perfect. The choice depends on the disease, the interventions being
tested and how far the subject deviates from the eligibility criteria. When there
are few ineligible subjects with significant protocol violations, they might be
excluded. However, keeping them in would be consistent with an intention-to-treat analysis, because this is what would happen in practice. If there are
many ineligible subjects, the reasons should be investigated. There may have
been a problem with the trial design or recruitment. The number of ineligible
subjects and the reasons should be recorded and reported.
Although trial subjects agree to participate until the end, some may withdraw early (subject withdrawal, drop-out, or lost to follow up), for example because of side-effects. When the endpoint is based on ‘counting people’,
withdrawals can be included in the denominator of the risk in each arm. In
the flu vaccine trial (Box 7.1), some subjects provided no blood sample, so it
is not known whether they had serological flu at five months or not. There
were 25 and 22 withdrawals in the vaccine and placebo groups respectively.1
However, the risk of developing serological flu was based on the number randomised, and not 902 (927 – 25) and 889 (911 – 22). The endpoint could be
thought of as ‘known serological flu’. With ‘time-to-event’ endpoints, subjects
who withdraw are censored in the analysis at the time they were last seen.
In both cases, all subjects can be included in an intention-to-treat analysis, though statistical power might be reduced because the number of
events is less than the number originally expected. A loss of power could be
avoided by inflating the target sample size to allow for possible withdrawals.
However, when ‘taking measurements on people’, it is difficult to include
withdrawals since there is no value to use in calculating the mean (or median).
Such subjects are often excluded from the analysis, though statistical methods,
called imputation, can attempt to estimate the missing values.
When there are relatively few withdrawals and the number is similar
between the trial arms, the results are unlikely to be materially affected if such
patients are excluded. Attempts should be made to keep the number of patient withdrawals to a minimum, but if withdrawals do occur, they should be recorded and mentioned in the final report.
7.10 Sub-group analyses
The effect size is often examined to see if it differs between sub-groups of
subjects (e.g. by gender). The ultimate purpose of this should be clear. If the
benefit is greater in a certain group of subjects than another, but there is still
a clear benefit in both groups, all future subjects would still be offered the
new treatment; the sub-group analysis simply provides additional information about the treatment. A problem arises when a sub-group analysis is used
to determine who will and will not receive the new treatment. There needs to
be very clear and convincing data on this, in order to avoid withholding an
effective therapy from future individuals.
Sub-group analyses should be specified at the start of the trial in the protocol, or performed when there is good scientific justification. Otherwise, they
could look like a ’fishing expedition’. This is particularly so when no overall
treatment effect is found, and sub-groups are examined in the hope of finding an effect. Alternatively, the effect of a new intervention in a particular
sub-group could, by chance, appear larger than it really is. If there is prior evidence that a treatment effect is influenced by important prognostic factors, the
sample size could be increased to allow sufficient statistical power to examine
this reliably.
Sub-group analyses are usually presented as a forest plot (Figure 7.8). For
example, the relative risks and 95% CIs are 0.36 (0.13 to 0.97) among males
and 0.68 (0.33 to 1.43) among females. It is incorrect to conclude that the vaccine was effective in males but not in females simply because one CI excludes the no-effect value and the other includes it.10 The point estimates for both males
and females indicate a benefit, but the wide CIs come from having a smaller
sample size in each sub-group. What matters is whether the CIs fail to overlap, or whether they exclude the overall effect size. In Figure 7.8, no factor does this.
Another approach is to perform a multi-variate statistical analysis (Box 7.9)
and obtain a p-value for an interaction or heterogeneity test.
There are several issues to consider. First, dividing the data into smaller
groups of subjects produces wider 95% CIs making it more difficult to find
statistically significant results. Second, if very different effect sizes are found
between the sub-groups, some plausible explanation is expected, which may
be difficult. Third, the more sub-group analyses performed, the more likely
that a spurious treatment effect is found. An example is shown in Table 7.9.11
[Figure 7.8: forest plot of the relative risk of flu (with 95% CI), on a scale from 0.1 ('vaccine better') to 3 ('placebo better'), for sub-groups defined by health status (at-risk patients vs other/healthy), gender (men vs women), age (60-69 vs 70+ years) and previous vaccination (yes vs no).]
Figure 7.8 Forest plot showing the results of sub-group analyses in the flu vaccine trial (Box 7.1),
based on clinician-diagnosed flu five months after vaccination. The figure shows the effect size,
i.e. relative risk, in different groups of trial subjects. The solid vertical line is the relative risk
among all subjects (0.50), and the dashed line is the no-effect value.
Table 7.9 Trial of aspirin versus placebo in treating 17,000 patients with suspected acute myocardial infarction.11

                     % with vascular death after 1 month*
Astrological sign    Aspirin      Placebo      Relative risk                p-value
                     N = 8553     N = 8610     (95% confidence interval)
Libra or Gemini      11.1%        10.2%        1.09 (0.88 to 1.35)          0.50
All other signs      9.0%         12.1%        0.74 (0.68 to 0.82)          <0.0001
All patients         9.4%         11.8%        0.80 (0.73 to 0.87)          <0.0001

* There were 1820 deaths in total. The number of patients in each group and the confidence intervals were estimated from data given in the reference.11
Aspirin appears to be effective only in people who are not Libra or Gemini, but there is no biological plausibility for this. Large trials can produce spurious sub-group results that are precise, with statistically significant interactions. Adding the caveat that any such sub-group analysis should be viewed with caution may not prevent the results from being misinterpreted.
Consideration should be given to whether sub-group analyses are a sensible approach, and to the possibility of finding false-positive effects. It may be
preferable not to present any, or to use them only to generate hypotheses
for further studies.
7.11 Safety, toxicity or adverse events
Table 7.10 shows selected side-effects from a trial of patients with osteoarthritis or rheumatoid arthritis.12 Effect sizes and 95% CIs could be presented for
each row, but the table could look unwieldy. Instead, summary measures
are presented for groups of side-effects. The sum of the numbers with gastrointestinal effects in the NSAID group (640 + 522 + 392 + 370 + 234) is 2158,
greater than the total of 1465. This is because patients can have more than one
type of event, so appear in more than one row. It should be made clear whether
the number of events or number of patients with an event is reported. Where
people can suffer several (perhaps related) side-effects, it is usually preferable
to report the number of affected subjects rather than the number of events, because the extent of harm in one trial arm could be over-estimated. When a particular side-effect is recorded on several occasions, a common approach is to report the most severe grade for each subject.

Table 7.10 The number of patients with specified side-effects from a randomised trial comparing Celecoxib (a COX 2-specific inhibitor) and nonsteroidal anti-inflammatory drugs (NSAIDs) for treating osteoarthritis and rheumatoid arthritis.12

Side-effects after six months of treatment      NSAID (N = 3981)   Celecoxib (N = 3987)
                                                n (%)              n (%)
Gastrointestinal
  Dyspepsia                                     640 (16.1)         575 (14.4)
  Abdominal pain                                522 (13.1)         387 (9.7)
  Diarrhoea                                     392 (9.8)          373 (9.4)
  Nausea                                        370 (9.3)          277 (6.9)
  Constipation                                  234 (5.9)          68 (1.7)
  Any (each patient counted once)               1465 (36.8)        1250 (31.4)
  Risk difference % & 95% confidence interval*: +5.4 (+3.4 to +7.5), p-value < 0.0001

Cardiovascular
  Stroke                                        10 (0.3)           5 (0.1)
  Myocardial infarction                         11 (0.3)           10 (0.3)
  Angina                                        22 (0.6)           24 (0.6)
  Any (each patient counted once)               39 (1.0)           37 (0.9)
  Risk difference % & 95% confidence interval*: +0.05 (−0.4 to +0.5), p-value = 0.81

* estimated using the reported results
In Table 7.10, the risk of suffering a gastrointestinal toxicity was greater in the NSAID group than in the Celecoxib group (36.8 vs 31.4%). The relative risk was 1.17 (36.8% ÷ 31.4%), with 95% CI 1.10 to 1.25: NSAIDs increased the risk of having a gastrointestinal side-effect by 17%. The absolute risk difference (+5.4%) perhaps better indicates the extent of harm, because it gives the number of affected individuals: among 100 patients given NSAIDs, an extra 5.4 are expected to have a gastrointestinal side-effect, compared with 100 given Celecoxib. The number needed to harm (NNH) is 18 (100/5.4), similar in principle to the number needed to treat (Box 7.2). For every 18 patients given NSAIDs rather than Celecoxib, one extra patient is expected to have a gastrointestinal side-effect attributable to NSAIDs.
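These calculations can be reproduced directly from the counts in Table 7.10; a minimal Python sketch, using the standard log-scale formula for the CI of a relative risk:

    import math

    # 'Any gastrointestinal side-effect' from Table 7.10
    a, n1 = 1465, 3981   # NSAID
    b, n2 = 1250, 3987   # Celecoxib

    rr = (a / n1) / (b / n2)
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)            # SE of log(RR)
    lo, hi = (math.exp(math.log(rr) + k * 1.96 * se) for k in (-1, 1))
    rd = 100 * (a / n1 - b / n2)                       # risk difference, in percentage points
    nnh = 100 / rd                                     # number needed to harm
    print(f"RR {rr:.2f} (95% CI {lo:.2f} to {hi:.2f}); RD +{rd:.1f}%; NNH {nnh:.0f}")
    # RR 1.17 (1.10 to 1.25); RD +5.4%; NNH 18 -- matching the values in the text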
The results in Table 7.10 are for a treatment period of six months, but it is
sometimes important to detect late effects.
7.12 Interim analyses and stopping trials early
Interim analyses involve examining the data when subjects are still being
recruited or sometimes treated. They could be used to change the trial design,
or decide whether the trial should continue or stop early.
Revising the sample size may be necessary when the early analysis indicates
that the effect size used in the sample-size calculation was too large, or there
are fewer events than expected. This could be due to having narrow eligibility
criteria, or improvements in the standard of care. Increasing the sample size or the length of follow-up should increase the number of events. Sample size should
not be reduced when the effect size used in the sample-size calculation is later
considered too small, unless there is clear and convincing evidence for this.
A trial may stop early for several reasons:
• Poor recruitment – the trial is highly unlikely to finish in a reasonable time frame
• New evidence – information becomes available, perhaps from another trial, which makes recruitment to one or more arms of the current trial unethical or unacceptable
• Harm – the new intervention is clearly more harmful than the control; it is almost always appropriate to consider stopping early when this occurs
• Superiority – the new treatment is judged with sufficient certainty to be more beneficial than the control
• Futility – it is judged that there is unlikely to be a clinically important treatment effect if the trial continued to the end; alternatively, if the new intervention has more side-effects or is more expensive, the true effect size is unlikely to be large enough to justify its use.
The number and timing of the interim analyses can depend on the interventions being tested, but one or two analyses (say, after half, or after a third and two-thirds, of subjects have been recruited) often seem appropriate. There could be more analyses, particularly early on, if the focus is on safety.
Safety is assessed by determining whether adverse events are likely to be
caused by the trial treatment, and examining their severity and frequency,
and whether they are easily treated. This is often separate from harm that
is directly associated with the efficacy outcome measure (see page 181).
Stopping early for superiority may be justified when there is a large effect
size and a narrow 95% CI. A stopping rule involves pre-specifying a p-value
cut-off at each interim analysis, below which the recommendation is to stop
the trial. The p-value used for the final analysis may then be reduced. Smaller
p-values could be specified at earlier analyses because stronger evidence is
required when there are relatively few subjects.13 For example, with two
interim analyses, the first p-value could be <0.0005, and the second <0.014.
To claim statistical significance in the final analysis, the p-value would need
to be <0.045.# The overall p-value, allowing for three analyses, is about 5%.
Alternatively, a stringent stopping rule is to specify that any interim p-value
must be <0.001 (referred to as the Peto–Haybittle rule). The cut-off for the final
analysis p-value could still be 0.05.# Sample size can be increased to allow for
interim analyses.
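As a sketch of how such rules operate, the short Python fragment below encodes the two sets of boundaries described above. The example p-values are hypothetical; in a real trial the cut-offs would come from the pre-specified alpha-spending plan.

    # p-value boundaries from the text: two interim analyses plus the final analysis
    obrien_fleming_style = [0.0005, 0.014, 0.045]
    peto_haybittle       = [0.001, 0.001, 0.05]

    def first_stop(p_values, boundaries):
        # Return the first analysis (1-based) at which the rule recommends stopping
        for i, (p, cutoff) in enumerate(zip(p_values, boundaries), start=1):
            if p < cutoff:
                return i
        return None  # no boundary crossed

    # e.g. interim p-values of 0.01 and 0.0006, then 0.03 at the final analysis
    print(first_stop([0.010, 0.0006, 0.030], obrien_fleming_style))  # -> 2
    print(first_stop([0.010, 0.0006, 0.030], peto_haybittle))        # -> 2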
Stopping early for futility can be difficult. Current trial data is used to predict the future effect size if the trial continued. Complex statistical methods
can estimate the probability of getting the expected effect size in the future
given the data now, but they are based on several assumptions. Examining
the 95% CI gives a simple estimate of the future true effect size. If it completely excludes a clinically important difference, the intervention is unlikely
to be effective. For example, a lower confidence limit of 0.96, when the expected relative risk is 0.75, indicates that a clinically important benefit is very unlikely.
Several considerations arise when interpreting interim analyses of
efficacy.14 First, a statistically significant effect could be found, when there
really is no effect, if many analyses are performed. This is minimised by having a stringent stopping rule. Second, the analyses could be based on a sample size that would not persuade health professionals to change practice. An
interim analysis of 10,000 subjects, out of a target of 20,000, is probably sufficiently large, particularly if there are many events. However, halfway through
a two-arm trial with a target of 500 subjects represents only 125 in each arm.
The effect size might be statistically significant, but with a wide CI, which
would not be convincing enough to change practice. (See the results for ‘new
# It is not 5% because some of the allowed 5% error rate has been ‘used up’ at early
analyses, called alpha spending (the error rate can be notated by α). In the Peto–Haybittle
rule, very little of the 5% has been spent early on, so the final p-value can still be compared
against 0.05.
[Figure 7.9 plots the hazard ratio for CV death, with confidence limits, at each of several analysis dates: 27 Mar 00, 27 Jul 00, 1 Mar 01, 9 Aug 01, 22 Feb 02, 1 Aug 02 and the final analysis.]
Figure 7.9 Several interim analyses in a trial comparing Candesartan with placebo, and the effect on cardiovascular (CV) death.5 Reproduced with kind permission from the American Heart Journal.
tumour’ in Table 7.5; an apparently large effect, but based on few events.)
Third, treatment effects could be greater earlier on. In Figure 7.9,5 the March 2000 analysis showed a large effect size with a p-value of 0.0006. However, at the end of the trial, the effect was smaller and only close to statistical significance (p = 0.055). Large early treatment effects could be ‘too good to be true’.
All the relevant evidence needs to be considered before stopping early,
not just statistical stopping rules and p-values. Other considerations include
the success of recruitment, any safety issues, whether there is sufficient evidence to change practice or if a clinically important effect is highly unlikely
if the trial continued. When a trial is stopped too early, the data might not
be reliable enough to persuade the regulatory authority to grant a marketing
licence for a new therapy, or an effective treatment may not be found because
the early analysis suggested futility.
7.13 Clinical versus statistical significance: more on
interpreting results
P-values are often used to drive the interpretation of clinical trial data, and too much emphasis is placed on whether they cross the conventional cut-off of 0.05, i.e. statistical significance. P-values only provide an indication of whether the observed effect size could be due to chance, so they should be used as a guide
to interpreting data. There should be more focus on interpreting the observed
effect size and CIs, i.e. clinical significance.
Consider hypothetical results of trials evaluating four new diets (Box 7.11).
When trials are large, precise estimates of the effect size are obtained, so it is clear whether the new intervention is likely to be clinically worthwhile (Diet A in Box 7.11) or not (Diet D). A statistically significant result could be found in large studies when the effect is small and clinically unimportant (Diet B).

Box 7.11 Hypothetical clinical trials of four new diets for weight loss. Effect size is the mean difference in weight (intervention arm minus control arm)

Diet A (clinical significance: yes; statistical significance: yes)
N = 1,000; mean difference −7.0 kg; 95% CI −7.6 to −6.4 kg; p-value <0.0001
Big study, big effect

Diet B (clinical significance: no; statistical significance: yes)
N = 2,000; mean difference −0.5 kg; 95% CI −0.9 to −0.1 kg; p-value = 0.025
Big study, small effect

Diet C (clinical significance: yes; statistical significance: no)
N = 36; mean difference −3.0 kg; 95% CI −6.3 to +0.3 kg; p-value = 0.07
Study not big enough; probably a real and moderate effect, but insufficient results to draw a reliable conclusion

Diet D (clinical significance: no; statistical significance: no)
N = 400; mean difference −0.2 kg; 95% CI −1.2 to +0.8 kg; p-value = 0.69
Study probably big enough; probably a small effect

‘N’ is the total number of subjects in a two-arm trial
Perhaps the most difficult results to interpret occur when the p-value is just
above 0.05, but the effect size looks large (Diet C). Although the CI includes
the no-effect value, most of the range is below zero, so the data must be interpreted carefully. It should not be concluded that ‘there is no effect’, because the true
mean difference could be as large as 6 kg. It is better to say ‘there is some
evidence of an effect, but the result has just missed statistical significance’, or
‘there is a suggestion of an effect’. Using language like this does not dismiss
outright what could be a real effect, but it also makes no undue claims about
efficacy.
7.14 Summary
• A summary effect size can be obtained for any comparison of two interventions
• The type of effect size and how it is analysed depends on the type of outcome measure – counting people, taking measurements or time-to-event data:
  ◦ Counting people: risk difference, relative risk, odds ratio
  ◦ Taking measurements on people: difference between two means or medians
  ◦ Time-to-event: hazard ratio, difference between two survival or event rates
• Confidence intervals and p-values must be calculated for any effect size, to fully interpret the data
• Design considerations help when interpreting results: two-arm, crossover and factorial trials; repeated measures
• Sub-group analyses should be justified and specified at the start of the trial, and interpreted carefully
• Large trials, with many events, should produce the clearest results and conclusions.
References
1. Govaert TME, Thijs CTMCN, Masurel N et al. The efficacy of influenza vaccination in
elderly individuals. JAMA 1994; 272(21): 1661–1665.
2. Foster GD, Wyatt HR, Hill JO et al. A randomized trial of a low carbohydrate diet for
obesity. N Engl J Med 2003; 348:2082–2090.
3. Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow-up measurements. BMJ 2001; 323:1123–1124.
4. Romond EH, Perez EA, Bryant J et al. Trastuzumab plus adjuvant chemotherapy for operable HER2-positive breast cancer. N Engl J Med 2005; 353:1673–1684.
5. Pocock S, Wang D, Wilhelmsen L, Hennekens C. The data monitoring experience in
the Candesartan in Heart Failure Assessment of Reduction in Mortality and morbidity
(CHARM) program. Am Heart J 2005; 149:939–943.
6. MRC Vitamin Study Research Group. Prevention of neural tube defects: results of the
MRC vitamin study. The Lancet 1991; 338:132–137.
7. Lovell K, Cox D, Haddock G et al. Telephone administered cognitive behaviour therapy
for treatment of obsessive compulsive disorder: randomised controlled non-inferiority
trial. BMJ 2006; 333:883–887.
8. Kerry SM, Bland JM. Analysis of a trial randomised in clusters. BMJ 1998; 316:54.
9. Bland JM, Kerry SM. The intracluster correlation coefficient in cluster randomisation.
BMJ 1998; 316:1455–1460.
10. Cuzick J. Forest plots and the interpretation of subgroups. The Lancet 2005; 365:1308.
11. Collins R, MacMahon S. Reliable assessment of the effects of treatment on mortality and
major morbidity, I: clinical trials. The Lancet 2001; 357:373–380.
12. Silverstein FE, Faich G, Goldstein JL et al. Gastrointestinal toxicity with Celecoxib vs
nonsteroidal anti-inflammatory drugs for osteoarthritis and rheumatoid arthritis. The
CLASS Study: a randomised controlled trial. JAMA 2000; 284:1247–1255.
13. O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979;
35:549–556.
14. Pocock SJ. When to stop a clinical trial. BMJ 1992; 305:235–240.
Appendix: Further introduction to p-values
Suppose a coin is to be thrown 10 times. If a Heads appears after one throw,
there is no reason to think there is anything unusual about the coin. If two
Heads are seen in a row, this is also not surprising. If, however, five Heads are
seen in a row, suspicions are aroused, and after 10 Heads, there is a readiness
to believe that something is wrong with the coin. But on what evidence are
the suspicions based? If the coin were fair, the chance of getting Heads (or
Tails) is 0.5. Therefore, among 10 throws of the coin about five Heads and five
Tails are expected. What we are doing mentally after each successive result
is considering whether what is seen is consistent with the assumption that the
coin is fair. We might never be able to determine if the coin is fair or not with
complete certainty. However, it is the assumption of fairness (i.e. probability
of 0.5 of seeing Heads) that we use to judge the coin as it is thrown.
The probability of throwing five Heads in a row if the coin were fair is 0.03 (0.5⁵), i.e. 3 in 100. This means that if there were five throws of the coin, and this was repeated 100 times, five Heads in a row is expected to occur in three out of the 100 sets, just by chance. Similarly, the probability of seeing 10 Heads in a row due to chance is 0.001 (0.5¹⁰), i.e. in 1,000 sets each consisting of 10 throws, 10 consecutive Heads could be seen in one set. So it is not impossible to get 10 Heads in a row with a fair coin – it is just very unlikely. The
table below shows the probability of getting various combinations of Heads
and Tails in 10 throws of the coin. Each number in the third column is the
p-value associated with the particular result of the coin thrown 10 times.
Number of Heads   Number of Tails   Probability of this occurring if the coin were fair*
0                 10                0.0010
1                 9                 0.0097
2                 8                 0.0440
3                 7                 0.1172
4                 6                 0.2051
5                 5                 0.2460
6                 4                 0.2051
7                 3                 0.1172
8                 2                 0.0440
9                 1                 0.0097
10                0                 0.0010
Total                               1.0000

* i.e. the chance of a Heads is 0.5
Suppose there were one Heads and nine Tails. We would not necessarily
be interested only in this particular combination but also one that is more
extreme, i.e. 0 Heads and 10 Tails. The probability of this is 0.0097 + 0.001 =
0.0107. This is referred to as a one-tailed p-value. However, getting one Heads
and nine Tails is as suspicious as nine Heads and one Tails. What is needed is a
p-value for the one vs nine or more extreme, and in either direction. The probability of this is 0.0097 + 0.001 + 0.0097 + 0.001 = 0.021. This is a two-tailed
p-value.
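These probabilities follow directly from the binomial distribution; the short Python sketch below reproduces the one- and two-tailed p-values just described:

    from math import comb

    def prob(k, n=10, p=0.5):
        # Probability of exactly k Heads in n throws of a fair coin
        return comb(n, k) * p**k * (1 - p)**(n - k)

    one_tailed = prob(1) + prob(0)                   # 1 Heads, or more extreme (0 Heads)
    two_tailed = one_tailed + prob(9) + prob(10)     # ...in either direction
    print(f"{one_tailed:.4f}  {two_tailed:.3f}")     # 0.0107 and 0.021, as in the text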
The same principles apply to interpreting clinical trial results. In the flu vaccine trial (Box 7.1), the risk difference is −4.4%. To judge whether an effect at
least as large as this could be due to chance, the calculation of the p-value
assumes that the true risk difference is zero (i.e. there is no effect). The p-value
of <0.001 is associated with an effect as large as 4.4% or more extreme in either
direction (i.e. ≤ −4.4% or ≥ +4.4%), which allows for the vaccine to be better
or worse than placebo. Again, it is not impossible for a trial to produce a treatment effect this large if there really were no effect, but the p-value here tells
us that this is extremely unlikely.
                     Observed result             We assume                                     P-value
Coin                 One Heads vs nine Tails     Probability of Heads is 0.5, and the          0.021
                                                 observed result could be more extreme
                                                 in either direction
Flu vaccine trial    Risk difference −4.4%       No effect (true risk difference = 0), and     <0.001
                                                 the observed result could be more extreme
                                                 in either direction
CHAPTER 8
Systematic reviews and meta-analyses
Previous chapters presented key features of the design, analysis and interpretation of a single clinical trial. When there are several similar trials on the same
subject, it is possible to review the accumulation of evidence, and provide a
clearer view on the effectiveness of a particular intervention.
Large trials usually provide robust results, allowing unambiguous conclusions to be made. In small trials, it can be difficult to detect a treatment effect,
if one exists, and statistical significance is often not achieved (the p-value
is ≥0.05). This means that a real effect could be missed, and there is uncertainty over whether the observed result is real or due to chance.
The limitations of small trials could be largely overcome by combining them
in a single analysis. This is the main purpose of a systematic review and meta-analysis. Systematic reviews are different from review articles, which may be
presented as narratives based on selected papers, and may therefore reflect
the personal professional interests of the author: there could be a bias towards
the positive (or negative) studies. Such reviews tend to describe the features
of each paper without trying to combine the results. The assessment of several
trials together needs to be done in a systematic and unbiased way.
8.1 The need for systematic reviews of clinical trials
Systematic reviews tend to be conducted on randomised phase II or III trials,
rather than single-arm trials, and there are three broad functions:
• To confirm existing practice but provide a more precise estimate of the treatment effect. By considering several trials together, the results are combined to give a single estimate of the effect size. The standard error of the pooled effect size is usually smaller than for any individual trial. As a consequence, the 95% confidence interval will be the narrowest, i.e. the effect size will have greater precision, and the result is more likely to be statistically significant. By having a larger number of subjects in the analysis, it is also possible to detect smaller treatment effects than those normally found in individual trials. Also, sub-group analyses are based on more patients than in any individual trial, and so have greater statistical power, though spurious effects could still be found by chance if many sub-groups are examined (see page 119).
• To change existing practice. Some systematic reviews have changed health practice. Occasionally, they have led to a new intervention being adopted into practice, but usually they have resulted in an existing treatment becoming more commonly used. Examples are tamoxifen and breast cancer, streptokinase and acute myocardial infarction, and aspirin and stroke. Such reviews are often used to develop national guidelines for defining standard practice.
• To determine whether new trials are needed. When there is uncertainty over the effectiveness of an intervention, a systematic review of the literature can help decide whether a new trial is needed. In this situation, there are usually only a few small published trials. The purpose is to determine whether these trials taken together would provide sufficient evidence of the treatment effect, because if they do not, having a large new trial is justified.
8.2 What is a systematic review?
Systematic reviews apply a formal methodological approach to obtaining,
analysing and interpreting all the available reports on a particular topic. In
an era of evidence-based health, where health professionals are encouraged
to identify sources of evidence for their work, and to keep abreast of new
developments, systematic reviews are valuable summaries of the evidence. A
review is a research project in its own right and, depending on the number of
published reports to be considered, can be a lengthy undertaking. The review
is only as good as the studies on which it is based. If an area has been investigated mainly using small, poorly designed trials, a review of these may not
be a substitute for a single large well-designed trial.
The systematic review process is given in Box 8.1. The summary data (i.e.
effect sizes) can be extracted from the published papers. Alternatively, the raw
data is requested from the authors, called an individual patient data (IPD)
meta-analysis, and once it is sent to a central depository, there is essentially
a single large data set with a variable that identifies each original trial. Such
analyses can produce a more precise estimate of the combined effect size than
using summary data, and more reliable sub-group analyses.
Systematic reviews can take from a few weeks up to two or more years,
depending on how many trials there are and the type of meta-analysis. Those
based on IPD can be lengthy and require dedicated resources because the raw
data needs to be collected, collated and checked before conducting the statistical analyses and writing the report.
8.3 Sources of published systematic reviews
The Cochrane Collaboration is a well-known collection of systematic reviews.
It covers a wide range of clinical disciplines, and the reviews are available on
the Internet. They are limited to clinical trials of detection, prevention or treatment. There are about 40 established Collaborative Review Groups, within
Box 8.1 Stages of a systematic review
1. Define the research question, and identify the appropriate outcome
measures
2. Specify a list of criteria for including and excluding studies
3. Undertake a literature search (using medical databases, for example,
PubMed, Medline and Embase) and after reading the abstracts identify
articles that might be appropriate
4. Obtain the full papers identified from the literature search. The reference
lists of these papers could be used to identify additional papers not found in
the electronic search
5. Critically appraise each report and extract specific relevant information.
Clearly defined outcome measures are essential
6. Perform a meta-analysis which involves combining the quantitative
results from the individual studies into a single estimate
7. Interpret and summarise the findings.
which systematic reviews are prepared to a similar standard, and sometimes
updated regularly:
• Cochrane Collaboration (http://www.cochrane.org)
• The Cochrane Library (http://www3.interscience.wiley.com/cgi-bin/mrwhome/106568753/HOME)
Systematic reviews and meta-analyses can also be found in electronic
databases of clinical and psychology journals:
• Medline (http://medline.cos.com/)
• Embase (http://www.embase.com/)
• PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi)
These databases contain abstracts of most scientific published articles, and
they have keyword search facilities. Systematic review articles should be categorised as such, but using keywords such as ‘systematic review’ or ‘meta
analysis’ in a search should ensure that all these articles are retrieved.
National governmental organisations, which often fund external systematic
reviews or conduct them internally, may also list completed reviews on their
website. In the UK, for example these include:
• The UK Health Technology Assessment (http://www.ncchta.org/)
• National Institute for Health and Clinical Excellence, NICE (http://www.nice.org.uk/)
• The Centre for Reviews and Dissemination (CRD) in York (http://www.york.ac.uk/inst/crd/)
8.4 Interpreting systematic reviews
Systematic reviews of clinical trials usually focus on those that compare two or
more groups of people, so that an effect size is available, for example, relative
risk, risk difference, hazard ratio, or mean difference. It is worth clarifying the
following points:
• What is the aim of the review? This is often similar to the main objective of a single trial.
• How was the review conducted?
• What are the main outcome measures?
• What are the main results? The pooled effect size and corresponding confidence interval and p-value can be interpreted as described in Chapter 7.
Meta-analysis
The main stage of a systematic review is combining the effect sizes into a
single estimate using a statistical technique called meta-analysis. If a simple
average of the effect size were taken, small and large studies would have the
same influence in the analysis, but there needs to be some way of taking into
account that one trial may be based on 100 people and another on 1,000 (this
is described below).
Figure 8.1 is a typical meta-analysis plot (a forest plot), associated with a
review of nicotine replacement therapy (NRT). It shows the individual results
from 13 randomised trials of self-referred smokers, in which subjects were
randomised to receive either 2 mg nicotine chewing gum or control (such as
placebo gum). The main outcome measure is the proportion of smokers who
had stopped smoking one year after starting treatment, and the effect size is
the relative risk (the ratio of these proportions).

Figure 8.1 Example of a forest plot from a meta-analysis; randomised trials evaluating nicotine replacement therapy (2 mg nicotine chewing gum) and the effect on smoking cessation rates.1 The figure was obtained using RevMan.3 RR: relative risk; CI: confidence interval; n: number of events, i.e. number of people who quit smoking; N: number randomised in each trial arm. The no-effect value is 1.0. If the 95% CI excludes one, the result is statistically significant. The data shown are:

Review: The effect of nicotine replacement therapy on smoking cessation
Comparison: 2 mg nicotine chewing gum versus control
Outcome: The proportion of smokers who had stopped smoking at one year

Study             2 mg gum n/N   Control n/N   Weight %   RR (fixed) [95% CI]
Areechon 1988     56/99          37/101        12.34      1.54 [1.13, 2.10]
Clavel 1985       24/205         6/222         1.94       4.33 [1.81, 10.38]
Fagerstrom 1982   30/50          22/50         7.41       1.36 [0.93, 2.01]
Fee 1982          23/180         15/172        5.17       1.47 [0.79, 2.71]
Hall 1987         30/71          14/68         4.82       2.05 [1.20, 3.52]
Hjalmarson 1984   29/106         16/100        5.55       1.71 [0.99, 2.95]
Hughes 1990       8/20           7/39          1.60       2.23 [0.94, 5.26]
Jarvik 1984       7/25           4/23          1.40       1.61 [0.54, 4.79]
Jarvis 1982       27/58          12/58         4.04       2.25 [1.27, 4.00]
Killen 1984       11/22          6/20          2.12       1.67 [0.76, 3.67]
Killen 1990       127/600        106/618       35.17      1.23 [0.98, 1.56]
Malcolm 1980      17/73          5/63          1.81       2.93 [1.15, 7.50]
Pirie 1992        75/206         50/211        16.64      1.54 [1.14, 2.08]
Total (95% CI)    1715           1745          100.00     1.57 [1.39, 1.78]

Total events: 464 (2 mg nicotine gum), 300 (Control)
Test for heterogeneity: Chi² = 14.83, df = 12 (P = 0.25), I² = 19.1%
Test for overall effect: Z = 7.11 (P < 0.00001)
(Plot scale: 0.1 to 10; values left of 1 favour control, right of 1 favour 2 mg gum)

The studies are listed in alphabetical order according to the first author, but they could also be ordered by
year of publication or magnitude of the effect size. Forest plots can be derived
for any type of effect size.
An important observation is that all trials have a relative risk greater than
the no-effect value: the proportion of smokers who quit was always higher in
the NRT group. If there really were no association between NRT and quit rate,
some trials should have a relative risk below one. Although several trials had
statistically significant results (e.g. Areechon 1988 and Clavel 1985), others did
not. Even the largest trial by Killen (1990) just missed statistical significance;
the lower confidence limit was 0.98. Because of this, a meta-analysis of all the
results seems appropriate in order to provide a clearer conclusion on the effect
of NRT.
Large trials usually have small standard errors, which produce estimates of the true effect size that are more precise than those from trials with large
standard errors. The weight given to each trial is calculated from the standard
error of the relative risk, on a log scale (Box 8.2). In Figure 8.1, each weight is
expressed as a percentage of the sum of all the weights across trials, allowing
a comparison of the relative contribution that each trial makes to the analysis.
Box 8.2 ‘Weight’ of a trial in a meta-analysis

Weight is a measure of the relative importance of an individual trial in a review:

    Weight = 1/(standard error)²

LARGE trial → SMALL standard error → LARGE weight
SMALL trial → LARGE standard error → SMALL weight
Trials with small standard errors have narrow confidence intervals and
therefore larger weights. For example, the large trial by Killen (1990) in
Figure 8.1 has 35.17% of all the relative weights. Trials with large standard
errors have wider confidence intervals (e.g. Jarvik 1984) and a smaller weight
(1.4%). It should be noted that the trial by Clavel (1985) was the second largest,
but the standard error was large because the number of events (i.e. smokers
who quit) was small. Both the size of the trial and the number of events influence the magnitude of the standard error. Forest plots usually make the size of the
central square for each trial proportional to the weight, making trials with
small standard errors more prominent to the eye.
The statistical techniques used in a meta-analysis allow for the weight of
each trial when combining the effect sizes (the simplest method is shown in
Box 8.3). The combined estimate of the relative risk of NRT compared with
control is shown as the large diamond in the row labelled ‘Total’ in Figure
8.1. It is 1.57, with a 95% CI of 1.39 to 1.78 (the ends of the diamond). Smokers given 2 mg nicotine chewing gum were 57% more likely to quit at one
year than smokers in the control group, and the true excess risk is likely to lie
between 39% and 78%. This range is narrower than any trial on its own. The
p-value associated with the combined estimate is very small, p<0.00001 (see
‘Test for overall effect’), which is highly statistically significant: the observed
effect is unlikely to be due to chance.
Box 8.3 Estimating the combined effect size (fixed effects model)

    Combined effect size = sum of (effect size × weight for each trial) / sum of all the weights

The effect size could be a mean difference, absolute risk difference, relative risk or hazard ratio (the latter two are used on a logₑ scale, and the result is anti-logged).
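As a minimal Python sketch of this fixed effects calculation, the fragment below applies the weights of Box 8.2 and the pooling of Box 8.3, on the log relative risk scale, to the 13 NRT trials in Figure 8.1. Figure 8.1 itself was produced with RevMan, which uses Mantel–Haenszel weighting for relative risks, so this inverse-variance version agrees only approximately with the 1.57 (1.39 to 1.78) shown there.

    import math

    # (quitters, randomised) per arm, taken from Figure 8.1
    gum  = [(56, 99), (24, 205), (30, 50), (23, 180), (30, 71), (29, 106), (8, 20),
            (7, 25), (27, 58), (11, 22), (127, 600), (17, 73), (75, 206)]
    ctrl = [(37, 101), (6, 222), (22, 50), (15, 172), (14, 68), (16, 100), (7, 39),
            (4, 23), (12, 58), (6, 20), (106, 618), (5, 63), (50, 211)]

    log_rrs, weights = [], []
    for (a, n1), (c, n2) in zip(gum, ctrl):
        log_rrs.append(math.log((a / n1) / (c / n2)))
        se = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)    # SE of log(RR)
        weights.append(1 / se**2)                  # weight, as in Box 8.2

    pooled = sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights)  # Box 8.3
    se_pooled = math.sqrt(1 / sum(weights))
    lo, hi = (math.exp(pooled + k * 1.96 * se_pooled) for k in (-1, 1))
    print(f"Pooled RR {math.exp(pooled):.2f} (95% CI {lo:.2f} to {hi:.2f})")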
Heterogeneity
No two trials are identical in design and conduct, so it is necessary to consider whether the observed effect sizes materially differ from each other, i.e.
whether there is heterogeneity, and if it is appropriate to combine the results
into a single estimate. Figure 8.2 illustrates this using four hypothetical studies. Studies 1 to 3 appear similar (no heterogeneity), but Study 4 clearly looks
different from the other three (evidence of heterogeneity). Statistical tests can
determine whether significant heterogeneity is present. When it is, statistical
methods can combine the effect sizes to allow for it.2
In Figure 8.1, a test for heterogeneity (shown in the bottom left-hand corner) produced a p-value of 0.25, suggesting that the relative risk estimates do
not differ substantially from each other. Here, an appropriate method to combine the results is a ‘fixed effects model’, indicated by the word ‘fixed’ at the
top right-hand side. If there is significant heterogeneity (i.e. the p-value for
the test is <0.05), a ‘random effects model’ may be more appropriate, because
this method takes into account variability between studies. The word ‘fixed’
would be replaced with ‘random’. The two methods tend to produce similar
effect sizes and CIs when there is little or no heterogeneity. When there is significant heterogeneity, the pooled effect sizes are usually similar but the ‘random effects’ approach produces wider confidence intervals to allow for the
between-trial variability; there is more uncertainty over the size of the true
effect given the greater variability. In the example, the combined relative risks
and 95% CIs are 1.57 (1.39 to 1.78) and 1.61 (1.38 to 1.86) using the fixed and
random models respectively.
[Figure 8.2 plots the relative risk (x-axis, roughly 1.2 to 1.8) and confidence interval for each of four hypothetical studies.]
Figure 8.2 Illustration of heterogeneity among four hypothetical trials. The results from Trials 1 to 3 are similar, but the result from Trial 4 is clearly different.
The standard test for heterogeneity is not very powerful in detecting differences between trials when there are few trials. The I² value is a more sensitive way of examining heterogeneity: 0% indicates no heterogeneity and 100% a high degree of heterogeneity.4 In Figure 8.1, I² is 19.1%, which is low. On this occasion, the conclusion based on I² is consistent with the test for heterogeneity.
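Continuing the earlier sketch, Cochran's Q statistic and I² can be computed from the same weights and log relative risks:

    # Requires log_rrs, weights and pooled from the earlier fragment
    Q = sum(w * (x - pooled)**2 for w, x in zip(weights, log_rrs))
    df = len(log_rrs) - 1
    I2 = max(0.0, (Q - df) / Q) * 100
    print(f"Q = {Q:.1f} on {df} df; I-squared = {I2:.0f}%")
    # close to the Chi-squared of 14.83 and I-squared of 19.1% reported in Figure 8.1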
It is useful to investigate heterogeneity, because the overall effect may not
be meaningful if the effect sizes clearly differ between trials. When there is
significant heterogeneity that can be explained by a certain factor (e.g. trials
conducted on younger people have a different effect size than those among
older people), the effect size in each sub-group of the factor might be a more
appropriate estimate of treatment effect than the overall estimate.
8.5 Considerations when reading a systematic review
There are several aspects of any review to consider when deciding whether it
provides good evidence for or against a new intervention.
Differences in disease definition, the interventions
and outcome measures
Trials are conducted in different ways using a variety of methods, and therefore the definition of the disorder, the outcome measures, and the delivery of
the intervention may all vary between trials. In the trials in Figure 8.1, the control group consisted largely of smokers who had placebo gum, but sometimes
the control group were those who were offered standard smoking cessation
therapy, usually by counselling. It is not always possible to easily combine trials in a systematic review, particularly if the designs are very different. However, different trials that produce similar results may provide some evidence
that a new intervention is effective, because the difference in methodology
should increase variability, making it more difficult to find a treatment effect.
It is also useful to determine whether the chosen endpoints are appropriate
for addressing the objective of the review, when deciding whether a particular trial should be included.
Identifying studies
Systematic review reports should provide sufficient information on how studies were identified, by specifying the search criteria employed. This includes
the range of years in which articles were published, whether foreign language articles were excluded, and which databases were used (e.g. Medline
and Embase). More specifically, appropriate keywords should be used when
searching the databases. In a review of a cancer treatment, it is insufficient to
search using only the word ‘cancer’, because some abstracts use ‘tumour’ or
‘carcinoma’. Different spellings should also be considered, for example ‘randomised’ and ‘randomized’, and some reports refer to patients who are ‘randomly allocated’ rather than ‘randomised’. This can be partly overcome by
using wildcards, i.e. the search term would be ‘random*’, where the asterisk
allows for any letters after ‘random’. If many studies are missed, the review
may not be representative, and the results could be biased.
Publication bias
Trials with negative results (those contrary to what is expected, or those
reporting no evidence of an effect) are sometimes less likely to be published
than those that do show an effect, either because the research is not submitted for publication, or because journals reject them. In this situation, the
pooled effect size from the meta-analysis will be biased towards the positive
studies, and be larger than the true value. There are statistical methods that
can detect significant publication bias. A simple method is a funnel plot,
which plots the effect size against the weight (or 1/standard error, or standard error), and if the spread of the observations is clearly asymmetric, this is
evidence of possible publication bias.5
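A funnel plot for the same 13 NRT trials can be sketched in a few lines; this fragment reuses log_rrs, weights and pooled from the earlier meta-analysis sketch, and assumes matplotlib is available:

    import matplotlib.pyplot as plt

    ses = [1 / math.sqrt(w) for w in weights]          # standard error of log(RR) per trial
    plt.scatter([math.exp(x) for x in log_rrs], ses)
    plt.gca().invert_yaxis()                           # most precise (largest) trials at the top
    plt.xscale('log')
    plt.axvline(math.exp(pooled), linestyle='--')      # pooled effect size
    plt.xlabel('Relative risk (log scale)')
    plt.ylabel('Standard error of log RR')
    plt.show()

A clearly asymmetric spread of the smaller trials about the dashed line would suggest possible publication bias.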
Study quality
After the articles for a review have been identified, study quality may be
assessed, and those judged to be inferior excluded from the meta-analysis.
Exclusion could be based on an assessment of the study design, conduct or
analysis, with consideration of potential bias or confounding. Even if the criteria for exclusion are clearly defined, this is a subjective exercise that could
produce a biased selection of studies to be used in the analysis. If a particular trial is affected by bias or confounding, consideration should be given to
whether the effect is likely to be so large that it clearly distorts the results.
When assessing the effect of study quality it is perhaps best to include all
trials, and then perform the analysis after excluding those considered ‘poor
quality’. Results can be compared to see how consistent they are.
Reporting systematic reviews
It is important that the way a systematic review was conducted is reported clearly, so that health professionals can judge the reliability of the results and
conclusions. A report based on a systematic review should include the following items, though a more detailed set of guidelines can be found in Moher
et al:6
• The main objective
• The search strategy, including the search terms and electronic databases used, as well as other sources of clinical trials
• How full articles were selected for inclusion in the meta-analysis, including specification of the target population, the disorder, the trial endpoints, and the interventions
• The total number of abstracts found during the electronic search, how many full articles were examined, how many were used in the meta-analysis, how many were excluded, and the reasons for their exclusion
• A table summarising the main characteristics of each trial used in the meta-analysis, such as geographical location, time period when the trial was conducted, sample size, subject population (e.g. age range and gender distribution), the interventions and the effect size
• Method of statistical analysis (fixed or random effects model), the effect size used, and any investigation of heterogeneity, such as a formal statistical test or I² value
• Interpretation of the results, and their implication for clinical practice.
8.6 Why systematic reviews are important
An example of how meta-analysis could have affected medical practice sooner
than it did is given in Figure 8.3. The left side of the figure shows the
individual odds ratio of dying (similar interpretation to relative risk) for 33
randomised trials comparing intravenous streptokinase with a placebo or
no therapy in patients who had been hospitalised for acute myocardial
infarction.7 Of the trials, 25 suggested a beneficial effect of streptokinase, but
only six had a statistically significant result. The combined treatment effect
showed that the risk of dying was reduced by about 25%, which was highly
statistically significant. Of greater importance is the figure on the right-hand
side. This is a cumulative meta-analysis. Each observation represents the
pooled treatment effect of all the published trials up to that point in time.
For example, the dot at ‘European 2’ is a meta-analysis of this trial and the
three preceding ones. This figure shows that if a meta-analysis had been performed in the mid 1970s, a clear effect on mortality would have been observed.
However, intravenous streptokinase was only recommended in the 1990s. The
work on streptokinase took place before systematic reviews were common.
Had such a review been conducted in the 1970s, streptokinase could have been shown to be life-saving almost 20 years earlier, long before its actual adoption
into clinical practice.
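A cumulative meta-analysis such as the right-hand panel of Figure 8.3 can be mimicked with the same machinery as the earlier sketch: order the trials by publication date, then pool the first k trials for k = 1, 2, and so on. A sketch (reusing log_rrs and weights from the earlier fragment, which would first need to be sorted into date order):

    def cumulative_pooled(log_rrs, weights):
        # Pooled RR after each successive trial is added (fixed effects model)
        out = []
        for k in range(1, len(log_rrs) + 1):
            w, x = weights[:k], log_rrs[:k]
            out.append(math.exp(sum(wi * xi for wi, xi in zip(w, x)) / sum(w)))
        return out

    print(cumulative_pooled(log_rrs, weights))  # one pooled estimate per point in time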
Figure 8.3 Meta-analysis of trials of streptokinase (reproduced from Mulrow 1994).7 Reproduced
with kind permission from the BMJ Publishing Group.
8.7 Key points
• Systematic reviews are based on a formal approach to obtaining, analysing and interpreting all the available studies on a particular topic
• A meta-analysis combines all relevant studies to give a single estimate of the effect size, which has greater precision than any individual trial
• The conclusions from a review are usually stronger than those from any single study.
References
1. Tang JL, Law M, Wald N. How effective is nicotine replacement therapy in helping people
to stop smoking? BMJ 1994; 308:21–26.
2. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials 1986;
7:177–188.
3. RevMan Analyses (Computer program). Version 1.0 for Windows. In: Review Manager
(RevMan) 4.2. Copenhagen: The Nordic Cochrane Centre, The Cochrane Collaboration,
2003. http://www.cc-ims.net/RevMan
4. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003; 327:557–560.
5. Sterne JAC, Egger M, Davey Smith G. Systematic reviews in health care: Investigating and
dealing with publication and other biases in meta-analysis. BMJ 2001; 323:101–105.
6. Moher D, Cook DJ, Eastwood S et al. for the QUORUM Group. Improving the quality of
reports of meta-analyses of randomized controlled trials: the QUORUM statement. The
Lancet 1999; 354:1896–1900.
7. Mulrow CD. Systematic Reviews: Rationale for systematic reviews. BMJ 1994; 309:597–
599.
Further reading on systematic reviews and meta-analysis
Published articles
Davey Smith G, Egger M, Phillips AN. Meta-analysis: beyond the grand mean? BMJ 1997;
315:1610–1614.
Egger M, Davey Smith G. Meta-analysis: potentials and promise. BMJ 1997; 315:1371–1374.
Egger M, Davey Smith G. Meta-analysis: bias in location and selection of studies. BMJ 1998;
316:61–66.
Egger M, Davey Smith G, Phillips AN. Meta-analysis: principles and procedures. BMJ 1997;
315:1533–1537.
Egger M, Davey Smith G, Schneider M, Minder CE. Bias in meta-analysis detected by a
simple graphical test. BMJ 1997; 315: 629–634.
Egger M, Schneider M, Davey Smith G. Meta-analysis: spurious precision? Meta-analysis of
observational studies. BMJ 1998; 316:140–144.
Glasziou P, Irwig L, Bain C, Colditz G. Systematic Reviews in Health Care: A Practical Guide.
Cambridge University Press, 2001.
Books
Egger M, Davey Smith G, Altman D. (Eds). Systematic Reviews in Health Care: Meta-analysis in
Context, 2nd edn. BMJ Books, 2001.
Khan KS, Kunz R, Kleijnen J, Antes G. Systematic Reviews to Support Evidence-based Medicine:
How to Review and Apply Findings of Healthcare Research. Royal Society of Medicine Press
Ltd, 2003.
CHAPTER 9
Health-related quality of life and health
economic evaluation
Previous chapters focus on clinical endpoints associated with efficacy and
safety. However, it is also possible to examine new interventions from the
subject’s perspective, and in relation to financial costs. This chapter presents
these two useful features of clinical trials. A trial team might have one or more
members with this type of expertise.
9.1 Health-related quality of life
Common trial endpoints have a clear clinical impact, such as the occurrence or
recurrence of disease, death, side-effects and changes in biological, biochemical or physiological characteristics. While these are usually taken to be the
primary trial endpoints, it is sometimes useful to examine the effect of a new
intervention from the subject’s own experience, referred to as health-related
quality of life (QoL). Indeed, some equivalence or non-inferiority trials aim to
show that a new intervention may have a similar clinical effect on the disorder
of interest, but QoL is improved. QoL could, therefore, be one of the main endpoints. Most QoL measures are obtained through questionnaires, completed
by the trial subject, guardian or relative, or during an interview with a health
professional.
There is no fixed definition of QoL, but it aims to provide a quantitative measure of some or all of the following:
• Pain
• Physical functioning
• Mental and emotional functioning
• Social functioning
• Feeling of well-being.
A new intervention with more side-effects may extend a patient’s life by three months, but this could be balanced against the lower quality of life associated with the side-effects.
Elements of QoL and toxicity (or safety) often overlap. For example, pain
level is specifically recorded in many treatment trials in advanced disease,
but it may also be sought in QoL questionnaires. Perhaps the main difference
between some QoL measures and toxicity is that QoL is based on self-reported
responses by the subject, and this is done in relation to several other factors,
while toxicity is usually diagnosed by or with a clinician. There may not necessarily be a high correlation between QoL and toxicity.
Measuring QoL
There are many QoL questionnaires, sometimes referred to as QoL instruments or measures. Some have been developed for use in the general population, while others are intended for people who have a specific disorder. There
is no perfect measure, and it is possible that when an instrument is used it will
miss some important aspect of the subject’s experience.
QoL responses are based on an individual’s perceived experiences. These
perceptions will vary between people, and also over time within the same
person. It is not unusual for QoL scores from different instruments to be poorly correlated, even when completed by the same subject.
In choosing a QoL instrument for a trial, it is necessary to determine
whether it will measure what subjects would consider important in that trial,
and whether it is sensitive enough to detect meaningful changes in QoL. If
many subjects report a very low (or very high) QoL score at baseline, there is
not much scope to get a lower (or higher) score after treatment – called a floor
(or ceiling) effect.
A validated QoL instrument is one that has been assessed and judged to
measure what it is supposed to measure. The following questions are typical
in making this judgement:
• Are the self-reported scores highly correlated with relevant objective or clinical outcomes? For example, if patients report high pain scores, do they also request, or use, more pain relief medication?
• Do the scores from a QoL instrument correlate with scores from another, perhaps well-established, instrument, both of which aim to measure similar aspects of QoL?
Reliability can be assessed by judging whether a QoL instrument will produce similar scores when repeated in similar groups of people.
Box 9.1 shows some QoL instruments used in research studies. Subjects usually rate their experiences or feelings on a scale. The Short Form 12 or 36 (12
or 36 questions) are commonly used measures.1 For example, one of the SF-12
questions is: ‘During the past four weeks, how much did pain interfere with
your normal work (including both work outside the home and housework)?’
Subjects select only one of the following five responses: ‘All of the time’,
‘Most of the time’, ‘Some of the time’, ‘A little of the time’, or ‘None of the
time’.
These responses can be used to produce QoL scores in several domains. For
example, the SF-12 has domains that include physical functioning, body pain,
mental health and general health.
Box 9.1 Examples of QoL instruments used in clinical trials

General population
• Short Form 12 or 36 (SF-12 or SF-36)
• Nottingham Health Profile
• General Health Questionnaire (GHQ-30)

Disease-specific
• EORTC* QLQ C-30 (all cancer patients) and cancer-specific modules
• Parkinson’s Disease Questionnaire
• St George’s Respiratory Questionnaire (SGRQ)
• Dermatology Life Quality Index (DLQI)
• Stroke and Aphasia Quality of Life Scale (SAQOL-39)

Psychological
• Hamilton Anxiety Scale (HAS)
• Hospital and Anxiety Depression Scale (HADS)
• Psychological General Well-Being Index (PGWBI)
• Mental Health Index

* EORTC: European Organisation for Research and Treatment of Cancer

Other instruments, such as the EuroQol-5D (EQ-5D)2 and the Health Utilities Index Mark 2 or 3 (HUI2 or HUI3),3,4 aim to identify a subject’s health state. These measures were often used with other QoL measures, but they are
now used on their own. They can be used to estimate ‘quality adjusted life
years’ (see page 152). For example, the EQ-5D covers mobility, ability for the
subject to care for themselves, ability to undertake usual activities, level of
pain and discomfort and mental health (anxiety and depression). One of the
EQ-5D questions is:
‘Please indicate which statement best describes your own health state
today’:
Mobility
I have no problems in walking about
I have some problems in walking about
I am confined to bed.
How many QoL instruments and how often?
Several different QoL instruments are used in some trials, often because
results will be compared with those from previously published papers on
the same topic. Because these reports were based on different measures,
researchers believe they need to include most of them. When several questionnaires are planned, it is uncertain whether patients will complete them
all or do so accurately, especially if one or more of the questionnaires contain
many items. There is often overlap in the types of questions asked.
The timing of the measures depends on the natural course of the disorder
of interest, and when detectable changes in QoL are expected to occur. For
example, in a treatment trial of advanced lung cancer where most patients
die within 1–2 years, it might be useful to collect QoL scores from patients
every three months. In a disease prevention trial among healthy subjects, one
QoL measurement annually might be sufficient. Asking subjects to complete
questionnaires frequently has the advantage of allowing a better examination
of how QoL changes over time, but this may not be feasible.
In determining the number and frequency of QoL measures, trial subjects
should not be faced with too many pages to fill in, and not too often. This
is especially important if the trial subject is ill. Either situation could be off-putting and so result in missing data. It may be better to have complete or
near-complete data from one or two well-timed questionnaires than lots of
missing data from several questionnaires, or a situation where the subject
chooses not to submit any responses at all.
Analysing QoL scores
QoL instruments contain several questions. A total score for each subject can
be obtained by simply summing the individual scores. Alternatively, the score
from one question, or a group of questions, is summed separately, to provide
a value for one of several domains. The score for each subject is often transformed onto a scale that ranges from 0 to 1 (or 0 to 100). Detailed instructions
on how to deal with raw scores, and transform them, are usually provided
with the instrument.
Table 9.1 shows three questions from the SF-12. Each response is assigned
a value 1 to 5. The values are ordered categorical data: a score of 4 means the
subject feels better than someone with a score of 2, but it cannot be said that
they feel twice as good.
QoL scores can come under the category ‘taking measurements on people’
(see page 21). When comparing the average scores between two intervention
arms, allowance must be made for the baseline score. When presenting QoL
data, the mean baseline scores (or median if the distribution is skewed) and
standard deviation indicate the scores before treatment starts.
If the mean baseline values are similar, the simplest analysis is to take the difference between the QoL score at one time point, say six months, and at baseline in Treatment Arm A (DA), and to do the same in Treatment Arm B (DB). A relevant and important time point should be chosen, i.e. long enough for the effect of the trial treatment to appear. The effect size is DA minus DB, and corresponding confidence intervals and p-values can be calculated (see page 98).
Table 9.2 illustrates this using QoL endpoints from a placebo-controlled trial
evaluating thalidomide in addition to standard chemotherapy, in treating lung
cancer patients. Thalidomide had no effect on global health status and the
functional scales, but patients on thalidomide suffered less insomnia and more
constipation (both are expected effects of this drug).
An alternative analysis is to compare the proportion of subjects with a high
score (e.g. 4 or 5) between the trial arms, and use methods associated with
‘counting people’ (see page 91). However, information on variability is lost by
turning a continuous measurement into a categorical one.
Table 9.1 QoL responses for a trial subject using three questions from the SF-12.

Each response is scored: All of the time = 1, Most of the time = 2, Some of the time = 3, A little of the time = 4, None of the time = 5.

Q3. During the past four weeks, how much of the time have you had any of the following problems with your work or other regular daily activities as a result of your physical health?
a) Accomplished less than you would like
b) Were limited in the kind of work or other activities
(The subject’s two ticked responses score 1 and 3.)

Q5. During the past four weeks, how much did pain interfere with your normal work (including both work outside the home and housework)?
(The subject’s ticked response is ‘None of the time’, scoring 5.)

Q3 (a) and (b) are combined to give a score for the domain ‘Role physical’:
Raw score = sum of the observed scores = 1 + 3 = 4
Transformed score = [(raw score − lowest possible score)/score range] × 100 = [(4 − 2)/8] × 100 = 25%

Q5 on its own is used to give a score for ‘Bodily pain’:
Raw score = observed score = 5
Transformed score = [(raw score − lowest possible score)/score range] × 100 = [(5 − 1)/4] × 100 = 100%
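The transformation in Table 9.1 is easy to automate; a small Python sketch:

    def transformed_score(raw, lowest, score_range):
        # Rescale a raw SF-12 domain score onto 0-100, as in Table 9.1
        return (raw - lowest) / score_range * 100

    # 'Role physical': two items scored 1-5 each, so raw scores span 2 to 10
    print(transformed_score(1 + 3, lowest=2, score_range=8))   # 25.0
    # 'Bodily pain': one item scored 1-5, so raw scores span 1 to 5
    print(transformed_score(5, lowest=1, score_range=4))       # 100.0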
Table 9.2 Summary data on selected quality of life domains based on the EORTC QLQ-C30,5 in a trial comparing thalidomide with placebo in treating lung cancer patients; results at about six months after randomisation.

                        Mean score at baseline (a)    Mean difference             Effect size
                        (standard deviation)          (six months − baseline)
Domain                  Placebo     Thalidomide       Placebo    Thalidomide      Mean difference (99% CI) (b)   p-value
                        N = 318     N = 331           DiffP      DiffT            DiffT − DiffP

Global health status    50 (26)     49 (24)           +8.1       +9.9             +1.8 (−5.9, +9.6)              0.54

Functional scales
 Physical functioning   62 (27)     61 (25)           +4.2       +5.2             +1.0 (−6.8, +8.9)              0.74
 Role functioning       49 (36)     48 (35)           +10.1      +10.0            −0.1 (−10.2, +10.0)            0.98
 Social functioning     60 (34)     61 (32)           +8.0       +9.5             +1.5 (−8.1, +11.0)             0.70

Symptom scales
 Insomnia               50 (35)     48 (36)           −19.8      −31.7            −11.9 (−22.4, −1.3)            0.004
 Constipation           27 (35)     29 (33)           −7.7       +3.0             +10.7 (+1.0, +20.6)            0.005

Data from personal communication (Dr Siow Ming-Lee, University College London Hospital)
(a) The scores range from 0 to 100 for all endpoints. For the global health and functional scales, 0 indicates poor health and 100 good health. For the symptom scales, 0 indicates no symptoms and 100 a high level of symptoms.
(b) For the global health and functional scales, a positive difference indicates that thalidomide was better and a negative difference indicates that placebo was better. For the symptom scales, a negative difference indicates that thalidomide was better and a positive difference indicates that placebo was better.
QoL and health economic evaluation
147
Repeated assessment of the same QoL measure
When the instrument has been completed on several occasions, for example, when there is a value for nausea at baseline and at 6, 12 and 18 months after randomisation, an analysis could be performed at each time point. However, if this is
done too many times, the presentation of results may appear unwieldy, and it
might be difficult to see what is happening. This analysis ignores the fact that a
subject has contributed several QoL data points during the trial, and that these
are likely to be correlated. Also, as more time points are examined separately,
this increases the chance of finding a spurious effect.
If the QoL values for one measure from each subject were plotted against
time, the result would be a ‘curve’ consisting of connecting straight lines (Figure 9.1). One way of analysing this is to calculate the area under the curve, so
that there is only one data value for each subject. This allows statistical methods for ‘taking measurements on people’ to be used (Chapter 7). There are
also advanced statistical methods that allow a single analysis of all the individual data points from all subjects (repeated measures analysis or mixed
modelling). Analysing the whole data set in this way avoids having to look at
multiple time points.
Figure 9.1 Example of a quality of life score profile for a single trial subject. Measures were taken at baseline, 3, 6, 9 and 12 months. The QoL scores at each of these time points are 80, 75, 72, 68 and 61%. The area under the curve is 70% (the example is in Table 9.4). [Plot of QoL score (0–100) against time since randomisation in months; not reproduced.]

Having several comparisons
There are often several comparisons, each corresponding to a specific domain (Table 9.2 has six comparisons). The more comparisons examined from the same trial, the more likely it is that a spurious effect is found (see page 115). Adjusting p-values for having multiple comparisons might be considered, such as the Bonferroni correction (see page 115). However, some QoL scores
are likely to be correlated, so this approach would make the p-value larger
than it should be. Alternatively, the unadjusted p-values and 99% confidence intervals might be presented, as in Table 9.2, with caution given to pvalues that are just under the conventional cut-off of 0.05, because this does
not provide good evidence for an effect. Strongest effects are found when
p<0.001.
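For example, a simple Bonferroni adjustment multiplies each p-value by the number of comparisons (a minimal sketch; as noted above, it is conservative when the domains are correlated):

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of
    comparisons, capping the result at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

# the six unadjusted p-values from Table 9.2
print(bonferroni([0.54, 0.74, 0.98, 0.70, 0.004, 0.005]))
# insomnia becomes 0.004 x 6 = 0.024, still below 0.05 after adjustment
```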
Some QoL instruments can be reduced to one or two domains. For example, the 12 questions on the SF-12 reduce to eight domains, which could be
reduced further to two (‘physical’ and ‘mental’), thus avoiding having several
comparisons. However, there could be treatment effects on specific domains,
which are masked when they are aggregated.
Missing data
Consideration should be given to how to handle missing data, because there
could be reasons for this that bias the results, such as more subjects in one
trial arm who were too ill to complete the questionnaires. If the proportion
of subjects with missing data and the reasons for missingness are similar
between the trial arms, the results are unlikely to be biased. When estimating effect sizes at a single time point (say 12 months) a subject who has only
provided QoL data at six months presents a dilemma. They are unlikely to retrospectively complete the 12-month form reliably (unlike many clinical endpoints which could be obtained from hospital files retrospectively). This subject could be excluded from the analysis or the six-month value could be used
as the 12-month value (referred to as ‘last value carried forward’). Neither
approach is perfect, but both are simple. There are also statistical methods
called imputation which involve estimating what the subject’s value might be,
perhaps based on other data from the subject or data from other trial subjects.
Methods such as mixed modelling use whatever data are available, without the need for imputation, though this assumes that missing data are randomly distributed between the trial arms.
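A minimal sketch of last value carried forward (Python; None marks a missing questionnaire):

```python
def last_value_carried_forward(visits):
    """Replace a missing QoL value (None) with the most recent
    available value; a gap before any observed value stays missing."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# scores at baseline, 6 and 12 months; the 12-month form was not returned
print(last_value_carried_forward([64, 58, None]))  # [64, 58, 58]
```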
Type of analysis
Where possible, an intention-to-treat (ITT) analysis should be performed,
as is standard practice for clinical trial endpoints. For equivalence or noninferiority trials a per protocol analysis could also be conducted (see page
116). However, the problem with trial subjects who do not take their allocated
treatment (non-compliers) is that while it may be possible to obtain information on clinical endpoints from hospital files, and so include them in an ITT
analysis, it is unlikely that QoL data will be available. If the subject has chosen to stop treatment, they may also decide not to complete any further trial
forms. Consideration should then be given to whether there is a high proportion of non-compliers and, if this proportion differs between the trial arms, whether there is potential for bias in the observed results.
Interpreting QoL scores
QoL measures are subjective, and as such could be affected in trials in which the allocated intervention is known (i.e. unblinded trials). When interpreting QoL data, consideration should be given to the possible effect of the lack of blinding.
Treatment efficacy based on effect sizes such as the risk difference, relative risk and number needed to treat can often be described in a way that many people understand, for example, a reduced chance of developing a disorder or an increase in survival time. However, one of the challenges with using QoL
results is how to translate them into practice. For many subjects or health professionals it may be difficult to interpret a specified difference in scores. For
example, how would a difference of −11.9 points for insomnia be interpreted
by a patient (Table 9.2)? Also, how much worse is a mean difference in constipation of −20 compared to −10 points? Describing the effect as small, medium
or large could be one way of summarising the average beneficial and negative
effects, without trying to interpret the actual effect size.
9.2 Health economic evaluation
In most societies, financial resources for health care are limited. With advances
in medical treatments, and an ageing population in many developed countries, governments need to monitor how much to spend on public health
and hospital services. Health economic evaluation is therefore an increasingly important consideration when investigating a new intervention, especially with the rising costs of many new drugs, for example, in cancer. Cost-effectiveness analysis is often used as a broad description for an economic
evaluation, but the term has a more specific meaning.
Several countries have processes for evaluating the cost-effectiveness of
new interventions. How they do this depends on the health care system in
place. Examples of institutions that perform these types of analyses are the US
Food and Drug Administration, the National Institute for Health and Clinical
Excellence (NICE, UK), Haute Autorité de Santé (HAS, France) and the Institute for Quality and Efficiency in Health Care (IQWiG, Germany). On the basis
of an evaluation of efficacy and health economic costs, these institutions may
choose which interventions to recommend for routine use. It is possible that
some treatments, although effective, are not recommended because they are
judged to be too expensive in relation to the clinical benefit.
What is economic evaluation?
There are three features of an economic evaluation in a trial:
• There is a comparison between two or more interventions, even if the comparison group received no intervention
• The treatment effect on a clinical endpoint(s)
• The costs, particularly financial costs, associated with the interventions.
The purpose is to consider both treatment efficacy and costs, and to determine whether or not a new intervention is more cost-effective than another intervention (Figure 9.2).

Figure 9.2 Comparing the effectiveness and financial costs of a new intervention (A) with an alternative (e.g. standard) treatment. The horizontal axis could be the effect size (e.g. risk difference, or difference in QALYs), and the vertical axis could be the difference in financial costs between the interventions. [Diagram not reproduced.]

It is not the case that the cheapest treatment is always
recommended. The treatment of choice is likely to be both cheaper and more
clinically effective. However, side-effects and QoL may also be considered. A
treatment that is both less effective and more expensive would not be recommended. If two interventions have a similar effect, then the choice might be
based on costs. Difficulty arises when the treatment is more effective, but more
expensive (NE quadrant of Figure 9.2), in which case it is necessary to determine whether the extra costs are worth the improvement in efficacy. Similarly,
if an intervention is less costly but less effective (SW quadrant), is the loss of
efficacy justified by the cost savings? In both these situations, there may need
to be a trade-off.
Types of economic evaluation
Costs in an economic evaluation are measured in monetary values, and generally fall into the following categories:
• Cost to the service provider (hospital or health service) of administering the interventions (e.g. treatment cost, cost of assessments or hospital stay)
• Cost to the subject (e.g. travel to hospital)
• Societal costs (e.g. number of work days lost).
The first of these is perhaps the easiest to obtain, because it is associated with
clearly defined items, such as the cost of a drug, X-ray, blood test or treatment for a drug-related toxicity. The costs of such services are often obtained
from published list prices, nationally available costs or the internal estimates
from organisations, such as a hospital. The other two categories can cover a
wide range of activities, including valuing lost days of work due to illness,
or inconvenience to the subject if they have to travel to a hospital to receive
a new treatment. Allocating a monetary value to some of these activities may
not be easy. Many health economic evaluations concentrate on the cost to the
service provider. It is also usual practice to standardise future financial costs to their present-day values, allowing for inflation (discounting).
The effect of discounting is small when the costs and health benefits occur
at the same time, but large when costs could be incurred over many years.
Annual discount rates are typically 3–6%. For example, a cost of £5,000 per
year over 25 years corresponds to a total of £125,000 if unadjusted, but it
becomes £68,000 if discounted using a fixed annual rate of 6%.
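The arithmetic behind this example can be sketched as follows (assuming, for illustration, that each year's cost falls at the start of the year; the exact total depends on that assumption):

```python
def discounted_total(annual_cost, years, rate):
    """Total cost with each year's payment discounted back to
    present-day value at a fixed annual rate."""
    return sum(annual_cost / (1 + rate) ** t for t in range(years))

print(25 * 5000)                                # 125000 unadjusted
print(round(discounted_total(5000, 25, 0.06)))  # 67753, about the £68,000 quoted above
```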
Four methods of health economic analyses are presented below, though the
first two are the most commonly used. Many trials now collect subject-level
cost data, i.e. there is a cost estimate for treatments and/or assessments for
each trial subject.
Cost effectiveness analysis
When clinical efficacy is expected to differ between two interventions, unit
costs can be considered in relation to the clinical outcome. One intervention is
recommended over another if it is cheaper and more effective. But when the
better treatment is more expensive the incremental cost-effectiveness ratio
could be examined. This is illustrated in Table 9.3, based on a hypothetical randomised trial comparing two interventions (A and B). The number of deaths at
one year is the main endpoint. The extra cost associated with saving five more
lives is £20,000 using Treatment A, or £4,000 per extra life saved. Whether this
is worthwhile is a judgement made by the service provider, who may also consider whether the same money could be invested elsewhere but with a greater
benefit.
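As a sketch (Python; the numbers are those of Table 9.3 below):

```python
def icer(cost_a, cost_b, effect_a, effect_b):
    """Incremental cost-effectiveness ratio: the extra cost per
    extra unit of benefit of intervention A over intervention B."""
    return (cost_a - cost_b) / (effect_a - effect_b)

# Table 9.3: 485 vs 480 survivors at one year among 500 treated per arm
print(icer(30000, 10000, 500 - 15, 500 - 20))  # 4000.0, i.e. £4,000 per extra life saved
```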
Table 9.3 Example of a simple cost-effectiveness analysis.

                                      Intervention A   Intervention B   Absolute difference (A minus B)
Number of subjects treated            500              500
Number of deaths at one year          15               20               −5
Total cost of treating the group      £30,000          £10,000          £20,000
of subjects

Incremental cost-effectiveness ratio = difference in costs / difference in efficacy = 20,000 / 5 = £4,000

Other examples of incremental cost-effectiveness ratios are the cost per year of life gained, or the cost of detecting/diagnosing one extra affected individual in trials of screening and diagnostic tests. This analysis allows comparisons between different medical areas if it is possible to express these in the same unit of measurement (e.g. cost per life year saved). It is also possible to produce 95% confidence intervals for the cost-effectiveness ratios, using methods such as bootstrapping.
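A minimal bootstrap sketch is shown below (Python with NumPy, using hypothetical subject-level cost and effect arrays; subjects are resampled with replacement within each arm, keeping each subject's cost and effect paired, and ratio statistics need care if the effect difference can approach zero):

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_icer_ci(costs_a, eff_a, costs_b, eff_b, n_boot=2000):
    """Percentile bootstrap interval for an incremental
    cost-effectiveness ratio from subject-level data (NumPy arrays)."""
    ratios = []
    for _ in range(n_boot):
        ia = rng.integers(0, len(costs_a), len(costs_a))  # resampled subjects, arm A
        ib = rng.integers(0, len(costs_b), len(costs_b))  # resampled subjects, arm B
        ratios.append((costs_a[ia].mean() - costs_b[ib].mean())
                      / (eff_a[ia].mean() - eff_b[ib].mean()))
    return np.percentile(ratios, [2.5, 97.5])
```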
Cost utility analysis
This type of analysis incorporates quality of life measures. The financial costs
of two interventions are compared with the outcomes measured in utility-based units. The most commonly used measure is the Quality Adjusted Life
Year (QALY), which allows both the number of years and quality of life gained
associated with a new intervention to be examined. Questionnaires such as the
EQ-5D, SF-6D, HUI2 or HUI3 allow QALYs to be calculated. This is illustrated
in Table 9.4. When there are several values from different time points, the area
under the curve is used to calculate the total QALY for each subject.
A QALY for one year is in the range 0 (death) to 1 (perfect or full health).
A subject can occasionally consider some states to be worse than death, so the
value could be less than 0. A QALY of 0.7 means that the subject’s quality of
life is worth 0.7 of a year of full health. In Table 9.4, the subject’s QALY for
the first year is given, but values for subsequent years can be summed. For
example, if the total over three years were 2.5, the subject has experienced 2.5
QALYs out of a possible total of three. If a new treatment extends a subject’s
life for one year, but at half full health it is associated with an increase of 0.5
QALYs (1 × 0.5). If life is extended by three years at half full health there is an
increase of 1.5 QALYs (3 × 0.5).
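The trapezoidal calculation used in Table 9.4 can be sketched in a few lines (Python; equal three-month intervals are assumed):

```python
def qaly_auc(scores, interval_years=0.25):
    """Area under the curve of utility scores measured at equal
    intervals, using the trapezoidal rule (as in Table 9.4)."""
    return sum((a + b) / 2 * interval_years for a, b in zip(scores, scores[1:]))

# EQ-5D scores at baseline, 3, 6, 9 and 12 months
print(round(qaly_auc([0.80, 0.75, 0.72, 0.68, 0.61]), 2))
# 0.71; Table 9.4 rounds each term separately, giving 0.70
```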
Table 9.4 Calculating a QALY for one trial subject who has provided four EQ-5D responses in one year (i.e. every three months) in addition to the baseline value.

Month of EQ-5D   EQ-5D score   Time period (months)   Calculation (a)          QALY
0                0.80
3                0.75          0–3                    (0.80 + 0.75)/2 × 0.25   0.19
6                0.72          3–6                    (0.75 + 0.72)/2 × 0.25   0.18
9                0.68          6–9                    (0.72 + 0.68)/2 × 0.25   0.17
12               0.61          9–12                   (0.68 + 0.61)/2 × 0.25   0.16
                                                      Total                    0.70 (b)

(a) The calculation involves taking the average of two consecutive QoL scores then multiplying by the time interval in years (3 months is 0.25 years)
(b) This is the area under the curve for this subject (the calculation is simple when the time interval between responses is the same throughout)

Suppose two interventions, A and B, are compared in a trial, and each subject's health state is obtained over several time points (as in Table 9.4). Every subject has an 'area under the curve' value, and the mean area under the curve is obtained in each trial group: Mean AreaA and Mean AreaB. There are also
the financial costs of administering the interventions to the subjects in groups
A and B: CostA and CostB (e.g. the mean cost in each group). An incremental cost-effectiveness ratio can then be calculated, similar to that mentioned
above:
Ratio = (CostA − CostB) / (Mean AreaA − Mean AreaB)
e.g. Ratio = (£25,000 − £10,000) / (3.2 − 1.5) = £15,000 / 1.7
The result, £8,823, is the cost per one QALY gained (i.e. the marginal cost), and
it indicates how much it costs to gain an extra year of healthy life using the
new intervention. This is a common measure used in economic evaluation,
and it allows health service providers to compare different interventions for
the same disorder, and also interventions in unrelated areas of medicine.
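Continuing the example above (a sketch only; the inputs are the mean cost and mean area under the curve in each arm):

```python
def cost_per_qaly(cost_a, cost_b, mean_area_a, mean_area_b):
    """Marginal cost per QALY gained by intervention A over B."""
    return (cost_a - cost_b) / (mean_area_a - mean_area_b)

print(round(cost_per_qaly(25000, 10000, 3.2, 1.5)))  # 8824, about the £8,823 quoted above
```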
Service providers often produce guidelines on what might be considered
to be a cost-effective intervention, based on the cost per QALY gained. For
example, the National Institute for Health and Clinical Excellence in the UK does not look
favourably on interventions with a cost per QALY gained that far exceeds a
specified amount (£20,000 to £30,000 in 2008).6
There are also analytical methods that allow for variability in the effect size
and in the financial costs (for example, bootstrapping), producing a range
of costs per QALY gained, similar in principle to confidence intervals. This
method of economic analysis is preferred by many organisations that conduct
health technology assessments because it incorporates quality of life, based
on a common outcome measure, and can produce a range of estimates that
indicates the uncertainty around the decision to adopt a new technology.
Cost minimisation analysis
When two interventions are expected to have a similar clinical efficacy, the
decision on which to choose may rest on financial costs, i.e. which is the cheapest treatment. There should be evidence from equivalence trials. The effect
size, such as relative risk, hazard ratio or difference between two means,
should fall within a relatively narrow window around the no effect value.
This method of analysis is not often used because it does not easily allow for
variability in treatment effects or costs. Also, it is better to consider efficacy
and costs together, as in the approaches given above.
Cost benefit analysis
In a cost benefit analysis, all outcomes are valued in monetary terms, including treatment efficacy. Subjects are asked to estimate how much they would be
willing to pay for a certain increase in health (e.g. one extra year of life) associated with a new intervention. This could then be directly compared with
the costs that arise from the intervention (e.g. costs to the health provider).
However, there are several difficulties with this approach, including subjects
having to understand the full implications of having the new intervention or
not, and that willingness to pay may vary greatly between subjects. It is therefore not a commonly used method.
9.3 Summary
• It is sometimes useful to examine the effect of a new intervention from the subjects' perspective
• Health-related quality of life (QoL) attempts to quantify various attributes such as mental and physical well-being
• When conducting a trial, consideration should be given to the number of different QoL instruments used, and how often subjects are expected to complete them
• As more interventions are developed, there is a need for health economic evaluations, particularly where financial resources are limited
• Several methods of analysis are available, but cost-utility analysis is commonly performed. It produces a financial cost per quality-adjusted life year (QALY) gained, and allows different interventions to be compared.
References
1. Ware JE, Kosinski M, Turner-Bowker DM, Gandek B. How to score Version 2
of the SF-12 Health Survey. Quality Metric Inc., Lincoln, RI 2002. http://www.
qualitymetric.com/
2. The EuroQol Group. EuroQol – a new facility for the measurement of health-related quality of life. Health Policy 1990; 16:199–208. http://www.euroqol.org/
3. Torrance GW, Feeny DH, Furlong W et al. Multi-attribute preference functions for a comprehensive health status classification system. Health Utilities Index Mark 2. Med Care
1996; 34:702–722.
4. Feeny DH, Furlong W, Torrance GW et al. Multi-attribute preference functions for a comprehensive health status classification system. Health Utilities Index Mark 3. Med Care
2002; 40:113–128.
5. Aaronson NK, Ahmedzai S, Bergman B et al. The European Organisation for Research
and Treatment of Cancer QLQ-C30: A quality-of-life instrument for use in international clinical trials in oncology. J Nat Cancer Inst 1993; 85:365–376. http://www.
eortc.be/home/qol/
6. National Institute for Health and Clinical Excellence. http://www.nice.org.uk/
Further reading
Health-related quality of life
Addington-Hall J, Kalra L. Who should measure quality of life? BMJ 2001; 322:1417–1420.
Carr AJ, Gibson B, Robinson PG. Is quality of life determined by expectations or experience?
BMJ 2001; 322:1240–1243.
Carr AJ, Higginson IJ. Are quality of life measures patient centred? BMJ 2001; 322:1357–1360.
Farsides B, Dunlop RJ. Is there such a thing as a life not worth living? BMJ 2001; 322:1481–
1483.
Fayers P, Hays R. Assessing quality of life in clinical trials. 2nd edn. Oxford University Press,
2005.
Fayers PM, Machin D. Quality of life: assessment, analysis and interpretation. John Wiley & Sons,
Ltd, 2000.
Higginson IJ, Carr AJ. Using quality of life measures in the clinical setting. BMJ 2001;
322:1297–1300.
Spiegelhalter DJ, Gore SM, Fitzpatrick R et al. Quality of life measures in health care. III:
resource allocation. BMJ 1992; 305:1205–1209.
Streiner DL, Norman GR. Health measurement scales. 3rd edn. Oxford University Press, 2003.
Health economics
Byford S, Raftery J. Perspectives in economic evaluation. BMJ 1998; 316:1529–1530.
Byford S, Torgerson DJ, Raftery J. Cost of illness studies. BMJ 2000; 320:1335.
Glick HA, Doshi JA, Sonnad SS, Polsky D. Economic evaluation in clinical trials. Oxford University Press, 2007.
Palmer S, Byford S, Raftery J. Types of economic evaluation. BMJ 1999; 318:1349.
Palmer S, Torgerson DJ. Definitions of efficiency. BMJ 1999; 318:1136.
Raftery J. Economic evaluation: an introduction. BMJ 1998; 316:1013–1014.
Raftery J. Costing in economic evaluation. BMJ 2000; 320:1597.
Robinson R. Economic evaluation and health care: What does it mean? BMJ 1993; 307:670–
673.
Robinson R. Economic evaluation and health care: Costs and cost-minimisation analysis.
BMJ 1993; 307:726–728.
Robinson R. Economic evaluation and health care: Cost-effectiveness analysis. BMJ 1993;
307:793–795.
Robinson R. Economic evaluation and health care: Cost-utility analysis. BMJ 1993; 307:
859–862.
Robinson R. Economic evaluation and health care: Cost-benefit analysis. BMJ 1993; 307:
924–926.
Robinson R. Economic evaluation and health care: The policy context. BMJ 1993; 307:994–996.
Torgerson DJ, Byford S. Economic modelling before clinical trials. BMJ 2002; 325:98.
Torgerson DJ, Campbell MK. Use of unequal randomisation to aid the economic efficiency
of clinical trials. BMJ 2000; 321:759.
Torgerson DJ, Campbell MK. Cost effectiveness calculations and sample size. BMJ 2000;
321:1697.
Torgerson D, Raftery J. Measuring outcomes in economic evaluations. BMJ 1999; 318:1413.
Torgerson DJ, Raftery J. Discounting. BMJ 1999; 319:914–915.
CHAPTER 10
Setting up, conducting and reporting trials
Setting up and conducting clinical trials is more difficult than it was several years ago, largely because of increased regulations and required governance responsibilities. This chapter summarises the clinical trial process
(Box 10.1). Some sections only relate to trials evaluating a drug or medical
device. Although details vary between countries, there are some fundamental similarities. Current requirements and timelines for regulatory and ethical
approval should be checked [See page 185 for glossary of common terms].
10.1 Pre-trial
Establishing a working group and the trial team
A small multidisciplinary working group of key people (say three to five)
should initially develop the project. They agree the trial objectives and endpoints, and share responsibility for writing the trial protocol and, perhaps,
the grant application. The group should include relevant health professionals,
a statistician and other speciality members, for example, trial co-ordinator,
pathologist or health economist.
After securing funding, the group can expand to form the trial team (also called the trial management group, or trial steering group/committee) to manage the trial over its entire duration. It could additionally include expertise in
data management, regulations and safety monitoring, IT (database and randomisation systems) and some investigators from the larger centres.
Estimate the financial costs of the trial
Clinical (especially multi-centre) trials can be difficult and expensive to set
up and conduct. Staff funding and resources necessary for planning, trial initiation and conduct, follow up and statistical analyses should not be underestimated.
The number and type of staff required will depend on the complexity of
the trial and sample size, and could include a trial co-ordinator, data manager, statistician, pathologist, laboratory technician, research nurse, health economist and someone with expertise in quality of life. The salary costs for staff will depend on how much time they will spend on the trial and where the work will be undertaken (central trials unit, and patient recruitment at centres). For example, large trials often require at least one full-time dedicated member of staff, but small trials, where few subjects are recruited per month, should only require part of a person's time for day-to-day trial management.

Box 10.1 Key elements of the trial process

Pre-trial
Establish working group to develop idea
↓
Estimate the financial costs
Secure grant funding (when required)

Trial set-up
Develop trial protocol, patient information sheet, consent form
↓
Obtain EudraCT number (EU only)*
Record trial on an international clinical trials register
↓
Obtain authorisation from regulatory authority
Obtain ethical approval
Implement procedures for drug handling
↓
Develop the case report forms
Develop the necessary agreements and contracts
Obtain approval from each site
Set up trial in centres (e.g. site assessment & initiation)
Activate sites

During trial
Conduct trial; monitor progress in sites
Independent Data Monitoring Committee review
↓
Send annual safety report to regulatory authority*
Send Annual Progress & Safety report to Ethics Committee

End of trial
Lock database
Close trial (inform regulatory authority*)
Sponsor & recruiting sites should store all relevant documentation
↓
Publish first efficacy and safety results
↓
Long-term follow up (efficacy and/or safety)#

* Where required. This is usually only for trials investigating an investigational medicinal product
# Not always done
Other costs could include:
• Those to be met by recruiting centres: for example, extra clinical assessments, extra blood or tissue samples, pathological reviews or laboratory analyses
• Office and travel expenses; printing protocols and case report forms (page 171), and travel and other costs for the trial group meetings, centre initiation and monitoring visits (see pages 174 and 179)
• Applications to the regulatory authority, where necessary, and sometimes the independent ethics committee (see pages 167 and 169)
• A fee for each patient recruited, sometimes required by centres.
Secure funding
Therapies evaluated by the manufacturer, for example, pharmaceutical companies, usually have internal funding. Non-commercial organizations, such as
universities, hospitals or public sector departments, must usually seek external funding. Grants to conduct clinical trials may come from governmental
bodies, charities and private benefactors, but funds are limited and competitive. Although the format of the application forms will vary, many aspects are
often covered by the trial protocol (page 161).
Funding bodies usually seek value-for-money, and may not look favourably
on a small, very expensive trial. Dividing the total grant requested by the
expected number of subjects gives a crude cost per subject. If this looks high, it
is worthwhile justifying clearly why the resources are essential. Many funders
specify what costs they will not cover, for example, a new drug from a pharmaceutical company. It is worthwhile listing centres that have already agreed,
in principle, to participate.
A grant application is more likely to succeed if the trial has the potential to change practice, or sometimes to provide further valuable information
on a disorder, perhaps leading to larger studies. It should be produced by
a working group that has discussed key issues thoroughly, so that these are
not picked up for the first time by the funding committee or their external
reviewers.
Do the interventions need regulatory approval?
Many regulations focus on trials of an investigational medicinal product (IMP,
in the EU), or investigational new drug (IND, in the US or Japan). Classifying
a trial drug as an IMP, or not, determines the paperwork required to obtain
the approvals and which systems must be in place during the trial. Generally,
a substance, or combination of substances, is an IMP if the trial aims to determine whether it affects disease treatment, detection or diagnosis, or prevents
disease or early death (see pages 2 and 190). In some countries, it may also
include substances used to restore, correct or modify physiological functions
(e.g. in the EU). An IMP could be a new drug that is not licensed for human
use, or one that already has a marketing authorisation. Some countries have
Box 10.2 Sponsor and investigator
• Sponsor: An individual, company, institution or organisation which takes responsibility for the initiation, management and financing of a clinical trial. The sponsor must ensure that the trial is conducted in accordance with any relevant regulations and guidelines.
• Chief investigator (CI): A single named person responsible for the trial design and conduct, though the sponsor has ultimate responsibility. The CI is often the person who conceived the idea for the trial or may be a key opinion leader in the disease area. He/she is named as the lead investigator on applications for regulatory and ethical approval. The CI often works in the same institution that acts as the sponsor.
• Investigator: A person who is responsible for conducting the trial at a site (centre). Where a group of people are involved in trial conduct at the site, one person should be identified as the principal investigator (PI).
additional legislation relating to certain medical devices and exposures such
as radiotherapy. The local regulatory authority can advise on this.
10.2 Trial set up
Many research departments exist primarily to design, set up and
analyse clinical trials, and this work is central to pharmaceutical companies.
These organisations have permanent staff in place, including clinicians, statisticians, trial co-ordinators and IT personnel. Where there is limited direct
access to such resources, it is advisable to seek advice.
Identify the lead trial researcher, sponsor and
recruiting investigators
A Sponsor is the institution with ultimate responsibility for the trial design
and conduct (Box 10.2). An individual is rarely the sponsor because of the
legal, insurance and indemnity implications. The chief investigator is the key
researcher for the trial, and often first developed the idea. A principal investigator is an individual responsible for the trial at a single recruiting centre.#
It is usually easy for multi-national commercial companies to sponsor
international trials if they operate with legal status in the relevant countries.
However, a university, for example, acting as a sponsor does not normally
have a legal status in another country. Certain responsibilities of trial conduct
and safety monitoring must therefore be delegated to named individuals or
institutions in each foreign country, and this would need to be specified in an
agreement (see page 172).
# Chief investigator and principal investigator are not standard terms, but their roles are.
Many pharmaceutical companies employ an independent commercial contract research organisation (CRO), to conduct the trial on their behalf.
The pharmaceutical company remains the named sponsor, but many of the
responsibilities are delegated to the CRO.
Potential recruiting centres, often called sites, should be identified, with a
realistic estimate of the number of expected subjects per site (investigators
tend to over-estimate this). This helps ensure that the target sample size is
feasible in a reasonable timeframe.
Trial protocol
The protocol provides justification for the trial, details of the design, and a set
of instructions for sites and the co-ordinating centre, describing how subjects
are to be recruited, treated (with the trial interventions and other treatments,
where necessary) and followed up, and the systems in place for safety monitoring. It ensures the trial is conducted to a similar standard across all sites
and it ultimately reduces variability, making it easier to find a treatment effect
if it exists. The protocol is signed off by the sponsor and chief investigator.
While some trial protocols can be up to 20 pages, those for IMP trials are
often longer, because they require more details on trial conduct, administration of treatments and safety monitoring. The protocol should contain a clear
plan of what will happen to subjects, from the time they consent to participate to the time they leave the study. For non-drug trials (e.g. surgical interventions, or changes in behaviour or lifestyle), it is important to describe the
delivery of the intervention clearly in the protocol, to ensure consistency and
standardisation across the trial. Table 10.1 shows suggested key sections in
a protocol, Figure 10.1 shows a simple flow diagram giving an overview of
the trial, and Table 10.2 is an example of how to summarise the timing of
assessments.
The objectives (aims) of the trial should not be confused with the endpoints.
The outcome measure is quantifiable and used to address the objective (Box
10.3). There should be one, or at most two or three, primary objectives, each associated with an endpoint. Other objectives and endpoints should be referred to
as secondary. The primary endpoint is the one that would change practice.
The wording of the objectives should be consistent with the phase of trial.
Phase I and single-arm phase II studies are often easy to describe. However,
the wording for a randomised phase II trial may make it look like a phase III
trial. Phase II studies only provide preliminary evidence on the effectiveness
of a new intervention, so the objective could use words such as ‘to examine’
or ‘to investigate’, to avoid suggesting that the trial results will conclusively
show whether the new treatment works or not. For phase III trials, stronger
words such as ‘to evaluate’, ‘to show’ or ‘to determine’ (Box 10.3), are perhaps
more appropriate.
Table 10.1 Suggested sections in a trial protocol.

Chief investigator and Sponsor
• Name and address of one individual who is the overall lead, and of the representative of the institution acting as the sponsor
• Both need to sign and date each version

The trial management group
• The names, affiliations and roles of members of the trial team

Background to the trial
• A fairly concise summary of: the scale of the disease burden (e.g. prevalence or incidence); the effect of current interventions; and reference to trials or systematic reviews that are relevant to the proposed trial

Justification for the trial
• A summary of why a trial is needed now, and how it may be expected to change practice, or be used to justify further studies
• The current evidence and biological plausibility for the new intervention

Trial design
• Type of trial — phase I, II or III; randomised or single-arm; single- or double-blind

Objectives
• There should be one or two primary ones (used to determine whether health practice should change), and possibly several secondary ones (that would provide additional information)

Outcome measures
• There should be primary or secondary endpoints, each corresponding to the specified objectives

Sample size
• There should be enough information for the sample size to be reproduced independently, with reference to the expected treatment effects from published or unpublished work, or an effect that is the least clinically important

Target population
• A list of the inclusion and exclusion criteria (see page 11)

Interventions
• A clear description of what the trial interventions are and how they will be administered, including the dose and frequency (if applicable) and duration of treatment
• Other treatments to be given at the same time should be specified
• If free drugs or medical devices are supplied, from say a pharmaceutical company, this should be stated with a summary of how they will be supplied, documented and, where appropriate, destroyed

Recruitment and Follow up
• The length of the recruitment and follow-up periods should be specified; when added together they should represent the total length of the trial
• Many trials will specify an 'active' phase, which ends when the last patient has completed the last protocol visit, after which the follow-up phase begins

Assessments of subjects
• A detailed description of how subjects will be assessed – how frequently, what will happen at each visit (such as clinical examinations, tests and any other evaluation), and how long subjects will be in the trial

Case report forms (see page 170)
• A list of what they are, when they should be administered (Table 10.2) and who should complete them (the subject or researcher)

Safety monitoring
• A list of any known expected adverse reactions associated with the trial treatments
• A description of the procedures for identifying, monitoring and reporting all adverse events

Consent and trial approvals
• A summary of the procedures for obtaining informed consent, and ethical and regulatory approval

Statistical analyses
• A description of the main statistical methods to be used to analyse the main and secondary endpoints
• Specification of the analyses that would be based on intention-to-treat or per-protocol
• Specification of subgroup analyses
• If there are any planned interim analyses, specify how they will be used

Insurance and indemnity
• Mention what cover is in place if a subject is harmed through participating in the trial

Ownership
• A statement about ownership of the trial data

[Figure 10.1 Example of a flow diagram of a trial of a flu vaccine; diagram not reproduced.]

Table 10.2 Example of a table that summarises the timing of assessments.
[Assessments (a) — subject history, clinical examination, blood sample and CT scan — and other case report forms (quality of life) are marked with an X against the visits at which they are due: baseline and 3 months (intervention period), and 6, 12, 18 and 24 months (follow up); layout not reproduced.]
(a) There should be a CRF to record this information as it is collected; for example, six clinical examinations should yield six CRFs.

Box 10.3 Examples of descriptions of objectives and endpoints

Phase I
Objective: To determine the maximum tolerated dose of a new therapy for advanced colorectal cancer
Outcome measure (endpoint): The number of patients who suffer a dose-limiting toxic event

Phase II
Objective: To investigate Drug A in patients with Parkinson's disease
Outcome measure (endpoint): The proportion of patients who progress after one year

Phase II
Objective: To examine the potential effect of Therapy B for lung cancer
Outcome measure (endpoint): The proportion of patients who have a partial or complete tumour response

Phase III
Objective: To evaluate the effectiveness of a flu vaccine in the elderly
Outcome measure (endpoint): The proportion of people who develop flu

Phase III
Objective: To determine the effectiveness of statin therapy in people without a history of heart disease
Outcome measure (endpoint): Mean serum cholesterol level

Phase III
Objective: To show whether Therapy D for asthma has a similar effect as standard treatment
Outcome measure (endpoint): The proportion of patients who suffer a severe asthma exacerbation.

Patient information sheet and consent form
All subjects, or their legal representative (see page 191), must give informed consent before participating in a clinical trial. Sufficient information about the trial should be provided to allow them to examine the possible benefits and
risks of taking part. Information may be provided verbally, or by video, DVD
or audio tape, but it should always be given in writing: the patient information sheet (PIS). After reading this, subjects must sign and date a consent
form, which is co-signed by an authorized staff member. Suggested sections
are shown in Boxes 10.4 and 10.5. Signed consent forms should be kept in the
site files, and a copy given to the patient.
The text in these documents should be clear and written in simple language. Additionally, there may be site- or country-specific requirements, such
as insurance, or they may need to be translated into another language (see
page 193). It is often useful to ask a few subjects or members of a patient
representative group to comment on the text before it is finalised, particularly
for complex trials or those with several arms. Both the PIS and consent form
should be signed off by the sponsor and chief investigator. The independent
ethics committee (see page 169) may recommend changes.
The subject should be neither encouraged nor discouraged to participate.
When discussing the trial with an eligible subject, the health professional
should try to maintain a position of equipoise, i.e. there is genuine uncertainty
over the effect of the new intervention. Sometimes this is difficult to do. For
example, in surgical trials, the patient expects the surgeon to recommend the
best treatment. Here, it might be better for a non-surgeon to discuss the study.
In IMP and medical device trials, subjects could be given a card to carry
showing their unique trial number, a brief description of the trial and 24-hour
contact details of trial staff or site representatives. This is common in blind
trials.
EudraCT number (EU IMP trials only)
The EudraCT (EU Drug Regulating Authorities Clinical Trials) database contains information about all IMP trials conducted in the EU. Before regulatory
or ethical approval can be sought, a EudraCT number (a unique identifier)
must be obtained via the European Medicines Agency (EMEA).1
Box 10.4 Recommended sections in a patient information sheet
• Background and justification for the trial
• A description of how subjects will be randomised, and the probability of being in each treatment arm
• A description of the trial interventions, especially identifying those that are experimental
• What the trial subject has to do as part of the trial and the expected duration of their participation
• Which tissue samples, if any, are being collected, and what will be done with them for the purpose of the trial and for future research
• What the possible side-effects of the interventions are (including the magnitude of the risks, and discomforts to the subject, as well as to any embryo, foetus or nursing infant of the subject)
• The possible benefits and disadvantages of taking part
• Alternative procedures or treatments available to the trial subject if they do not participate
• A statement about securing confidentiality of subject data and who will have access to the data
• A statement that participation is voluntary and refusal to participate will involve no penalty or loss of benefit; and that the subject may withdraw at any time
• Circumstances under which a subject's participation may be terminated by the investigator without regard to the subject's consent
• Who is funding the research
• Who to contact if there are any queries, including a 24-hour telephone number in an emergency
• A statement about liability and compensation if something goes wrong.
Register the trial
All trial reports, not just those with ‘positive’ results, should be published. To
minimise the bias associated with researchers not submitting ‘negative trials’,
or journals not publishing them, there are international systems of clinical trial
registration. All new trials should be recorded on a recognised database before
recruitment starts. This has been a legal requirement in the US since 2007,
and sponsors of marketing applications must certify that they have complied.2
Common trial registers are:
• International Standard Randomised Controlled Trial Number (ISRCTN) (http://www.controlled-trials.com/)
• www.ClinicalTrials.gov.
These databases contain information about the main objectives, design, outcome measures, duration and funding. They allow researchers to check on
other trials in progress or that have been completed. Many medical journals
require the registration number when considering an article for publication.
Box 10.5 Examples of text used in a consent form
• I confirm that I have read and understood the information sheet Version 1.0 dated 10 January 2008.
• I understand that my participation is voluntary and that I am free to withdraw at any time, without giving any reason and without my medical care or legal rights being affected.
• I understand that my medical notes may be looked at by responsible individuals or by regulatory authorities* where it is relevant to my taking part in research. I give permission for these individuals to have access to my records.
• I give permission for an extra blood sample to be taken at the start of the trial. I understand that giving this sample is voluntary and that I am free to withdraw my approval for its use at any time without giving a reason and without my medical care or legal rights being affected.
• I agree that the blood sample I have given, and the information gathered about me, can be stored at Institution's name for use in future studies.
• I agree for my family physician to be told of my participation in this study.
• I agree to participate in the trial.
• Signature of the trial subject or legal guardian.
* These could be listed, for example, trial monitors, trial co-ordinator, regulatory authority, etc.
Regulatory approval (for certain interventions)
In most countries, the clinical trial regulations specifically cover IMP or IND
(i.e. drug) trials, and sometimes medical devices. Before recruitment can
begin, regulatory authority approval must be obtained in each country in
which the trial will be conducted. The sponsor and chief investigator (Box
10.2) must be named on the protocol, and on applications to regulatory authorities and ethics committees. Each EU country has its own Competent Authority (CA) to issue regulatory approval. The equivalent body in the United
States is the Food and Drug Administration (FDA), and in Japan it is the Pharmaceutical and Medical Devices Agency (see also pages 197 and 198).
For newly developed drugs, it is sometimes useful to meet with the regulators to reach agreement on the trial design. This is particularly so for phase
III trials, which may be used later to make claims about effectiveness and contribute to a marketing authorization application. In the US, sponsors could
arrange a formal meeting with the FDA.3 In the EU, sponsors may seek scientific advice from the Committee for Proprietary Medicinal Products.
To gain regulatory approval, several documents must be submitted to the
regulatory authority (Box 10.6): called the Clinical Trial Application (in the
EU), or Investigational New Drug application (in the US).
Box 10.6 Examples of documents needed for submission for regulatory approval to conduct an IMP/IND trial
• The EudraCT number*
• Investigator's Brochure (IB); or Summary of Product Characteristics (SmPC)*
• Investigational Medicinal Product Dossier (IMPD)*
• Investigational New Drug (IND) application**
• Information about the investigators, recruiting sites and laboratories
• Details of drug manufacturing and distribution (e.g. Qualified Person documentation in the EU)
• Trial protocol and sample consent form
• Specification of measures to deal with vulnerable subjects, when required
• Completed application form
• Information about the independent ethics committee
• Fee.
* European Union trials
** US trials

The Investigator's Brochure (IB) provides detailed information about the trial drug, including:
• Its physical, chemical and pharmaceutical properties, with evidence from laboratory studies
• A description of the pharmacological, metabolic and toxicological results from animal experiments, and methodological details of these experiments
• A description of the metabolic, safety and efficacy evidence from studies of human subjects, such as phase I data, or marketing experience if the drug already has a licence
• An example of the drug label, from the manufacturer.
There should be one IB for each drug being evaluated, and it is not usually specific to a trial. The IB provides justification for the dose, method of
delivery and other biological aspects of the IMP specified in the trial protocol,
and describes the expected safety profile based on animal data and previous
human experience. This allows the trial investigators to assess the possible
risks and benefits of the drug. The IB is usually developed and updated by the
drug manufacturer (annually, or when new significant information becomes
available), with significant input from at least one clinician. The recommended
sections in an IB are listed in ICH GCP guidelines.4
For IMPs already licensed for human use, and to be used within the terms
of the marketing authorisation, the regulatory body may require a Summary
of Product Characteristics (SmPC) instead of a detailed IB.
The Investigational Medicinal Product Dossier (IMPD; EU IMP trials) provides information about the quality, safety and use of all IMPs in the trial,
including placebo or any other comparator. It allows the regulatory body
to examine the possible safety and toxicity profile of the product(s). Some
information will overlap with that in the IB, and so can simply be cross-referenced. Again, the SmPC may suffice for drugs already licensed for human
use. Requirements for the IMPD differ between countries so the regulatory
body should be consulted before submission.
In the EU, the regulatory authority aims to assess applications within
30 days, followed by five days to inform the applicant whether it has been
approved or declined. Further information may be requested, which may
extend the processing time up to 60 days. For some trials, for example, gene
therapy, genetically modified organisms, or xenogenic cell therapy (live cells
or tissue from animal sources to be given to humans), the approval process
is longer. For trials to be conducted in more than one EU member state, an
application has to be made to the competent authority in each state.
An Investigational New Drug (IND) application must be filed for trials
involving US residents. The sponsor must submit information on manufacturing and quality control, pharmacology and toxicology data, and data from
prior human studies (unless previously submitted) to the FDA. For an original IND application, a sponsor may not initiate the study until 30 days after
receipt at the FDA. For subsequent studies under the same IND, the 30-day
wait period is not required, although the sponsor proceeds with the study at
risk. If concerns arise, particularly relating to safety, the FDA may place all or
part of the trial on hold until the sponsor adequately addresses the concerns.
Thereafter, the sponsor submits annual reports to the FDA on the status of the
study, and to update the general investigational plan for the coming year.
Regulatory authorities require the current name of the principal investigator at each recruiting site (e.g. in both the US and the EU). All ‘substantial
amendments’ to trial documentation and design, must be approved by the
regulatory authority (see page 176).
Independent ethics committee approval
All proposed trials should be reviewed and approved by an Independent
Ethics Committee (IEC). It examines the trial protocol, and any documentation intended for the trial subject, such as the patient information sheet, consent form and questionnaires. The committee considers:
• The scientific justification for the trial and its design
• Acceptability to subjects, including an assessment of potential harms and benefits
• The administrative aspects of the trial, including procedures for compensating trial subjects in case of negligence
• The suitability of the investigators.
The committee may request changes to the trial design, conduct or documentation. In the EU, the committee has a maximum of 60 days to approve or
decline the application, and is permitted a single request for further information, which temporarily halts the 60-day period. There is sometimes a dialogue between the researchers and the committee if aspects of the trial need
to be resolved.
Applications are stronger if the trial design has already had independent
peer review, for example, through a grant application. Involvement of subjects or patient representatives in developing the patient information sheet
and consent form could demonstrate that the wording is likely to be acceptable to potential trial subjects.
The process for obtaining ethical approval varies between countries and
according to whether the trial is single or multi-centre. For single-centre studies a local ethics committee is appropriate. For multi-centre studies, ‘national’
approval might be possible. For example, applications in the UK are made to
a single organisation via a website, and considered by one of several committees. Approved trials may be then conducted anywhere in the UK. In other
countries, for example, the Netherlands, applications can be made to one
of several organisations, but approval also allows the trial to be conducted
throughout the country. In the United States, ethics is reviewed by an Institutional Review Board (see page 173), one at each recruiting site. Sometimes one
IRB covers several sites.
Procedures for handling trial drugs
The sponsor has ultimate responsibility for how IMPs (or INDs) or other trial-specific treatments, including placebo and any other comparator drugs, are
manufactured, transported, stored and processed during the trial. Several procedures need to be established (Box 10.7).
Requirements for handling IMPs differ between countries. In the US, the
sponsor often deals with drug quality assurance, handling and distribution, or
one or more of these functions may be delegated. In the EU, IMP manufacturers or importers must hold a manufacturing authorisation, granted through
the regulatory authority. Only the authorised holder can be involved in production, import, assembly, blinding, packaging, labelling, quality control,
batch release and shipping. A Product Specification File contains written
instructions, or refers to other documents, used to perform these activities.
At least one named Qualified Person (QP) should be responsible for the product specification file. They must sign a release certificate (QP release) for each
batch and for the final product sent to sites or trial subjects. For IMPs imported
into the EU, the QP must sign a QP release certificate, indicating that each
batch meets the appropriate standards for Good Manufacturing Practice. A
system may be needed to recall batches if there is a problem, particularly if
drugs do not have a marketing licence.
A manufacturing authorisation, QP and QP release are always needed for
unlicensed drugs in the EU. For drugs that are already marketed for human
use, these are required only if an organisation does something with the drug,
or its packaging or labelling, as part of the trial. For example, in a double-blind
trial with placebo, the drug name must be removed.
Case report forms (CRFs)
Data are always collected from trial subjects. This could come from clinical records, additional assessments and tests performed within the trial, or
questionnaires completed by the subject. These data are used to evaluate treatment compliance, efficacy and safety. Trial-specific case report forms (CRFs) are an efficient way of collecting data, because not all subject data will be useful for the trial. Examples of CRFs could be (a sketch of an electronic representation follows the list):
• Baseline CRF: Includes dates of birth and randomisation, the allocated intervention (or treatment code if the allocations are concealed), physical characteristics (e.g. weight and height), blood measurements, possibly an assessment of pre-treatment disease, and confirmation that the eligibility criteria have been met.
• Treatment CRFs: Include details of trial interventions and other treatments received by subjects, how these were administered, and which allocated trial interventions were not received and why.
• Efficacy CRFs: These collect variables that allow estimation of the treatment effect (and effect sizes), for example, date of disease occurrence or recurrence, or death; presence or absence of a disorder, characteristic or habit.
• Safety CRFs: These record variables associated with adverse events, including those that are expected. Enough space must be left for unexpected events to be recorded (this could be a separate CRF).
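To make the idea of structured CRF data concrete, here is a minimal sketch, not taken from this book, of how a baseline CRF might be represented electronically; the field names and types are illustrative assumptions only.

```python
# Illustrative sketch of a baseline CRF as a structured record, so every
# subject's data has the same fields. Field names are invented examples.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class BaselineCRF:
    trial_number: str              # unique subject identifier
    date_of_birth: date
    date_of_randomisation: date
    treatment_code: str            # pack code if allocations are concealed
    weight_kg: Optional[float] = None
    height_cm: Optional[float] = None
    eligibility_confirmed: bool = False

record = BaselineCRF(
    trial_number="T-0001",
    date_of_birth=date(1950, 3, 14),
    date_of_randomisation=date(2008, 6, 2),
    treatment_code="PK-117",
    weight_kg=80.0,
    height_cm=172.0,
    eligibility_confirmed=True,
)
```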
CRFs should be simple and relatively quick to complete, which will also
reduce the time taken to enter the data onto an electronic database. This is
particularly important for trials with long follow up, because recruiting sites
usually have limited resources and the number of trials they are involved in is
likely to increase over time. The CRFs should be developed by the trial team,
so that all key variables directly associated with the efficacy endpoints, safety
and treatment compliance are recorded. Phase I and II studies are exploratory
and based on relatively few subjects, so many variables may be needed to
obtain a clearer view of the potential safety and effectiveness of a new therapy, because this will decide whether it is investigated further. However, too
many variables, particularly for large phase III trials, may result in complex or
multiple forms. Many variables may not be used at all in the statistical analyses, and site staff may fail to ensure that key variables are completed. The
CRF format may help to determine how the data will be analysed, and the
structure of the electronic database.
Traditionally, CRFs are printed on paper, and completed by hand. These
data are then entered, again by hand, onto a trial database (see page 177).
However, some large organisations (e.g. pharmaceutical companies) use an
Electronic Data Capture (EDC) system, where site staff record subject data
electronically, directly onto CRFs on the computer screen. The data are then automatically stored in the central trial database. This minimises paperwork
and possibly time spent processing subject data, though the electronic CRFs
still need to be developed.
Agreements
The following agreements should be considered, though the names may differ
between countries. National guidelines and the sponsor will specify which are
necessary, and what details they need to contain. There are legal implications
associated with fraud (falsifying trial subjects or subjects' data), negligence,
lack of informed consent, insider trading, and withholding trial results (see
also page 192).
Clinical trial (site) agreement
This is an agreement between the sponsor (often the co-ordinating centre) and
each recruiting site, listing the roles and responsibilities of the sponsor, local
investigator and sites. It is mandatory for EU IMP trials. It aims to ensure
that the site has the necessary regulatory and ethics approvals in place before recruitment starts, and that it conducts the trial according to Good Clinical Practice, in particular handling the drug or medical device appropriately, ensuring subject safety and timely reporting of adverse events, and ensuring that all trial data are sent to the co-ordinating centre. It also specifies the sponsor's
responsibilities, for example, data management and analyses of the data, and
ensuring that the site is always informed of any relevant trial documentation
and revisions. The agreement should also outline the responsibilities of the
site pharmacy for handling and disposal of trial drugs. Specially adapted
agreements are usually required between the sponsor and international sites,
and this can take several months to finalise because of issues over insurance,
indemnity and which country’s law takes precedence. It must clearly specify
which tasks have been delegated to overseas sites. If part of the protocol is
incorrect, a site may claim compensation from the sponsor if, for example,
a subject has suffered harm as a consequence. Similarly, the sponsor may
claim compensation from the site if, for example, subjects or data have been
falsified. The agreement should state the amount of money to which each
claim is limited.
Drug supply agreement
When a manufacturer has agreed to supply a drug free of charge (and sometimes placebo), this needs to be documented, so that there is an obligation to continue the supply for the target number of subjects. The agreement is signed by the sponsor and the manufacturer; it certifies that the manufacturer is operating to Good Manufacturing Practice (see page 195) and outlines the roles and responsibilities of the parties. Where a third organisation is involved in packaging or distribution, a
tripartite agreement, or separate technical agreement (or quality agreement),
may be acceptable.
Other service agreements
Technical or service level agreements may be needed between the sponsor
and any other organisation providing services for the trial (e.g. pathology
reviews and biochemical analyses). While they may not be legally binding,
they ensure that all parties understand the detail and standards of the work
to be undertaken.
Material transfer agreement
If biological samples (e.g. blood or tissue) are to be sent from a recruiting site
to a central laboratory or depository (such as a tissue bank) for trial-specific or
future research, an agreement may be required between the site and destination organisation, to ensure that, for example, samples are handled safely and
stored appropriately. The ownership of the samples and rights to intellectual
property arising from the research should be made clear. The sponsor may
also wish to have a service agreement with the central depository to clarify the roles and responsibilities of each party in relation to the biological samples.
Institutional approval
Sponsor
The institution acting as the sponsor will have its own internal review of the trial design, conduct, protocol, and assessment of subject safety and well-being. This is because it has ultimate responsibility for the trial and may have legal responsibility to financially compensate any subjects harmed by the trial.
The sponsor must ensure that there is sufficient indemnity cover for this. Most
trials now need to have a named sponsor, whatever the intervention being
tested.
Recruiting sites
All institutions from which subjects are to be recruited will review the protocol, documentation for subjects and, where appropriate, the Investigator’s
Brochure. This is because the institution will be partly responsible for conducting the trial to Good Clinical Practice, including ensuring that subjects
have given informed consent, adverse events are being recorded and reported,
drugs are received and distributed to subjects appropriately, and that trial data
are sent to the co-ordinating centre. A clinical trial may incur local financial
costs (e.g. additional X-rays or blood tests that are not part of routine care, or
treatment costs) that would not be covered by a grant, so the site will need to
agree to meet these.
Institutional approval may be conducted by a Research and Development
committee in the UK. In the US or Japan an Institutional Review Board (IRB)
is used, which may also evaluate ethics.
Site assessment and site initiation
Potential recruiting centres may be assessed by a member of the trial team (usually the trial co-ordinator) before, or at the same time as, institutional approval is sought (site assessment). This could involve visiting the site to examine the staff, systems and procedures in place for:
• Subject recruitment
• Storage and supply of trial drugs to subjects
• Administering the trial interventions and undertaking assessments
• Data collection and completion of case report forms
• Reporting of adverse events.
Multi-centre trials may list the minimum criteria for participating sites in the
trial protocol. The purpose is to identify problems that may arise, and to judge
whether the site is able to conduct the trial according to Good Clinical Practice
(i.e. a form of risk assessment). The assessment could be a questionnaire, completed by the site, confirming that it is able to conduct the trial to a high standard. The main interest is in minimising possible risks to subjects in the trial, but also risks to the sponsor, who, despite delegating certain responsibilities to the site, bears ultimate responsibility for trial conduct across all recruiting sites.
Site initiation aims to familiarise local site staff with the proposed trial and
the protocol, and to establish a link with the trial co-ordination centre. The trial
co-ordinator may attend the site in person, or initiation may be performed by
a teleconference, particularly for relatively simple trials. Site staff involved in the trial, and their delegated functions, would be recorded.
Site assessment and initiation are common practices for multi-centre trials,
and are useful for all IMP (IND) trials because of the regulatory requirements.
For small trials or non-drug studies, the extent to which these activities are
carried out will depend on the resources available.
10.3 Trial conduct
The trial can proceed after obtaining the approvals and signed agreements.
The Sponsor, Chief Investigator, and Co-ordinator should have access to a
set of essential documents, which together form the Trial Master File (TMF,
Box 10.8). A full list of the documents in the TMF (held by the sponsor) and the Investigator Site File (held at each recruiting site) is given in ICH GCP.4 Keeping these documents in a single location allows easy review and audit. During the
trial, regular reports on progress and safety may be sent to oversight bodies
(e.g. IRBs and regulatory agencies), who will specify what is required.
Box 10.8 Contents of a Trial Master File – suggested key Essential Documents
• Investigator's Brochure∗
• Investigational Medicinal Product Dossier or SmPC∗
• Approved trial protocol∗
• Approved patient information sheet, consent form and any other documents for the subject∗
• Case report forms (CRFs)∗
• Financial aspects of the trial (e.g. letter from funder, insurance and indemnity cover)
• All signed agreements between sponsor, recruiting sites, drug supplier and other parties
• Approval letter and any correspondence with the sponsor from the regulatory agency
• Approval letters and any correspondence from all ethics committees/IRBs, including allowed advertising for subject recruitment and details of subject compensation, if applicable
• Approval letter from all recruiting centres
• Curricula vitae of the chief investigator and the principal investigator from each site, and financial disclosure forms where applicable (to identify any potential conflicts of interest)
• Sample of labels for IMPs/INDs
• Product Specification File/QP release documentation
• Current laboratory certifications and laboratory normal ranges
• List of staff and their responsibilities.

∗ These need to be dated and have version numbers so that the most current ones can be identified easily.
A full list is given in Section 8 of ICH GCP.4
Recruiting sites should also keep documents such as the protocol, the patient information sheet, signed consent forms, a list of enrolled and screened patients with their unique trial numbers, and any other documents associated with trial set-up and conduct, for example, local approval documentation, site delegation logs and curricula vitae of site staff involved in the trial.# These should be contained in an Investigator Site File. For IMP trials, a site Pharmacy File could contain a list of the trial subjects and the drugs they received, a staff delegation log, a summary of drug supply arrangements, records of drug receipt, dispensing and destruction, and any relevant local policies.

# Sites are legally obliged to keep some of these documents, depending on the regulations in that country.
Many trials have dedicated staff to set up the study, collect and maintain
all the documentation needed for the TMF, and monitor progress. This is in
addition to helping with queries from sites and possibly subjects, entering
and checking data and organising meetings of the trial team.
Changes to documents in the Trial Master File
The protocol, patient information sheet (PIS) and other documentation may
change during the trial because of the following:
• Changes to the design, for example, the original eligibility criteria may be too strict and need to be relaxed
• New information from the published literature, the data monitoring committee, or interim reports may affect current or future trial subjects, so the PIS needs to be revised accordingly
• Additional data are to be collected, so the CRFs must be revised
• The Investigator's Brochure should be updated annually, or when significant new information becomes available.
The regulatory authority should approve significant changes to the trial design or protocol, usually before they are implemented. These changes, as well as those to documentation intended for subjects, should also be approved by the independent research ethics committee that gave the original approval, again before implementation. Once approved, updated information can be disseminated to sites. Significant changes could include anything that affects:
• The safety and well-being of the trial subjects
• The scientific value of the trial
• The conduct or management of the trial
• The quality and safety of any of the IMPs.
In EU trials, any significant change is referred to as a substantial amendment.
All documents should have a version number and be dated.
Randomising subjects
An eligibility checklist, on which each of the inclusion and exclusion criteria can be 'checked off' for a subject, makes clear that the subject is eligible. This could be a case report form, with copies sent to the co-ordinating centre (see also page 85).
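As an illustration of how such a checklist might work electronically, here is a small sketch; the criteria are invented examples, not from any real protocol.

```python
# Illustrative sketch of an electronic eligibility checklist: a subject is
# confirmed eligible only when every criterion is checked off.
from typing import Dict

def is_eligible(criteria: Dict[str, bool]) -> bool:
    """A subject is eligible only if every criterion has been confirmed."""
    return all(criteria.values())

checklist = {
    "aged 18 or over": True,
    "histologically confirmed diagnosis": True,
    "no prior treatment with the trial drug": True,
    "written informed consent obtained": True,
}

if is_eligible(checklist):
    print("All criteria met: proceed to randomisation")
else:
    unmet = [c for c, ok in checklist.items() if not ok]
    print("Not eligible; unmet criteria:", unmet)
```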
Statistical analysis plan (SAP)
At the start of the trial, the statistician could draft an outline of the statistical analyses to be performed at the end. This should include assessment of treatment compliance, efficacy and safety. It may be expanded after requests for interim analyses from the data monitoring committee during the trial (see page 179), but it should be finalised before the database is ready for the full analysis. A SAP avoids many unplanned analyses at the end of a trial, for example, many sub-group analyses. However, if there are important unexpected results, the SAP should not prevent an investigation of the data beyond the pre-specified analyses.
Database
All trial data should be entered onto an electronic database. For small non-randomised trials, a simple spreadsheet might suffice. For randomised, or large, trials it is better to use a proper database system, and several are commercially available. This ensures that data entry is structured and makes researchers think more carefully about the data analysis. For IMP trials in the EU, the database needs to be fully validated and allow an audit trail of activity, in which changes to the data are clearly recorded.
A problem with using a spreadsheet is that the same variable can be entered in different ways, and numbers and characters can be mixed together:

Date of randomisation   Body weight   Cancer type
03/01/2001              80 kg         Lung cancer
21/June/2005            187 pounds    Lung CA
15-Sep-2004             95 kg         Lung tumour
These data cannot be analysed as they stand. The dates all have different formats; weight is in a mixture of kg and pounds, whereas it should be in a single unit with no text in the cells; and there are different spellings of the same disease. Statistical analysis packages would have difficulty reading this, and the data would require much manual editing before they could be analysed.
By using a dedicated database, computer screens can be made to look very
similar to the paper CRFs, making data entry easier. Automated validation
checks can minimise data entry error, or identify errors on the CRFs. For example, there could be an electronic check that the date of birth precedes the date
of randomisation, and the trial treatment dates are after the randomisation
date. Range checks could be used to identify extreme blood and physiological
measurements. The database could also help identify overdue CRFs, or key
variables that need to be chased up. It is important to ensure that information
on the main efficacy endpoints and side-effects are as complete as possible,
i.e. with minimal missing data, particularly for phase I and II trials.
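Checks of the kind described above are straightforward to express in code. The following sketch is illustrative only; the field names and the plausible ranges are assumptions, not the book's system.

```python
# Minimal sketch of automated validation checks on one subject's record:
# date-consistency checks plus a simple range check. All names are invented.
from datetime import date

def validate_record(rec: dict) -> list:
    """Return a list of validation errors for one subject's record."""
    errors = []
    if rec["date_of_birth"] >= rec["date_of_randomisation"]:
        errors.append("date of birth must precede randomisation date")
    if rec["treatment_start"] < rec["date_of_randomisation"]:
        errors.append("treatment start must not precede randomisation")
    if not (30.0 <= rec["weight_kg"] <= 250.0):   # illustrative range check
        errors.append("weight outside plausible range")
    return errors

rec = {
    "date_of_birth": date(1950, 3, 14),
    "date_of_randomisation": date(2008, 6, 2),
    "treatment_start": date(2008, 6, 9),
    "weight_kg": 80.0,
}
print(validate_record(rec))   # an empty list means the record passed
```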
Any database system must be securely stored, with access limited to relevant trial staff, and backed up regularly to minimise the amount of lost work
if the system malfunctions. It should have a disaster recovery plan in place
(e.g. in case of fire), and be sited on a robust IT network.
For double-blind trials, treatment allocation should only be visible in the
database as a drug pack code so that trial co-ordinators and other trial staff
with regular access cannot see which intervention has been allocated to each
subject (see page 86). Only the trial statistician should be able to access these data, for the purposes of the final analysis and interim reports to the data monitoring committee (see page 179).
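One simple way to realise this separation, shown here as an illustrative sketch rather than a description of any particular system, is to keep the mapping from pack code to allocated intervention in a separate, access-restricted store.

```python
# Illustrative sketch: the main database holds only pack codes; the mapping
# from pack code to intervention is held separately and is readable only by
# the trial statistician. All names and codes are invented.
main_database = {"T-0001": {"pack_code": "PK-117"}}   # visible to trial staff

allocation_codes = {"PK-117": "aspirin", "PK-118": "placebo"}  # restricted

def unblind(pack_code: str, role: str) -> str:
    """Reveal the allocation for a pack code, restricted by role."""
    if role != "statistician":
        raise PermissionError("only the trial statistician may unblind")
    return allocation_codes[pack_code]

print(unblind("PK-117", role="statistician"))   # -> "aspirin"
```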
Standard operating procedures (SOPs)
It is good practice for organisations involved in clinical trials to have a set of
Standard Operating Procedures (SOPs). These are summary guidelines specific to the working practices of the organisation that show staff how to perform certain functions. They allow staff to conduct trials to the same standard,
and new staff to quickly familiarise themselves with these practices. SOPs also
show an external auditor or regulatory inspector that clear and robust systems
are in place. Examples of SOPs are:
• Protocol writing
• Obtaining regulatory approval
• Obtaining ethical approval
• Initial site assessment (before recruitment)
• Setting up sites
• Randomisation procedure
• Database development and maintenance
• Recording and reporting adverse events
• Site visits during the trial
• Making and reporting protocol amendments
• Statistical considerations (sample size, statistical analysis plan)
• Closing the trial (chasing missing data, following up serious adverse events, ensuring that all the trial documentation is stored).
Meetings of investigators
In developing a new trial, the trial team should meet several times. This
should continue during the trial, particularly in the early stages, to quickly
identify and solve problems with recruitment, delivery of trial interventions,
non-compliance or other key issues.
Investigator meetings are generally held for multi-centre trials with at least
four or five sites. In phase II and III studies, the principal investigator from each site might be invited. The meetings are usually co-ordinated by the sponsor
and, although they are not required by the regulations, they serve to educate
and obtain consensus amongst the investigators on the design and conduct of
the study, and train them on important elements of the study. Having meetings when all the sites are ready to start recruiting can also help motivate
study personnel.
Regular newsletters to the investigators and staff at the recruiting sites
detailing recruitment, and the amount of missing data that needs to be chased
up, may be useful.
Monitoring of recruiting sites
The level of monitoring necessary for each study depends on its complexity
and potential risks to the subjects or scientific validity of the trial. The sponsor
will often assess this as part of the institutional review.
Monitoring could include checking that trial subjects really exist, that
signed consent has been obtained, that data have been recorded correctly
onto the case report forms (CRFs), and that adverse event reporting has been
appropriate and timely. Pharmaceutical companies undertake a high level of
on-site monitoring because they want their drug to be licensed, and regulatory authorities require clinical trials to be conducted according to ICH GCP
guidelines (see Chapter 11). If the guidelines have been followed closely, an application for a licence is less likely to be declined. Source data verification (SDV), often conducted by pharmaceutical companies, involves checking some entries on the CRFs against what is contained in the patient's hospital records. This can be done for all subjects (100% SDV) or a random proportion of them (e.g. 10% SDV). SDV can be an expensive activity, and there is uncertainty over whether it noticeably changes the main trial results. Furthermore, data errors should be relatively uncommon and, more importantly, randomly distributed between the trial arms. However, the regulatory authority may indicate what it believes to be an appropriate level of SDV. Where the quality of
data from a particular site is questionable, the trial team may decide that it
requires SDV.
While pharmaceutical companies have the resources to monitor trials
closely, non-commercial organisations may limit on-site monitoring activities
to confirming that subjects are real and that there is signed consent. Central
monitoring, using the electronic database, can identify errors on key variables.
Formal statistical methods can also check data for, for example, digit preference, and compare a variable from one site with the average over all sites
to detect outliers. The site would be contacted to correct or clarify identified
anomalies. Central statistical monitoring is cheaper and easier to perform than
full on-site monitoring and SDV.5
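As a rough illustration of the site-comparison idea (not the formal methods of the cited work5; the data, variable and threshold are all invented), one could flag any site whose mean for a key variable differs markedly from that of the other sites.

```python
# Rough sketch of central statistical monitoring: compare each site's mean of
# a key variable with the other sites' values, and flag large discrepancies.
from statistics import mean, stdev

site_weights = {
    "site A": [78.0, 82.0, 80.0, 79.0, 81.0],
    "site B": [77.0, 83.0, 80.0, 82.0, 78.0],
    "site C": [187.0, 185.0, 190.0, 188.0, 186.0],  # pounds instead of kg?
}

for site, values in site_weights.items():
    others = [v for s, vs in site_weights.items() if s != site for v in vs]
    z = (mean(values) - mean(others)) / stdev(others)
    if abs(z) > 3:   # arbitrary illustrative threshold
        print(f"{site}: mean {mean(values):.1f} is an outlier (z = {z:.1f}); query the site")
```

Running this flags only "site C", whose values look like a unit error; real central monitoring would use formal statistical tests rather than this crude z-score.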
Independent data monitoring committee (IDMC)
This is a group (usually three to five people) of health professionals, a statistician and other relevant experts with no direct connection to the clinical trial.
The IDMC provides an independent and unbiased review of the trial during
the recruitment and treatment period, and advises the trial team. Key functions include:
• Safeguarding the interests of subjects
• Assessing safety and toxicity
• Identifying poor recruitment
• Monitoring the overall conduct of the trial, such as treatment compliance and missing data
• Examining data on efficacy.
The composition, roles and responsibilities of the IDMC may be documented in a charter.6 Before each meeting, the trial statistician, possibly with the trial co-ordinator, prepares a report for the IDMC, summarising several trial outcomes (see the bullet points above). After reviewing the report, the IDMC will
either support continuation of the trial, or make recommendations to close
early. They may also request changes to the trial design, protocol, patient
information sheet or consent form, if any of the trial data, or other evidence,
indicates this is necessary. For double-blind trials, the report to the IDMC
may conceal the interventions (for example, A and B to indicate aspirin and
placebo), but the committee may request unblinded results if, for example,
there is an imbalance in the number of adverse events.
IDMC meetings can be in two parts. The open meeting is attended by the trial statistician, co-ordinator and other members of the trial team, such as the chief investigator. They discuss issues associated with recruitment, collection of data and adverse event reporting (not according to trial arm). Only the trial statistician, who has produced the efficacy data by trial arm, attends the closed meeting. After the meeting, the IDMC will issue a report for the trial team.
If there is clear evidence that the trial should be suspended or closed early,
the trial team and recruiting sites need to be informed quickly, particularly if
there are concerns over safety.
Suspending or closing trials early
There may be several reasons why a trial must be temporarily stopped or closed early, for example, poor recruitment, unacceptable harm, a clear treatment effect, or futility (see page 122). The decision is usually made and agreed by the trial team and the IDMC. Systems need to be in place to inform sites and subjects already recruited, if the
decision is likely to affect them directly, for example, a previously unknown
increased risk of a disorder, or the subject needs to stop the trial treatment.
The ethics committee, which originally approved the trial, should review and
approve this information before it is sent to subjects. Where, for example, the
trial has been stopped early because of poor recruitment it may not be necessary to contact subjects because those who are already in the trial could still
be followed up as intended.
An important reason for suspending or closing trials early is patient safety.
Sponsors in the EU (IMP trials) can implement urgent safety measures if there
is an immediate significant risk to the health and safety of trial subjects. This
can be done without first seeking approval from the regulatory agency or
ethics committee, though these organisations need to be informed in writing,
with clear justification, within three days. Urgent safety measures may be executed after discussion with the IDMC or a medical assessor at the regulatory
authority. For all other reasons associated with early trial closure, either temporary or permanent, the sponsor must notify both the regulatory agency and
ethics committee within 15 days, and give reasons.
10.4 End of trial
Trial closure may be implemented in two phases: closure of recruitment and
closure of follow up. When the recruitment target has been reached, sites must
be informed not to approach further potential subjects. This ‘closure to recruitment’ does not mean the end of the trial. The time point at which the trial
should formally close is usually specified in the protocol. For example, this
could be after the last recruited subject has been followed up for one year.
The sponsor is usually required to notify both the regulatory authority and
the ethics committee when this occurs (e.g. within 90 days in the EU for IMP
trials). Trials may then enter a long-term follow-up phase, collecting key data
on efficacy and safety for future evaluation. This too is specified in the protocol, but no notification is needed.
The status of the trial database should be examined, and any missing key information on CRFs should be sought from sites. Once most of these data have been received and entered, the database is locked and downloaded for statistical analysis ('database lock'). This analysis forms the basis of the first full report.
The Trial Master File and the trial database should be kept by the sponsor
for several years after the trial has closed (e.g. five years for IMP trials in the
EU), and recruiting sites also need to keep relevant trial documentation and
patient CRFs.
10.5 Monitoring adverse events
Identifying, recording and reporting adverse events are essential functions in
trial conduct. The extent to which this is done depends on the intervention, for
example, drugs, medical devices, surgery or behavioural changes. Monitoring
drug safety is often called pharmacovigilance.
An adverse event is any untoward or unintended medical occurrence or
response, whether it is causally related to the trial treatments or not. When
it is judged that the event is likely to be caused by the intervention it can be
called an adverse reaction, or adverse drug reaction in IMP trials. An adverse
event could be the occurrence of a disease or condition that directly affects the
patient’s health, safety or well-being, including ability to function. It could
also be an abnormal and significant biochemical or physiological measurement. Adverse events are not usually the same as the disease of interest, for
example, if evaluating a new drug for advanced lung cancer, death from lung
cancer is not classified as an adverse event, because it is an expected natural
process for this disorder. Death from stroke would be considered an adverse
event. However, if there are many more lung cancer deaths in the new treatment arm, stopping the trial early should be considered.
Adverse events and reactions can be expected or unexpected. They are expected when, for example, they are pre-specified in the marketing authorisation of a drug that is already licensed for human use, or in the Investigator's Brochure or Investigational Medicinal Product Dossier if the drug is not licensed.
Expected events should be listed in the trial protocol.
Adverse events or reactions, whether expected or unexpected, can be further classified as serious adverse events (SAEs) or serious adverse reactions (SARs) if the event:
• Results in death
• Is life-threatening
• Requires hospitalisation, or prolongs hospitalisation if the subject is already in hospital
• Results in persistent or significant disability or incapacity
• Results in a congenital abnormality or birth defect
• Leads to any other condition judged significant by the clinician.
They should normally be reported to the sponsor (or the co-ordinating centre)
within 24 hours of discovery. An assessment must be made of whether the
event is suspected to be causally related to the trial treatment and if it is unexpected: a suspected unexpected serious adverse reaction (SUSAR). A SUSAR
is the most important type of event, and requires special processing. If a trial
is blind then assessment of causality and expectedness could be performed as
though the patient were on the active treatment.
For IMP (IND) trials, a sponsor must report a fatal or life-threatening
SUSAR to the regulatory authority within seven days of being notified.7 If
the SUSAR is not fatal or life-threatening, the regulatory authority must be
informed within 15 days. The system and timelines are similar in many countries, including the EU, the US and Japan. The ethics committee or IRB which
originally approved the trial must also be informed, usually within the same
timeframe. The sponsor must also submit an Annual Safety Report to the regulatory authority, which includes:
• An analysis or summary of subject safety in the trial
• A list of all suspected SARs (expected or unexpected) to date
• A summary table of the suspected SARs.
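To illustrate the seriousness criteria and reporting timelines just described, here is a hedged sketch; the category labels and inputs are invented for the example, and this is not regulatory advice.

```python
# Sketch of the seriousness criteria and SUSAR reporting timelines described
# above: 7 days if fatal or life-threatening, otherwise 15 days.
from datetime import date, timedelta

SERIOUS_CRITERIA = {
    "death", "life-threatening", "hospitalisation",
    "persistent disability", "congenital abnormality", "other significant",
}

def is_serious(outcomes: set) -> bool:
    """An event is serious if any of its outcomes meets a serious criterion."""
    return bool(outcomes & SERIOUS_CRITERIA)

def susar_reporting_deadline(notified: date, outcomes: set) -> date:
    """Deadline for reporting a SUSAR to the regulatory authority."""
    days = 7 if outcomes & {"death", "life-threatening"} else 15
    return notified + timedelta(days=days)

event = {"hospitalisation"}
print(is_serious(event))                                  # True
print(susar_reporting_deadline(date(2009, 1, 5), event))  # 2009-01-20
```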
When an SAE occurs in a trial with blinding, the treatment allocation may
need to be revealed. This is almost always the case with a SUSAR, though
usually only the person who reports the event, often the trial co-ordinator,
would know the treatment allocation, and not the clinician or trial staff at the site from which the subject came. The treating clinician may need to be
unblinded if it will affect how the subject is treated.
A system may need to be in place for emergency unblinding. The request
for unblinding should come from the subject’s clinician, or from a hospital to
which the subject has been urgently admitted. During office hours, the trial
co-ordinator or other named trial staff would be contacted. At other times,
a member of staff from the co-ordinating centre who is ‘on call’, or the hospital pharmacy, should have access to the treatment allocation codes. For
international trials, it may be possible to unblind directly through the electronic trial database though this system would need to be set up carefully and
securely to avoid unnecessary unblinding. The decision to unblind must be
clearly justified and a trial clinician should be involved if possible.
Whatever system for unblinding is implemented, there is likely to be a cost
and resource issue. For trials investigating a drug that is unlicensed for human
use, it is usually clear why emergency unblinding is needed. The justification
may be less clear for common drugs, for example, a trial in adults to investigate whether aspirin could prevent cancer, though a case could be made
if a child has accidentally taken the drug. If an SAE occurred, the clinician
would and should treat the symptoms, without necessarily waiting to find
out the trial treatment. Unblinding could take place the following working
day. The need for a system for unblinding outside of office hours will depend
on the disease and treatments being investigated, an assessment by the sponsor, and ultimately the requirements of the regulatory authority. There may
also be a need to have access to 24-hour medical cover, where a clinician treating the subject can seek information about the trial and the treatments being
evaluated.
When the request to unblind is not associated with safety, relatively few
reasons are likely to be justified. The decision should then be made for each
individual, and agreed by the chief investigator or other members of the trial
team. A system for this type of review could be provided in the protocol. It
is important that the scientific validity of the trial is not adversely affected by
unnecessary unblinding.
10.6 Reporting clinical trials in the literature
Results of all trials should be reported, usually in a health professional journal.
There are detailed reporting guidelines (CONSORT).8–11 The following main
sections should be covered, though some parts may not be relevant to phase I
or single-arm phase II trials:
Trial design and conduct
• Summarise the design (e.g. phase I, II or III; whether randomised or not; single arm or multi-arm; single- or double-blind; crossover; factorial)
• Specify who was blind to treatment allocation (the clinician giving the treatment, those assessing the subject, or the subject)
• Specify the method of randomisation (simple, stratified or minimisation), state any stratification factors used, and give the block size
• Specify the inclusion and exclusion criteria
• Provide details of the sample-size calculation
• Specify how long patients were followed up before the main outcome measure was assessed
• Specify how many randomised patients were later found to be ineligible
• Specify the proportion of patients in each arm who were not available for follow up (i.e. for whom the main trial endpoint is unavailable; withdrawals)
• Mention the methods of statistical analysis.
A diagram (called the CONSORT flow chart) could be provided, showing the
number of eligible and ineligible patients randomised, the number allocated
to each intervention, the number who complied with treatment, the number
followed up and the number used in the statistical analysis. These are all
reported for each trial group.
Interventions
• Describe the trial interventions being compared, including dose, frequency, duration and method of delivery
• Mention any other treatments given to patients at the same time.
Results
• State where the trial was conducted, the number of recruiting centres and the calendar years of study
• Provide a summary table of baseline characteristics for each trial arm (without p-values)
• Provide summary measures of efficacy (effect size, 95% confidence intervals and p-values), including survival curves if using time-to-event endpoints
• Provide a summary of any side-effects observed, and whether they differed between the trial arms
• For phase I studies, provide details of the pharmacological effects.
Treatment compliance
• Define compliance, and specify the proportion of patients in each arm who did not comply with the allocated trial treatments
• If there is a clear difference between these proportions, provide reasons (e.g. side-effects).
Discussion
• Mention any limitations of the study design or analysis
• Are the results consistent with other studies? If the results are unexpected, it is useful to provide possible explanations (e.g. the subjects had less or more severe disease than originally anticipated)
• What does the trial contribute to practice?
Most journals restrict the number of words, tables and figures, so researchers
have to address the sections listed above concisely. This can be partly achieved
by presenting results in a table rather than in the text. However, many journals are now available electronically, via the Internet, allowing supplementary text, tables and figures that do not appear in the printed version to be made publicly available. Covering all the sections listed above makes it more likely that
journal editors and external reviewers will give a favourable view, because
they are able to assess the paper properly.
Conflict of interests
Many publishers require a declaration of the financial support received for a trial, any relevant patents, and any connection with the manufacturers of products or devices used. A conflict of interest, sometimes referred to as a competing interest, arises when professional judgement concerning the validity and interpretation of research could be influenced by financial gain, or by professional advantage or rivalry. Financial interests offer an obvious incentive to present a treatment in a more positive light.
Authors should state who funded the trial because this may have influenced
their interpretation of the data, perhaps subconsciously. Sometimes, the interpretation is more in favour of one intervention than the results seem to support, or the conclusions indicate that the results are more generalisable than
they really are. Authors should also declare any personal financial interests
associated with the paper, including fees they may have received from manufacturers of the trial interventions, allowing the reader to judge whether this
may have affected the trial conduct and interpretation of the results.
10.7 Summary
• Researchers should have clearly defined systems in place for trial set-up and conduct
• Many trials (usually of drugs) require approval from the regulatory authority in each country from which subjects will be recruited
• All trials should obtain independent ethical approval and institutional approval
• Sponsors of trials should ensure that all the necessary documents, contracts and agreements are in place before recruitment begins
• There should be clear systems for identifying and reporting adverse events, particularly serious events
• Trial reports should contain all the necessary details on design and analysis, with a statement about competing interests.
Glossary of common terms
CA      Competent Authority
CI      Chief investigator
CRF     Case report form
CTSA    Clinical trials site agreement
EU      European Union
FDA     Food and Drug Administration (in the United States)
GCP     Good Clinical Practice
IB      Investigator's Brochure
ICH     International Conference on Harmonisation
IDMC    Independent data monitoring committee
IMP     Investigational Medicinal Product
IMPD    Investigational Medicinal Product Dossier
IND     Investigational New Drug
IRB     Institutional Review Board
MTA     Material transfer agreement
QP      Qualified Person
PI      Principal investigator
PIS     Patient information sheet
SAE     Serious adverse event
SAR     Serious adverse reaction
SDV     Source data verification
SmPC    Summary of Product Characteristics
SOP     Standard Operating Procedure
SUSAR   Suspected unexpected serious adverse reaction
TMF     Trial master file
References
1. http://eudract.emea.europa.eu/
2. http://www.fda.gov/oc/initiatives/advance/fdaaa.html#actions
3. http://www.fda.gov/cder/guidance/2125fnl.htm
4. http://www.ich.org/LOB/media/MEDIA482.pdf
5. Baigent C, Harrell FE, Buyse M, Emberson JR, Altman DG. Ensuring trial validity by data quality assurance and diversification of monitoring methods. Clin Trials 2008; 5:49–55.
6. Damocles Study Group. A proposed charter for clinical trial data monitoring committees: helping them to do their job well. The Lancet 2005; 365:711–722.
7. http://www.mhra.gov.uk/Howweregulate/Medicines/Licensingofmedicines/Clinicaltrials/SafetyreportingSUSARSandASRs/index.htm
8. Moher D, Schulz KF, Altman DG. The CONSORT Statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. Ann Intern Med 2001; 134:657–662.
9. Campbell MK, Elbourne DR, Altman DG. CONSORT statement: extension to cluster randomised trials. BMJ 2004; 328(7441):702–708.
10. Piaggio G, Elbourne DR, Altman DG, Pocock SJ, Evans SJW. Reporting of noninferiority and equivalence randomized trials: an extension of the CONSORT statement. JAMA 2006; 295:1152–1160.
11. Ioannidis JP, Evans SJ, Gotzsche PC et al. Better reporting of harms in randomized trials: an extension of the CONSORT statement. Ann Intern Med 2004; 141(10):781–788.
CHAPTER 11
Regulations and guidelines
There are various regulations and guidelines that are associated with setting
up and conducting a clinical trial. However, the number and depth of detail
often appear overwhelming to researchers, especially those new to research.
This chapter provides an overview of the key issues covered by these regulations and guidelines, some of which were mentioned in Chapter 10. Most
laws only cover drugs and some medical devices. For further details or current
requirements, researchers should check their national guidelines, and consult
their institution or regulatory authority.
11.1 The need for regulations
Clinical trials are experiments on humans. Subjects who participate are given
an intervention that they would not normally receive, and they often undergo
additional clinical assessments and tests, including having to complete questionnaires. They agree to participate for the planned length of the trial, which
could be several years. It is therefore essential that their safety, well-being and
rights are protected. This is the main purpose of the regulations and guidelines. They also ensure that the clinical trial data are valid and robust, and can
be used to reliably demonstrate that the benefits of the intervention outweigh
the possible risks. This is a critical component in providing assurance that the
drug or medical device will be approved by regulatory authorities for use in
the wider disease population.
Regardless of what regulations or guidelines are in place, researchers have
an ethical and moral duty to be responsible for the subjects, and should be
accountable to a higher body if subjects are harmed as a result of participating
in the trial.
The first internationally recognised guideline was the Nuremberg Code1
developed in 1948 after several German clinicians and administrators were
prosecuted for conducting experiments on concentration camp prisoners
without their consent (the Nuremberg Trials). Many prisoners suffered great
pain, died or were permanently disabled. The Nuremberg Code formed the
basis of the Declaration of Helsinki, developed by the World Medical Association (1964).2 After several revisions it now consists of 32 paragraphs that
specify the ethical principles associated with conducting medical research
studies of human subjects. Significant principles include:
• Informed consent must be given
• There should be prior research from animal studies
• The risks of participating in a trial should be justified by the possible benefits
• Research should be conducted by qualified health professionals
• Physical and mental harm should be avoided.
Although the Declaration is not legally binding in international law, all clinical trial protocols should state that the study has followed it. The principles
have influenced legislation and regulations worldwide. For example, both
the Nuremberg Code and the Declaration of Helsinki are the basis for the
Code of Federal Regulations (Title 45, Volume 46),3 issued by the US Department of Health and Human Services (DHHS) which governs federally-funded
research in the US.
11.2 International Conference on Harmonisation (ICH)
The ICH guidelines4 were developed in 1996 to harmonise the requirements
for registering medicines in Europe, Japan and the United States. They are
internationally recognised. As well as ensuring the safety of subjects, they allow clinical trial evidence from one country to be accepted by another, reducing many duplicate evaluations of the same treatment. The general principles of
ICH expand on the Declaration of Helsinki, providing more details on the
design, conduct and statistical analyses of clinical trials. ICH is divided into
four major categories:
Q. Quality: Provides details on the chemical and pharmaceutical quality of
the drug, such as stability, validation and impurity testing, and guidelines for
Good Manufacturing Practice.
S. Safety: Provides details of the safety of the medicinal product, including
toxicology and reproductive toxicology, and carcinogenicity and genotoxicity
testing. It relates to in vitro and in vivo preclinical studies.
E. Efficacy: The largest section and one that is applicable to most clinical trials. It provides details of 13 core principles of Good Clinical Practice covering
trial design, conduct, analysis and adverse event reporting (Box 11.1).
M. Multi-disciplinary: This section covers issues that do not fit into the other three categories, including standardised medical coding for adverse event reporting, and the timing of pre-clinical studies in relation to clinical development intended to support drug registration.

Box 11.1 13 Core principles of ICH GCP guidelines for clinical trials5
1. Clinical trials should be conducted in accordance with the ethical principles of the Declaration of Helsinki, and consistent with Good Clinical Practice and the appropriate regulatory requirement(s).
2. A trial should only be conducted if the potential risks and inconveniences are outweighed by the expected benefit for the trial subject and society.
3. The rights, safety and well-being of trial subjects are the most important considerations and should prevail over the interests of science and society.
4. Non-clinical and clinical information about a new intervention (especially an investigational medicinal product) should be used to justify the proposed trial.
5. A clinical trial should be scientifically sound, and described in a clear and sufficiently detailed protocol.
6. A proposed trial and its protocol must have approval from an independent ethics committee. Researchers should follow the protocol when conducting the trial.
7. Trial subjects should be the responsibility of a qualified clinician (or dentist), who will make decisions about the medical care.
8. All researchers involved in conducting a trial should be qualified by education, training and experience relevant to their tasks.
9. All human subjects should give informed consent before they participate in a trial.
10. Clinical trial information should be recorded, handled and stored in a way that allows its accurate reporting, interpretation and verification.
11. Data should be kept confidential and protected, particularly when it identifies a particular subject. The regulations that govern privacy and confidentiality should be followed, where required.
12. Investigational medicinal products should be manufactured, handled and stored in accordance with Good Manufacturing Practice and used as specified in the trial protocol.
13. Systems for assuring the quality of the trial conduct and data should be in place.
11.3 Good Clinical Practice (GCP) and the EU Clinical
Trials Directives
Good Clinical Practice (GCP) is a detailed set of recommendations intended
to standardise clinical trial conduct. It defines the roles and responsibilities
of trial staff, while affording an appropriate level of protection to subjects.
ICH6 provides the international GCP standard, although other organisations
have developed their own similar guidelines. The extent to which ICH GCP
is implemented in different countries and by different researchers has been
variable. This led the European Union to develop the EU Clinical Trials
Directive (2001/20/EC) and associated GCP Directive (2005/28/EC) – a legal
framework for clinical trial research among its member states. Some of the following sections refer to the Directives because they are detailed, cover many
countries and are legally binding. However, the EU Directives, ICH GCP, and
other regulations and guidelines in non-EU countries, such as those from the
US FDA,7 have much in common, so this section is applicable to researchers
in all countries.
The Directives help standardise trial conduct across the EU and the European Economic Area (including Norway, Iceland and Liechtenstein), and are
part of European law. They cover all clinical trials involving one or more
investigational medicinal products (IMPs), and phases I to IV, but not observational studies, or interventional trials investigating a medical device, surgical
technique or change in behaviour or lifestyle, though elements may be of use
as examples of good practice.# Although there are established definitions of a
clinical trial (see page 2), the EU Directives use the following terminology:8
A clinical trial is an investigation in human subjects which is intended to
discover or verify the clinical, pharmacological and/or other pharmacodynamic effects of one or more medicinal products, identify any adverse reactions or study the absorption, distribution, metabolism and excretion, with
the object of ascertaining the safety and/or efficacy of those products. This
definition includes pharmacokinetic studies.
Each EU country implemented the Directives into its own legislative system. This is in addition to other laws that may already be in place, such
as those associated with clinical trials of medical devices, using human tissue for research and data protection. One of the key consequences is that
all trials must have a named sponsor (see Box 10.2, page 160). There are 24
‘Articles’, and EU Member States and sponsors have a legal obligation to meet
them.9 They cover five broad categories (Box 11.2). Some are presented in
Chapter 10.
Box 11.2 Obligations covered by the EU Clinical Trials Directives
• Protect the safety and well-being of clinical trial subjects, with special reference to children and vulnerable adults
• Provide procedures to give regulatory approval before a trial starts recruiting
• Provide procedures for an independent ethics committee to review and 'approve' the trial protocol and any documentation meant for trial subjects, before a trial starts recruiting and during recruitment
• Provide procedures for reporting and processing adverse events
• Specify standards for the manufacture, importing and labelling of IMPs.
# Individual countries may have other regulations that cover medical devices, surgical trials
and trials involving exposures such as radiotherapy.
Protection of clinical trial subjects
All subjects should be protected against harm caused by being in the trial
(Box 11.3). Eligible subjects must be given enough information allowing them
to decide freely whether they wish to participate. This is done verbally with a health professional involved in the trial, and through written information (the patient information sheet, see page 161). If the subject decides to proceed, he/she and the health professional must sign a consent form.
Box 11.3 GCP requirements for protecting clinical trial subjects
• The expected benefits to patients or society outweigh the possible risk of harm
• The physical and mental well-being of the subject is safeguarded
• Informed consent is obtained from every trial subject or legal representative
• Subjects can withdraw from the trial at any time
• Medical care is the responsibility of a clinically qualified person (doctor or dentist)
• Subjects have a point of contact for further information about the trial at any time
• Insurance or indemnity provision must be in place to cover the liability of the investigator and sponsor.
The Directives make special reference to children (usually aged under 16
years), those who are chronically ill, the elderly, prisoners and vulnerable
adults, such as those who are incapacitated, for example, mentally disabled or
unconscious. This is of particular relevance where a subject is unable to give
informed consent, so consent must be sought from a legal representative, i.e. someone who has a personal relationship with the subject but is not involved
in the trial. For children, this is one or both of the parents or other legally
appointed guardian, and in incapacitated adults this could be the spouse or
next-of-kin.
The legal representative must be given information about the trial before
providing consent. In addition, a child’s view must be sought where possible,
using a specially developed patient information sheet appropriate for his/her
level of understanding (such as lots of simple pictures), or discussion with the
health professional and parent. Incapacitated adults who later become mentally competent are able to withdraw. Whatever the method of consent used in
these unusual circumstances, an ethics committee with appropriate expertise
must have approved the protocol and method of recruitment.
When trials are based on children or incapacitated adults there must be a
clear need for the research, the number of subjects should be as small as possible to address the main objective reliably, and efforts are needed to minimise
pain, discomfort, fear and other harm.
Subjects can withdraw from the trial at any time, by discontinuing the trial
treatment, or not attending clinic visits or undergoing any other assessment.
Researchers can generally still use data on that subject up to the point when
they withdrew, although it might be useful to make this clear in the patient
information sheet and consent form. However, subjects have the right to withdraw any data that concerns them, including tissue samples, if they wish.
Insurance and indemnity
Trial sponsors normally provide insurance for non-negligent harm (physical
or emotional injury or death) caused by participating in the trial, but where
the protocol was followed correctly by site staff. This allows affected subjects
to receive financial compensation from the sponsor’s insurers (the sponsor
would indemnify the recruiting site against such claims, i.e. the site would be
protected). The patient information sheet should contain a statement about the
sponsor’s insurance and who to contact in the event of a claim. This insurance
is different from that in a hospital, which has responsibility for the standard of
care for a trial subject (where relevant). If hospital staff have been negligent,
for example, they gave the wrong trial drug or dose, compensation should
be met by the insurers of their employer (for negligent harm), and not the
trial sponsor (the site would indemnify the sponsor against such claims, i.e.
the sponsor would be protected). Defective drugs or medical devices should
be the responsibility of the manufacturer. Details of liability and indemnity
should be provided in the clinical trials agreement (see page 172). To make a
negligence claim, four factors need to be established:
• A duty was owed: a legal duty exists whenever a hospital or healthcare provider undertakes care or treatment of a subject
• A duty was breached: the provider failed to conform to the relevant standard of care
• The breach caused an injury
• Damage: there needs to be a financial or emotional loss, otherwise there is no basis for a claim, even if there was negligence.
Furthermore, the sponsor (particularly members of the review committee that assessed the clinical trial, for example, an Institutional Review Board) or the Independent Data Monitoring Committee may sometimes be named as defendants in negligence cases, so these individuals may also need to be indemnified.
Data protection
A central database will store data about participating subjects. This often
means that data will be sent out of the hospital, or other health facility, to the
trial co-ordinating centre. Trial data associated with subjects such as paper
case report forms (CRFs) and the electronic database should be held in a
secure environment.
When people agree to participate in a trial it is on the understanding that
their personal data will remain confidential, as stated in the patient information sheet. Only trial staff, or other authorised parties, such as the regulatory
authority, should have access to it. Also, it should not normally be necessary
to easily link a subject’s trial data with their name or contact details.
Many countries have regulations in place governing data protection, confidentiality, and access to personal data, for example, the EU Data Protection
Directive (95/46/EC). Many trials just use a unique number to identify individuals. However, some data may not be matched to the correct patient if one
digit is written down incorrectly on a CRF. A more reliable method is to add
patient initials, in addition to the unique trial number.
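As an illustration (a minimal sketch in Python, with entirely hypothetical names, fields and data, not a description of any particular trial system), a co-ordinating centre could use the trial number and initials together to flag likely transcription errors when CRF data arrive:

# Minimal sketch: cross-checking incoming CRF records against the
# randomisation list using both the unique trial number and patient
# initials. All names and values here are hypothetical.

randomisation_list = {
    1041: "AB",  # trial number -> initials recorded at randomisation
    1042: "CD",
    1043: "EF",
}

def check_crf_record(trial_number, initials):
    """Flag records whose trial number and initials do not both match."""
    expected = randomisation_list.get(trial_number)
    if expected is None:
        return "unknown trial number: possible transcription error"
    if expected != initials.upper():
        return "initials do not match: query the site before using the data"
    return "match"

print(check_crf_record(1041, "AB"))  # match
print(check_crf_record(1044, "AB"))  # one digit miswritten -> unknown number
print(check_crf_record(1042, "EF"))  # initials mismatch -> query the site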
Sometimes it is necessary to use subject names, with or without contact
details. For example, in a disease prevention trial, quality of life forms may
need to be posted to subjects at home because they would not normally be
attending a clinic regularly. Also, where death or cancer incidence is the main
trial outcome, national registries, as well as the recruiting centre, can be a valuable means of ascertaining these events, especially during long-term follow-up. Clinical trial subjects are ‘flagged’ with the registry, so whenever
they die or have been diagnosed with cancer, the trial co-ordination centre
will be informed automatically. This system is often reliant on using patient
names and date of birth to accurately match an event with a particular trial
subject.
In any trial, collecting personal data should be justified and approved by
the ethics committee, and the subject should give specific consent.
Information for subjects in foreign languages
Any documentation intended for trial subjects (e.g. patient information sheet
and consent form) should be written in the dominant language(s) of the country in which subjects will be recruited. For international trials, the sponsor and
lead investigator in each country should ensure that the appropriate language
is used. Documentation developed by English-speaking researchers should be
translated into another language, and the accuracy of this could be tested by
back-translating to English to compare with the original version. The same
principle applies to any language. However, it may not be worth translating
the documents if few foreign-language subjects are expected. Those who are
unable to sufficiently interpret the trial information may be ineligible, and this
would be specified in the eligibility criteria. Alternatively, the ethics committee and institutional review board may allow a hospital translator or multilingual relative to verbally give the trial information to the subject, which
could be taped so that the subjects have a record.
Regulatory approval and notification
Each country has its own regulatory authority (see Table 11.1) responsible for
allowing a clinical trial to be conducted, usually studies with an IMP (or IND).
The main documents to be supplied by the sponsor are listed in Box 10.6 (page
168). The application requirements differ between countries, and should be
checked with the relevant authority. It is essential to obtain documented evidence of approval before subjects are recruited. Failure to do so can have legal
repercussions.
Table 11.1 Regulatory agencies in selected countries.

European Union*: the Competent Authority in each country, for example:
  Austria: Bundesamt für Sicherheit im Gesundheitswesen (www.ages.at)
  Belgium: Directoraat generaal Geneesmiddelen / Direction générale Médicaments (www.afigp.fgov.be)
  Denmark: Lægemiddelstyrelsen (www.dkma.dk)
  Finland: Lääkelaitos (www.nam.fi)
  France: Agence Française de Sécurité Sanitaire des Produits de Santé (www.afssaps.sante.fr)
  Germany: Bundesministerium für Gesundheit und Soziale Sicherung (www.bmgs.bund.de); Bundesinstitut für Arzneimittel und Medizinprodukte (www.bfarm.de/de/index.php); Paul-Ehrlich-Institut (www.pei.de)
  Greece: National Organisation for Medicines (www.eof.gr)
  Iceland: Lyfjastofnun (www.lyfjastofnun.is)
  Ireland: Irish Medicines Board (www.imb.ie)
  Italy: Ministero della Salute (www.ministerosalute.it)
  Netherlands: Staatstoezicht op de volksgezondheid, Inspectie voor de Gezondheidszorg (www.igz.nl)
  Norway: Statens Legemiddelverk (www.legemiddelverket.no)
  Portugal: Instituto Nacional da Farmácia e do Medicamento (www.infarmed.pt)
  Spain: Agencia española del medicamento (www.agemed.es)
  Sweden: Läkemedelsverket (www.lakemedelsverket.se)
  United Kingdom: Medicines and Healthcare products Regulatory Agency (www.mhra.gov.uk)
  Czech Republic: State Institute for Drug Control (www.sukl.cz); Institute for the State Control of Veterinary Biologicals and Medicaments (www.uskvbl.cz)
  Hungary: National Institute of Pharmacy (www.ogyi.hu); Institute for Veterinary Medicinal Products
  Poland: Office for Medicinal Products (www.urpl.gov.pl)
Australia: Therapeutic Goods Administration (www.tga.gov.au)
Canada: Health Canada (www.hc-sc.gc.ca/)
China: State Food and Drug Administration (eng.sfda.gov.cn/eng/)
India: Drugs Controller General of India, the Central Drugs Standard Control Organization (cdsco.nic.in/)
Japan: Pharmaceutical and Medical Devices Agency (www.pmda.go.jp/index-e.html; www.mhlw.go.jp/english/index.html)
United States: Food and Drug Administration (www.fda.gov)

*For further details on trial set-up in European countries, use the following website (replace ‘France’ with another European country):
http://www.efgcp.be/Downloads/EFGCPReportFiles/Flow%20chart%20France%20(revised)%2007-09-01.pdf
Summaries of regulations that govern clinical trials, regulatory bodies and
ethics review processes in EU countries are found on the European Forum
GCP website (given in the footnote to Table 11.1). During the trial, the regulatory authority or IRB usually needs to be notified of any major change to the trial design
or conduct.
Independent ethics committee assessment
Proposed trials need to be reviewed by an independent ethics committee, comprising a group of experts who are able to assess the trial, including the
protocol and all material intended for subjects (see page 169). Recruitment
should not begin until written confirmation of ethics approval is received.
Sometimes it is necessary for the ethics committee to seek additional expertise when, for example, trials are based on children, or incapacitated or other
vulnerable adults. This is a requirement of the EU Clinical Trials Directives. In
the US, ethics can be assessed by an Institutional Review Board (IRB), which
also reviews the scientific merit of the proposed study, protocol and other documentation such as the Investigator’s Brochure.
Procedures for reporting and processing adverse events
There should be a system for identifying and reporting adverse events (see
page 181). The regulations in most countries are associated with IMP trials,
but they may also apply to other interventions, such as medical devices. Many
countries have similar procedures for classifying adverse events according to severity and whether or not they were expected, and similar timelines for reporting them to
the regulatory authority. It is good practice to collect information on safety for
any trial. Even if no adverse events were observed, the final report is strengthened by stating that an attempt was made to collect this data.
In the EU there is now the EudraVigilance database10 containing safety
information about all IMPs used in EU clinical trials, based on SUSAR
reporting and annual safety reports. The database allows this information to
be exchanged more easily between countries that use it.
Specify standards for the manufacturing, importing and labelling of
IMPs (INDs)
Sponsors of IMP clinical trials must ensure that the trial drugs are manufactured to a high standard, and stored and labelled correctly (see page 170), in
accordance with the internationally recognised guidelines for Good Manufacturing Practice (GMP), a set of standards for the management of manufacturing and quality control of medicinal products.11
In the EU, there is a legal requirement for IMP trials to be conducted in
accordance with GMP (GMP Directive 91/356/EEC). Licensed products are
released in accordance with their marketing authorisation. For unlicensed
products, or licensed drugs that are manipulated in any way (including their
packaging), at least one qualified person (QP) should have responsibility for
releasing them to hospitals or subjects, and for maintaining records. For drugs
manufactured in or imported into the EU, only one QP is required to ‘sign off’
distribution throughout Europe.
11.4 Independent audit or inspection of clinical trials
Many countries have a system for inspecting clinical trials facilities, i.e. the
offices and working practices of the sponsor, the trial co-ordinating centre,
one or more of the recruiting sites, and drug manufacturing facilities. This is a
legal requirement of the EU Clinical Trials Directives, and can be made before,
during or after a trial is conducted, arising from a pre-planned inspection or
triggered because of an unexpected and urgent serious concern. A single trial
or several trials can be inspected during a visit.
An inspection team attends in person. The inspectors assess compliance with the national regulations, ensuring that:
• The necessary regulatory and ethical approvals, and signed agreements, were obtained
• The trial documentation (e.g. Trial Master File, see page 175) is available, complete and up-to-date
• GCP guidelines are followed adequately
• Systems are in place for monitoring compliance with the trial protocol
• There are clear systems for monitoring safety, and serious adverse events are reported on time.
The inspectors interview relevant staff and produce a report detailing any
problems found. If there are serious issues with the trial, especially if they
significantly affect the safety of subjects, inspectors have the authority to suspend the trial.
11.5 Regulations surrounding research in
special populations
The EU, US and several other countries have regulations for research in
special populations. For example, there are 5,000 to 8,000 distinct rare diseases
that affect 6–8% of the US population. They are known as orphan diseases
and, until recently, people suffering from such diseases had little recourse
available to them. This was because pharmaceutical companies did not find
it profitable to spend the money needed to research and develop drugs
in these areas. The governments of some countries now provide incentives to companies to encourage the development of these drugs. For example, the regulatory authority may grant a company a period of market exclusivity (7 years in the US, or 10 years in the EU), during which time the company is assured of sales of the drug without competition, provided certain conditions are met. Companies now have an incentive to invest
money in researching rare diseases, and this includes many biotechnology
companies.
The EU and FDA also have laws and regulations surrounding the research
and development of drugs for use in children (generally aged 0 to 17). Regulation EC No. 1901/2006, the ‘Paediatric Regulation’, is designed to better
protect the health of children in EU trials.
11.6 Non-EU countries
Many aspects of the regulations and guidelines are similar between countries.
United States
Two principal sets of regulations govern clinical trials in the US: those of
the FDA and those of the Department of Health and Human Services. The
FDA regulations are the most commonly used; the FDA is the national
regulatory agency in the US and provides extensive documentation for
researchers on its website.
A trial drug is called an Investigational New Drug (IND). A sponsor must
file an application with the FDA at least 30 days before initiating a trial that
evaluates a new drug for the first time in humans. It contains quality and
safety data about the drug from animal and laboratory studies to give assurance that it can be used safely when administered in accordance with the
trial protocol. An IND is considered approved unless the FDA objects within
the 30-day period. The sponsor must submit annual reports on the status of
the trial. Subsequent studies can be conducted with the same IND, provided
the sponsor submits all the required paperwork to the FDA.
There are several laws, including the Food, Drug and Cosmetic Act and
those listed in Title 21 of the Code of Federal Regulations.12 Procedures
for trial set-up and conduct follow ICH GCP closely, and therefore have been
largely covered above. Trials are reviewed by an Institutional Review Board
(IRB) from each centre where subjects will be recruited. Board members cannot be part of the research team. Some central IRBs cover several sites. The IRB
reviews the protocol, investigator’s brochure and documentation for subjects,
and the ethical considerations.
Sponsors should be responsible for quality assurance of the IND, including
quality control and distribution. Safety reporting is similar to European trials,
in which serious adverse event reports and IND Annual Reports are sent to
the FDA, and the timelines are similar to those in Europe (see page 181). Major
changes to the protocol or to the other information submitted for the IND, and
addition of new investigators participating in the study must also be reported
to the FDA by filing timely amendments.
Inspections are part of the Bioresearch Monitoring Program. Although the
FDA cannot enforce its regulations outside of the US, it can and does penalise
sponsors who wish to obtain FDA approval of a marketing application if they
have used non-compliant clinical sites outside the US.
In order to identify and minimise investigator bias associated with the trial,
sponsors submitting a marketing application for a medicinal product must
provide information on compensation to, and financial interests of, all the
investigators who participated in the clinical trial used in the application. This
requires applicants to confirm that the investigators have no financial interests
in the drug or the sponsoring company, or to disclose any financial arrangements. If the sponsor does not provide this information, the FDA can refuse
to file the application.
Under FDA legislation (FDA Amendments Act of 2007) clinical trial results
must be posted on www.clinicaltrials.gov. Previously, only information on
the study design and recruitment was posted on the website, but in the
interest of public disclosure of both positive and negative data, the FDA now
requires the results to be publicly available.
Canada
Trials that involve a pharmaceutical, biological, or radiopharmaceutical drug
must obtain approval via a Clinical Trials Application, from Health Canada,
the regulatory authority. The law that governs the use of clinical trial drugs
is the Controlled Drugs and Substances Act. The system for trial set-up and
conduct is similar to that in the United States and Europe.
The Health Products and Food Branch Inspectorate (HPFBI) aims to
inspect all institutions that conduct clinical trials. Further details about
trial set-up and conduct in Canada can be obtained from their regulatory
website.13
Japan
The medicinal products market in Japan is among the largest in the world,
and there is a long history of clinical trial research. There was once a view that
Japanese subjects reacted to drugs in a different way from other nationalities,
so there was a tendency to repeat trials conducted elsewhere. However, with
ICH GCP, there is now a high degree of standardisation with the US and EU,
and the original guidelines for trial set-up and conduct have been considerably revised.
The national regulatory agency is the Pharmaceutical and Medical Devices
Agency (PMDA), and a key regulation is the Pharmaceutical Affairs Law
(1996). ICH GCP compliance is a legal requirement. Researchers (or their
sponsor) must submit a Clinical Trial Plan Notification to the PMDA before
recruitment begins. Sponsors are encouraged to have an in-house study
review board to evaluate the proposed trial. However, the trial must be
approved, and reviewed annually by an IRB for each recruiting site. Sometimes, several sites share an IRB. During the trial, suspected unexpected
serious adverse reactions (SUSARs) must be reported to the Ministry of
Health, Labour and Welfare (MHLW), in a similar way to European trials (see
page 181). Audits and inspections are the responsibility of the sponsor. Further details are found on the websites in Table 11.1 and reference 14.
Australia
Australia was considered when the ICH GCP guidelines were first developed,
so elements of trial set-up and conduct are similar. The regulatory body is
the Therapeutic Goods Administration (TGA). The laws that govern clinical trials include the Therapeutic Goods Act (1989), the Therapeutic Goods
Regulations (1990) and the Therapeutic Goods (Medical Devices) Regulations
(2002). IMPs and investigational medical devices are both referred to as ‘unapproved therapeutic goods’, and include new and unlicensed drugs, or those
that are already licensed (and appear on the Australian Register of Therapeutic Goods) but will be used in a ‘separate and distinct’ way.
Unlicensed treatments must be granted a Clinical Trial Notification (CTN)
or Clinical Trial Exemption (CTX) before they can be used in a trial. All trials require ethics approval by one of the human research ethics committees
(HRECs), and the Australian Health Ethics Committee of the National Health
and Medical Research Council must be informed of trials of unapproved therapeutic goods. In drug or medical device manufacturing, import, labelling
and testing, the sponsor must provide certificates of analysis and ensure compliance with Good Manufacturing Practice.
The reporting of serious adverse events to the regulatory body (TGA) is
practically the same as in Europe (see page 181), including annual safety
reports. The TGA can also inspect any organisation involved in trial conduct.
Further details can be obtained from websites.15,16
China
With over 1.3 billion people, China is a potentially large source of trial subjects
and there are several ‘mega trials’ being conducted, based on many thousands
of people. The cost of conducting trials is relatively low and large numbers of patients can be recruited quite quickly, so the number of trials is increasing, particularly through international collaboration. However, clinical trial
research is still relatively new in China and local staff need to become familiar
with conducting trials to international standards. China has its own guidelines for GCP, based on ICH GCP. Clinical trials of IMPs and medical devices
are regulated by the State Food and Drug Administration (SFDA), and they
need to comply with the Drug Administration Law (2001) and the Drug Registration Procedure (2002).
The process for trial set-up has been streamlined, and there are clear rules
for assuring the rights and interests of subjects, such as obtaining signed consent (directly or from an authorised representative). Only sites that have GCP
certification are allowed to participate in trials. Clinical trial applications are
submitted to the SFDA, which reviews aspects such as inspection of sites,
assessment of the trial drugs or medical devices, and ethics approval. The
Centre for Drug Evaluation (CDE) makes a technical evaluation of the drugs.
The entire process may take at least three months, but trials cannot start until
approval is received from the SFDA. Reporting of serious adverse events is
similar to elsewhere (see website in Table 11.1).
India
India, like China, has a large population and can conduct trials relatively
cheaply. The regulatory body is the Drugs Controller General of India (DCGI)
and trials of IMPs and medical devices are governed by the Drugs and Cosmetics Act (revised Schedule Y 2003). Researchers are expected largely to
comply with guidelines for trial set-up and conduct from the US FDA. The
DCGI can grant permission to conduct a trial without prior ethics committee approval, but researchers are requested not to recruit subjects until this
is obtained. During the trial, serious adverse reactions need to be reported
to the DCGI and the ethics committee within 14 days of discovery. Continuing approval is conditional on yearly reports. When reviewing the submitted protocol, the DCGI may seek advice from the Indian Council of Medical
Research.17 During the trial, any changes to the protocol must be reported to
the DCGI and permission sought for major changes. The Indian regulatory
agency is preparing to streamline the clinical research process and, with the
help of US FDA, is planning to set up a Central Drug Authority in the near
future. Further details can be obtained from one of the national websites,18
and the website in Table 11.1.
11.7 Summary
There are key regulatory issues associated with trial set-up and conduct:
• Informed consent
• Good Clinical Practice
• Good Manufacturing Practice
• National regulatory approval (review of the trial protocol and investigator’s brochure)
• Institutional and/or ethics committee approval
• Monitoring and reporting adverse events (serious adverse events that are judged to be caused by the trial treatment are reported to the regulatory authority)
• Provision for compensation to trial subjects if they suffer harm because of being in the trial.
References
1. http://ohsr.od.nih.gov/guidelines/nuremberg.html
2. http://www.wma.net/e/policy/b3.htm
3. http://www.access.gpo.gov/nara/cfr/waisidx_00/45cfr46_00.html
4. http://www.ich.org/cache/compo/276-254-1.html
5. http://www.ich.org/LOB/media/MEDIA482.pdf (see page 8).
6. www.ich.org
7. http://www.fda.gov/oc/gcp/default.htm
8. http://www.mhra.gov.uk/Howweregulate/Medicines/Licensingofmedicines/Clinicaltrials/Isaclinicaltrialauthorisationrequired/index.htm
9. http://ec.europa.eu/enterprise/pharmaceuticals/eudralex/vol-1/dir_2001_20/dir_2001_20_en.pdf
10. http://eudravigilance.emea.europa.eu/veterinary/evDbms01.asp
11. http://www.ich.org/LOB/media/MEDIA433.pdf
12. http://www.access.gpo.gov/nara/cfr/cfr-table-search.html#page1
13. http://www.hc-sc.gc.ca/dhp-mps/prodpharma/applic-demande/guide-ld/clini/index_e.html
14. Griffin JP, O’Grady J (Eds). The Textbook of Pharmaceutical Medicine. 5th edn. BMJ Books,
Blackwell Publishing, 2006.
15. http://www.qctn.com.au/ConductingTrials/HowtostartatrialinAustralia/
tabid/67/Default.aspx
16. http://www.qctn.com.au/Portals/0/Australian%20Clinical%20Trials%20Handbook.pdf
17. http://www.icmr.nic.in/
18. http://www.iscr.org/ClinicalTrialsRegulation.aspx
Reading list
Altman D. Practical Statistics for Medical Research. CRC Press, 1990.
Altman D, Machin D, Bryant TN, Gardner MJ. Statistics With Confidence. 2nd edn. BMJ Books,
2000.
Bland JM. An Introduction to Medical Statistics. 3rd edn. Oxford University Press, 2000.
Clive C. Handbook of SOPs for Good Clinical Practice. 2nd edn. Interpharm Press Inc., 2004.
Ellenberg S, Fleming TR, DeMets DL. Data Monitoring Committees in Clinical Trials: A Practical
Perspective (Statistics in Practice). John Wiley & Sons, Ltd, 2002.
Friedman L, Furberg CD, DeMets DL. Fundamentals of Clinical Trials. 3rd Rev. edn. Springer-Verlag New York Inc., 2006.
Girling D, Parmar M, Stenning S, Stephens R, Stewart L. Clinical Trials in Cancer: Principles
and Practice. Oxford University Press, 2003.
Griffin JP, O’Grady J (Eds). The Textbook of Pharmaceutical Medicine. 5th edn. BMJ Books, Blackwell Publishing, 2006.
Guyatt G, Rennie D, Meade M, Cook D. Users’ Guides to the Medical Literature: A Manual for
Evidence-Based Clinical Practice. 2nd edn. McGraw-Hill Medical, 2008.
Kirkwood B, Sterne J. Medical Statistics. 2nd edn. Blackwell, 2003.
Machin D, Day S, Green S, Everitt B, George S (Eds). Textbook of Clinical Trials. John Wiley &
Sons, Ltd, 2004.
Petrie A, Sabin C. Medical Statistics at a Glance. 2nd edn. BMJ Books, 2005.
Pocock S. Clinical Trials: A Practical Approach. John Wiley & Sons, Ltd, 1983.
Sackett DL, Straus SE, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine:
How to Practice and Teach EBM. 2nd Rev. edn. Churchill Livingstone, 2000.
Statistical formulae for calculating
some 95% confidence intervals
95% confidence interval = effect size ± 1.96 × standard error of the effect size
Single-arm phase II trial
Counting people (single proportion)
Number of responses to treatment = 28
Number of subjects (N) = 50
Observed proportion (P) = 28/50 = 0.56 (or 56%)
Standard error of the true proportion (SE) = √{[P × (1 − P)]/N} = √[(0.56 × 0.44)/50] = 0.07
95% CI = P ± 1.96 × SE = 0.56 ± 1.96 × 0.07 = 0.42 to 0.70 (or 42 to 70%)
For small trials (e.g. N < 30) ‘exact’ methods provide a more accurate 95%
confidence interval (Geigy Scientific Tables. Introduction to Statistics, Statistics
Tables and Mathematical Formulae, 8th edn. Ciba Geigy, 1982).
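This calculation is easy to reproduce in code. A minimal Python sketch of the normal-approximation formula above (not an ‘exact’ method):

import math

# 95% CI for a single proportion (normal approximation), using the
# worked example above: 28 responses out of 50 subjects.
responses, n = 28, 50
p = responses / n                    # observed proportion: 0.56
se = math.sqrt(p * (1 - p) / n)      # standard error: about 0.07
lower, upper = p - 1.96 * se, p + 1.96 * se
print(f"P = {p:.2f}, 95% CI {lower:.2f} to {upper:.2f}")  # 0.42 to 0.70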
Taking measurements on people (single mean value)
Mean value (x) = 34 mm (VAS score)
Standard deviation (s) = 18 mm
Number of subjects (N) = 40
Standard error (SE) = s/√N = 18/√40 = 2.8 mm
95% CI = mean ± 1.96 × SE = 34 ± 1.96 × 2.8 = 34 ± 5.5 = 28 to 40 mm
For small trials (N < 30), a different multiplier to 1.96 is used. It comes from
the ‘t-distribution’, and gets larger as the sample size gets smaller.
The multiplier of 1.96 is associated with a two-sided confidence interval.
For a one-sided limit a value of 1.645 could be used, but only the lower
or upper limit is needed, depending on whether the proportion or mean
associated with the new therapy should be greater or smaller than standard
treatments to indicate improvement.
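A minimal Python sketch of the single-mean calculation above, including the t-distribution multiplier for small samples (scipy is assumed to be available, and is used only for the t quantile):

import math
from scipy import stats  # assumed available; used only for the t multiplier

# 95% CI for a single mean, using the worked example above:
# mean VAS score 34 mm, SD 18 mm, N = 40 subjects.
mean, sd, n = 34.0, 18.0, 40
se = sd / math.sqrt(n)               # standard error: about 2.8 mm

z = 1.96                             # large-sample multiplier
t = stats.t.ppf(0.975, df=n - 1)     # t multiplier; appropriate for small N
print(f"95% CI (z): {mean - z*se:.1f} to {mean + z*se:.1f} mm")  # about 28 to 40
print(f"95% CI (t): {mean - t*se:.1f} to {mean + t*se:.1f} mm")  # slightly wider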
Randomised phase II or III trial with two groups
Counting people (risk difference or relative risk)
Example is serological flu (Box 7.1)
P1 = r1 /N1 = 41/927 = 0.044
P2 = r2 /N2 = 80/911 = 0.088
For risk difference
Observed risk difference = P1 − P2 = −0.044 (−4.4%)
Standard error (SE) = √{[P1 × (1 − P1)]/N1 + [P2 × (1 − P2)]/N2} = 0.01155
95% CI = difference ± 1.96 × SE = −0.044 ± 1.96 × 0.01155
= −0.066 to − 0.021 = −6.6% to − 2.1%
For relative risk (RR)
Observed RR = P1 ÷ P2 = 0.5
Take natural logarithm (base e) = loge (0.5) = −0.693
Standard error of the log RR (SE) = √(1/r1 + 1/r2 − 1/N1 − 1/N2) = 0.186
95% CI for the log RR = log RR ± 1.96 × SE
= −0.693 ± 1.96 × 0.186 = −1.058 to − 0.328
Transform back (take exponential) = 0.35 to 0.72 (i.e. e^−1.058 to e^−0.328)
(‘e’ is the natural number 2.71828)
Converted to a percentage change in risk, 95% CI is 28 to 65% reduction in risk
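Both intervals can be scripted directly from the formulae above; a minimal Python sketch using the flu example:

import math

# 95% CIs for a risk difference and a relative risk: 41/927 vs 80/911 events.
r1, n1 = 41, 927
r2, n2 = 80, 911
p1, p2 = r1 / n1, r2 / n2

# Risk difference
rd = p1 - p2
se_rd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(f"RD = {rd:.3f}, 95% CI {rd - 1.96*se_rd:.3f} to {rd + 1.96*se_rd:.3f}")

# Relative risk: the CI is computed on the log scale, then transformed back
rr = p1 / p2
se_log_rr = math.sqrt(1/r1 + 1/r2 - 1/n1 - 1/n2)
lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")  # close to 0.35 to 0.72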
Taking measurements on people (difference between
two mean values)
Example is the Atkins diet (Box 7.4)
Change in weight loss at three months
Atkins diet:
Conventional diet:
N1 = 33 Mean1 = −6.8 kg
N2 = 30 Mean2 = −2.7 kg
SD1 = 5.0 kg
SD2 = 3.7 kg
Difference between the two means = Mean1 − Mean2 = −6.8 − (−2.7) = −4.1 kg
Standard error of the mean difference (SE) = √(SD1²/N1 + SD2²/N2) = √(5.0²/33 + 3.7²/30) = 1.1
95% CI = mean difference ± 1.96 × SE
= −4.1 ± 1.96 × 1.1 = −6.3 to − 1.9 kg
1.96 is used when each trial group has at least, say, 30 subjects. For smaller studies, a larger multiplier from the t-distribution is used, and there is a different formula depending on whether the standard deviations are similar between the groups.
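A minimal Python sketch of the mean-difference calculation above, using the diet example and the large-sample multiplier of 1.96:

import math

# 95% CI for a difference between two means (change in weight at three months).
n1, mean1, sd1 = 33, -6.8, 5.0   # Atkins diet
n2, mean2, sd2 = 30, -2.7, 3.7   # conventional diet

diff = mean1 - mean2                          # -4.1 kg
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)     # about 1.1
print(f"Difference = {diff:.1f} kg, "
      f"95% CI {diff - 1.96*se:.1f} to {diff + 1.96*se:.1f} kg")  # -6.3 to -1.9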
Time-to-event data (hazard ratio)
A statistical package should be used to estimate 95% CIs because the calculation for the standard error is not simple. However, if only the median and
number of events in each treatment group are available, there is a simple
method to obtain an approximate estimate of the CI, but only after assuming
that the distribution of the time-to-event measure has an ‘exponential distribution’ (i.e. the event rate is constant over time).
Example is early vs late radiotherapy in treating lung cancer (Spiro et al.,
J Clin Oncol 2006; 24: 3823–3830), and the outcome is time to death:
Early radiotherapy:
Median survival M1 = 13.7 months Number of deaths = E1 = 135
Late radiotherapy:
Median survival M2 = 15.1 months Number of deaths = E2 = 136
Hazard ratio (early vs late) HR = M2/M1 = 15.1/13.7 = 1.10
Standard error of the log hazard ratio (SE) = √(1/E1 + 1/E2) = √(1/135 + 1/136) = 0.1215
95% CI for the log HR = loge HR ± 1.96 × SE
= log(1.10) ± 1.96 × 0.1215 = −0.143 to 0.333
Transform back (take exponential) = 0.87 to 1.40 (i.e. e^−0.143 to e^0.333)
These are close, but not identical, to the results calculated using the raw data:
HR = 1.16, 95% CI 0.91 to 1.47
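The approximation is simple to script; a minimal Python sketch (a rough check only, not a substitute for a proper survival analysis in a statistical package):

import math

# Approximate 95% CI for a hazard ratio from medians and event counts,
# assuming exponentially distributed survival times (constant event rate).
m1, e1 = 13.7, 135   # early radiotherapy: median survival (months), deaths
m2, e2 = 15.1, 136   # late radiotherapy

hr = m2 / m1                              # about 1.10 (early vs late)
se_log_hr = math.sqrt(1 / e1 + 1 / e2)    # about 0.1215
lower = math.exp(math.log(hr) - 1.96 * se_log_hr)
upper = math.exp(math.log(hr) + 1.96 * se_log_hr)
print(f"HR = {hr:.2f}, approximate 95% CI {lower:.2f} to {upper:.2f}")  # 0.87 to 1.40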
Index
Note: Page references in italics refer to Figures; those in bold refer to Tables
absolute risk difference 91–95
acceptance rate 39
adverse drug reaction 181
adverse events 121–2, 121
expected 182
monitoring 181–3
unexpected 182
reporting 181–3, 195
agreements 172–3
allocation bias 13, 88
allocation concealment 13
alpha spending 123
alternatives to clinical trials 4
analysis of covariance 98
analysis of variance (ANOVA) 114
area under the curve 36–7, 147, 152
audit, independent 196
audit trail 177
Australian regulations and guidelines 199
average 21, 98
baseline value 98
Bayesian methods 35, 114
bell-shaped curve 20, 22, 101
between-trial variability 134
bias 5–6, 13
allocation 13, 88
minimisation of 3, 6, 77
binary data (counting people) 19, 114
bioequivalence drug trials 57–58
biological activity (BA) 36
blinding 13–14, 57, 62
Bonferroni correction 115, 148
Canada regulations and guidelines 198
carryover effect (crossover trial) 58, 107
case-control study 2, 6
case report forms (CRFs) 170–2, 177, 179,
192
categorical data (counting people) 19, 114
cause-specific survival 27, 28
censored subject 25
centile plot 23, 24
centiles 22
chief investigator (CI) 160
chi-square test 114
China, regulations and guidelines 199
clinical importance or significance 91,
124–5
clinical trial agreement 172–3
clinical trial application (EU) 167
cluster randomised trial 61, 109
Cochrane Collaboration 131
Cochrane Library 131
Code of Federal Regulations, US 188
cohort study 2
Committee for Proprietary Medicinal
Products (EU) 167
comparison group 12, 92
Competent Authority (CA) 167
composite endpoint 63, 64
confidence interval (CI) 40, 48, 49–52, 50,
95, 95, 99, 100, 104, 111, 115–16, 116
one-sided 40, 49
statistical formulae for calculating
205–7
two-sided 49
conflict of interests 185
confounding 5–6, 13, 77
consent form 161, 164–5, 167, 170
CONSORT flow chart 184
continuous data (taking measurements
on people) 19, 114
continuous reassessment method 35
contract research organisation (CRO)
161
control (comparison) group 10, 12, 92
cost benefit analysis 153
cost effectiveness analysis 149, 151–2, 151
cost minimisation analysis 153
cost utility analysis 152–3
Cox’s regression 114
cross-sectional study 2
crossover trial (paired data) 58, 59, 70, 84,
106–7
cumulative meta-analysis 137, 138
Data Monitoring Committee 76, 179
data protection 192–3
database 172, 177–8
database lock 181
Declaration of Helsinki 187–8
difference between two means (mean
difference) 66, 97, 132, 206–7
difference between two medians 101
disease-free survival (DFS) 28, 102
disease progression 27
disease recurrence 27
disease (cause-)specific survival 27–28, 105–6
dose-limiting toxicity (DLT) 32, 34–5,
37
double blinding 14, 57, 178
drop-outs (see withdrawal)
drug supply agreement 173
dynamic allocation 82
economic evaluation
definition 149–50
types of 150–1
effect size 66, 69, 91, 95, 98, 102, 103, 115, 134
electronic data capture (EDC) 172
electronic database 177, 192
eligibility checklist 85, 176
eligibility list 11, 85
emergency unblinding 182
endpoints (see outcome measures)
equivalence limit or range 69
equivalence trials 58, 68–71, 74, 108–9,
118
Essential Documents 175
ethical approval 15, 169–170, 195
EU (European Union) Clinical Trials Directives 188–96, 195, 196
2001/20/EC 189
EU Committee for Proprietary Medicinal
Products 167
EU Data Protection Directive (95/46/EC)
193
EU GCP Directive (2005/28/EC) 189
EU GMP Directive 91/356/EEC 195
EU Regulation EC No. 1901/2006
(‘Pediatric Regulation’) 197
Eudra Vigilance Database 195
EudraCT number 165, 168
European Medicines Agency (EMEA)
165
EuroQol-5D (EQ-5D) 142, 143
event-free survival 27, 28
event rates (see survival rates)
excess risk 93–4
exclusion criteria (see inclusion and
exclusion criteria)
factorial trial 60, 70, 107, 114
feasibility (pilot) studies 39
Fibonacci sequence 33, 33, 35
first in man studies (see phase I trials)
Fisher’s exact test 114
fixed effects model 134
Food and Drug Administration (US) 33,
65, 149, 167, 190, 197
forest plot 119, 120, 132–133
frequency distribution 22, 22
funding 157–9
funnel plot 136
futility 122–4
foreign languages, information in 193
Gaussian distribution curve 20, 22
geometric mean 23
Good Clinical Practice (GCP) 188–96
Good Manufacturing Practice (GMP) 170,
173, 174, 195
hazard ratio 66, 102–4, 132, 153
health economic evaluation 149–53
definition 149
types 150–3
Health Products and Food Branch
Inspectorate (HPFBI) (Canada) 198
health-related quality of life 141–54
definition 141
measuring 142–3
validated 142
analysis 144–8
heterogeneity 119, 134–5, 135
histogram 22
historical (non-randomised) controls 2, 4,
7
Investigational Medicinal Product (IMP)
1, 2, 159, 165
standards for the manufacturing,
importing and labelling 195–6
Investigational Medicinal Product
Dossier (IMPD) 168–9, 182
Investigational New Drug (IND) 1–2, 159,
197
application 167, 168, 169
standards for the manufacturing,
importing and labelling 195–6
investigator 160
Investigator’s Brochure (IB) 167, 168, 182
I² value (in meta-analyses) 135, 137
imputation 119, 148
incidence (see also risk) 25
inclusion and exclusion criteria 11–12, 12,
85
incremental cost-effectiveness ratio 151
indemnity 192
independent data monitoring committee
(IDMC) 179–80
independent ethics committee 165,
169–70, 195
information for subjects in foreign
languages 193
insurance 192
independent audit or inspection of
clinical trials 196
India, regulations and guidelines in 200
individual patient data (IPD) 130
inspection of clinical trials 196
institutional approval 173–4
Institutional Review Board (IRB) 174–5,
195, 197
intention-to-treat (ITT) analysis 47,
116–18, 148
interaction between treatments 107, 108,
114
interim analyses 74, 122–4
International Conference on
Harmonisation (ICH) 188
International Standard Randomised
Controlled Trial Number
(ISRCTN) 166
interquartile range 21, 22, 23
intervention 1
intra-class correlation 109
Japan
Pharmaceutical and Medical Devices
Agency (PMDA) (Japan) 167, 198
regulations and guidelines 198
Kaplan–Meier plot 25–7, 26, 53, 53, 102,
103
Kruskal–Wallis ANOVA 114
legal representative 161, 191
licence 1
life-table 25, 25
log rank test 114
lost to follow-up (see withdrawals)
Mann-Whitney U test 114
manufacturing authorisation 170
marketing authorisation 1
material transfer agreement 173
maximum administered dose 32
maximum allowable difference (MAD)
66–67, 69, 108–9, 111
maximum tolerated dose (MTD) 32,
34–36
McNemar’s test 114
mean 21–3, 52
mean difference 66, 98, 132, 206–7
measure of central tendency 21
median 21, 22, 23, 24, 45
difference between two 101
median survival 25, 53, 104
meta-analysis 130, 132–134
minimisation 79, 81, 82–3, 82, 85
minimum biologically active dose
(MBAD) 36
mixed modelling 111, 147
mode 21
monitoring safety 181–183, 195
monitoring of sites 179
multiple endpoints 65, 74, 115
multivariate linear regression 98, 114
multivariate logistic regression 114
natural variation 4, 12, 48, 70
negligence and negligent harm
192
no-effect value 92, 95, 115–16, 116
non-compliers 116, 117
non-inferiority trials 58, 68–71, 74, 108–9,
118
non-negligent harm 192
non-parametric methods (skewed data)
44, 101, 114
non-randomised controls (see historical
controls)
non-randomised studies (see
observational studies)
Normal (symmetric) distribution 23, 44,
97, 101, 114
Normal distribution curve 20, 22
number needed to harm (NNH) 122
number needed to treat (NNT) 93–4
Nuremberg Code 187, 188
objectives 31, 39, 58, 161
observational (non-randomised) studies
1, 2, 4–6
odds ratio 97, 98, 107
one-sided confidence interval 49
one-sided significance level 43
one-sided test 11, 72
one-tailed p-value 96, 100
outcome measures 17, 32, 42, 61–5
types of 19–20
overall survival 27, 28, 102
p-value 15, 54, 87, 88, 91, 96, 100, 105,
112–15, 127–8
multiple endpoints 115
one-tailed 96, 100
relationship between confidence
intervals, no-effect value and
115–16, 116
statistical methods that produce 114
stopping rule 123–5
two-tailed 96, 100
pack code 86
paired data 58, 59, 114
paired t-test 114
parallel groups (unpaired data) 58, 59, 70,
114
patient information sheet 39, 161, 164–6,
170, 191
patient withdrawals (see withdrawals)
per-protocol analysis 48, 117–18
period effect (in crossover trials) 59, 107
Peto-Haybittle rule 123
Pharmaceutical and Medical Devices
Agency (PMDA) (Japan) 167, 198
pharmacodynamics 36
pharmacokinetics 36–7
pharmacovigilance 181
Pharmacy File 176
phase I trials (first in man studies) 9–10,
18, 19, 31–7
3 + 3 design 34, 34
5/6 design 36
phase II trial 9, 10, 15, 18, 19, 39–55
interpreting and reporting 54–5
outcome measures
based on counting people 48
based on taking measurements on
people 52
based on time-to-event data 53
randomised
with control arm 41
with several intervention arms (pick
the winner) 41
with several intervention arms:
two-stage design 41–2
sample size method 42–7, 44
calculating sample size 43–7, 44
power 43
statistical significance level 43
single-arm 40, 205–6
single-arm two-stage study 40–1
statistical analysis 47–53
stopping early for toxicity 47
surrogate endpoints 42
types 42
phase II/III trial 75–6
phase III trial 9, 10–11, 18, 19, 39,
91–128
allocating individuals or groups of
individuals to trial groups 61
design of 57–76
effect sizes 91, 95, 98, 102, 103, 115
multiple endpoints 65
objectives 57–8, 161
outcome measures
choosing 61–3
composite 63–5
multiple 65
outcome measures based on counting
people 91–7, 206
no-effect value 92–3
relative risk or odds ratio 97–8, 97
relative risk or risk difference 93–4,
94
outcome measures based on taking
measurements on people 97–101
effect sizes with skewed distribution
101
outcome measures based on
time-to-event data 101–6
cause-specific survival curves 105–6,
106
parallel/crossover trials 70
sample size estimation 65–8
expected effect size 66
level of statistical significance
66–7
power 67
sample-size calculation 68–70
examples 70–1, 71
superiority trials 73
sample size descriptions 72
sample size, reasons for increasing
74
types 57–61
see also confidence intervals; p-values
phase IV trials (post-marketing or
surveillance studies) 9, 11
pick the winner design 41
pilot (feasibility) studies 39
pivotal trials 11
placebo 12, 14
placebo effect 14
plasma concentration-time curves 36–7
population 48
post-marketing studies 9, 11
power 43, 65, 67, 68
primary objectives 161
principal investigator 160
probability (centile) plot 23, 24
Product Specification File 170
progression-free survival 28
proportional hazards, assumption of
104
protection of clinical trial subjects 191–2
protocol 31, 161, 162–3
deviation or violation 116, 118
qualified person (QP) 170, 195–6
qualified person (QP) release 170
quality adjusted life year (QALY) 152–3,
152
quality of life (QoL) measurements 142
analysing scores 144, 145, 146
examples 143
interpreting scores 149
missing data 148
repeated assessment, and multiple
comparisons 147–8, 147
random allocation (randomisation) 10
random number list 3, 77, 78, 79, 80–1
random permuted blocks 78–9
randomisation 3, 6, 12–13, 77–88
baseline characteristics 87–8, 87
choice of method 83–4
equal (1:1 randomisation) 83
in practice 85–7
simple 77–9, 84, 85
stratified 75, 79, 80–1, 80, 81, 84–5, 114
unequal 83
randomisation list 85–6, 86
randomised clinical trial (RCT) 2, 4
see also phase III trials
randomised controlled trial (see phase III
trial)
recruiting investigators 160
recruiting sites 174
monitoring 179
reference group (see control group)
regimen 1
registering trials 166
regulations and guidelines 187–200
need for 187–8
research in special populations 196–7
regulatory agencies 194
regulatory approval 159, 167–9, 193–5
reporting and processing adverse events
181, 195
relative risk 66, 91–5, 94, 97, 98, 131–2,
149, 153
converting to percentage change in risk
94
95% confidence interval 95
repeated measures analysis 109–11, 147
reporting clinical trials 54, 183–5
residual (carryover) effect 58, 107
risk 20, 91, 93, 95
risk assessment 174
risk difference 66, 93–95, 94, 103, 104, 132,
149
95% confidence interval 95
risk, percentage change in 93
risk ratio (see relative risk)
risk reduction 93–4
safety 57, 121–2, 121
(see also monitoring safety)
safety measures 180
sample 48, 91
sample size
phase I trial 31
phase II trial 42–7
phase III trial 65–71
screening log 85
secondary objectives 161
selection bias 13, 88
semi-experimental study design 2
serious adverse events (SAE) 182
serious adverse reactions (SAR) 182
service level agreement 173
Short Form 12 or 36 (SF-12 or SF-36) 142,
143, 145, 148
significance level (see statistical
significance)
single-blinding trials 14, 57
site 161
agreement 172–3
assessment 174–5
initiation 174–5
monitoring 179
skewed data 22, 101
small trials 14–15
source data verification (SDV) 179
split-mouth design 58
split-person design 58, 59
sponsor 160, 173–4
square root, data transformation 23, 101
standard deviation 21, 23, 52, 69, 99,
205–6
standard error 48, 51–2, 95, 99–100, 104,
111–2, 116, 129, 133, 205–7
standard operating procedures (SOPs)
178
standardised difference 44, 69
statistical analysis plan (SAP) 177
statistical significance 43, 65, 66–8, 91,
96–7, 112, 115, 123–5
(see also p-values)
statistical test 113, 114
stopping rule 41, 47, 74, 123–4
stopping trials early 74, 122–4, 180
stratification and stratification factors
(see randomisation stratified)
sub-group analysis 114, 119–21, 129
test for interaction 119
subjects (participants) 2
subjective outcome measures 61–2
substantial amendment 176
summary of product characteristics
(SmPC) 168–9
superiority trials 58, 68, 73, 106, 108, 112
surrogate endpoint or markers 11,
17–19
surveillance studies 11
survival analysis 24–9, 53, 101–6
survival curves 26, 53, 103, 105
survival rates 24–5, 53, 105
suspected unexpected serious adverse
reaction (SUSAR) 182
symmetric distribution (see Normal
distribution)
systematic reviews 129–38
definition 130
disease definition, interventions and
outcome measures 135–6
identifying studies 136
interpretation 131–5
meta-analysis 132–4
publication bias 136
published, sources of 130–1
reporting 137
stages 131
study quality 136
technical agreement 173
therapy 1
time-to-event data (see survival analysis)
time-to-treatment failure 28
toxicity (see adverse events and safety)
transforming data 23
trial co-ordination centre 86–7
trial conduct 175–80
trial endpoints (see outcome measures)
trial management group 157
Trial Master File (TMF) 175–6
trial steering group/committee/team 157
true outcomes or endpoints 17–8
two-sided confidence interval 49
two-sided significance level 43
two-sided test 72
two-tailed p-value 96, 100
Type I error 43, 66
Type II error 43, 67
types of clinical trials 9–11
types of outcome measures 19–29
United States, regulations and guidelines
197–8
uptake rate 39
variability (see natural variation)
washout period 58–9
Wilcoxon Matched pairs test 114
withdrawals (patient or subject) 74, 116,
118–19