artbin: Extended sample size for randomized trials with binary outcomes

The Stata Journal (2023)
23, Number 1, pp. 24–52 DOI: 10.1177/1536867X231161971

Ella Marley-Zagar, MRC Clinical Trials Unit, University College London, London, U.K.
Ian R. White, MRC Clinical Trials Unit, University College London, London, U.K.
Patrick Royston, MRC Clinical Trials Unit, University College London, London, U.K.
Friederike M.-S. Barthel, PRA / ICON PLC Germany, Mannheim, Germany
Mahesh K. B. Parmar, MRC Clinical Trials Unit, University College London, London, U.K.
Abdel G. Babiker, MRC Clinical Trials Unit, University College London, London, U.K.

Abstract. We describe the command artbin, which offers various new facilities for
the calculation of sample size for binary outcome variables that are not otherwise
available in Stata. While artbin has been available since 2004, it has not been
previously described in the Stata Journal. artbin has been recently updated
to include new options for different statistical tests, methods and study designs,
improved syntax, and better handling of noninferiority trials. In this article, we
describe the updated version of artbin and detail the various formulas used within
artbin in different settings.
Keywords: st0013_3, artbin, sample size, power, binary outcome, randomized
clinical trial, superiority trial, noninferiority trial
1 Introduction
Sample-size calculation is essential in the design of a randomized clinical trial to ensure
that there is adequate power to evaluate treatment. It is also used in the design of
randomized experiments in other fields such as education, international development
(Attanasio, Kugler, and Meghi 2011), and criminology (Braga et al. 1999). It can also
be used in the design of nonrandomized comparative studies (Quigley et al. 2019).
© 2023 StataCorp LLC st0013_3

In Stata, several standard sample-size calculations are available in the inbuilt power
family. More-advanced sample-size calculations are provided in the Analysis of Resources
for Trials (ART) package (Barthel, Royston, and Babiker 2005; Barthel et al.
2006; Royston and Barthel 2010). ART is primarily aimed at trials with a time-to-event
outcome, but it also includes the command artbin for trials with a binary outcome.
artbin differs from the official power command by allowing many statistical tests, such
as score, Wald, conditional, and trend across K groups, and by offering calculations
under local or distant alternatives with or without continuity correction.
The calculations in artbin are based on a set of anticipated probabilities of the
binary outcome, one in each treatment group. If the unknown probabilities of the binary
outcome equal the anticipated probabilities, then artbin tells us the power achieved
for a specified sample size or the sample size required to achieve the specified power.
The basic idea of sample-size calculation with a binary outcome is well known. We
define the power 1 − β to be the probability of rejecting the null hypothesis at the
two-sided α level of significance.
In a two-group superiority trial, the null hypothesis is that the outcome probabilities
in the two groups are equal and the alternative hypothesis is that they take the unequal
anticipated probabilities $\pi_1^a$ and $\pi_2^a$. If the trial has equal sample sizes $n$ in each group,
then a popular formula for the total sample size required is
$$
2n = 2\,\frac{\left\{ z_{1-\alpha/2}\sqrt{2\bar\pi^a(1-\bar\pi^a)} + z_{1-\beta}\sqrt{\pi_1^a(1-\pi_1^a) + \pi_2^a(1-\pi_2^a)} \right\}^2}{(\pi_2^a-\pi_1^a)^2}
$$
where $z_c = \Phi^{-1}(c)$ is the standard normal deviate and $\bar\pi^a = (\pi_1^a + \pi_2^a)/2$ (Julious and
Campbell 2012). Extensions are well known for unequal sample sizes.
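As a numerical check, the formula can be evaluated directly. The following Python sketch is illustrative only and is not part of artbin; the function name total_sample_size and the choice to round each group up to a whole participant are assumptions made for this example. With anticipated probabilities 0.1 and 0.05, two-sided α = 0.05, and 90% power, it returns the total of 1,164 reported in section 3.2.

```python
from math import ceil, sqrt
from statistics import NormalDist


def total_sample_size(p1, p2, alpha=0.05, power=0.9):
    """Total sample size 2n for two equal groups, using the formula above."""
    z = NormalDist().inv_cdf                      # z_c = Phi^{-1}(c)
    p_bar = (p1 + p2) / 2                         # mean anticipated probability
    bracket = (z(1 - alpha / 2) * sqrt(2 * p_bar * (1 - p_bar))
               + z(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    n_per_group = bracket ** 2 / (p2 - p1) ** 2
    # Assumption: round each group up to a whole participant.
    return 2 * ceil(n_per_group)


print(total_sample_size(0.10, 0.05))  # 1164
```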
However, several complications arise that are tackled by artbin. Some trials have
more than two groups, and in these cases we may test for trend across the groups or for
heterogeneity between the groups. There are variants of the sample-size formulas for
different versions of the test applied to the data (for example, Pearson’s χ2 or Wald),
and there are “local” variants that are valid only when the treatment effect is small.
A loss to follow-up option is useful for the replication of sample-size calculations, as
advocated by Clark, Berger, and Mansmann (2013).
Further, some two-group trials are noninferiority trials, in which the null hypothesis
is that the experimental treatment is worse than the control treatment by at least a
prespecified amount m, termed the margin. They are used when the experimental treatment
is not expected to be superior but has other benefits, such as being cheaper,
less toxic, or easier to administer. Substantial-superiority trials are now
increasingly used, especially in vaccine trials; here the null hypothesis states that the
experimental treatment is better than the control treatment by at most m (see Krause
et al. [2020]).
The latest upgrade of artbin substantially improves the original version released in
2004. The option to specify a margin for noninferiority or substantial-superiority trials
has been included to enable sample-size and power calculations for more-complex two-
group trials. New options for statistical tests and methods are now available, such as the
Wald test, which is commonly used for sample-size calculation in noninferiority trials in
medicine. The syntax and output have been improved, with more options available and
clearer output. artbin does not require the anticipated event probabilities to be the
same in the two groups for noninferiority or substantial-superiority trials, unlike any
other software packages currently available in Stata. Previous users of artbin will need
to alter existing artbin code to accommodate the changes. Please see the description
of what has changed (appendix 1) for further details.
This article has three aims. First, it clearly lays out the scope of the artbin package
and its dialog boxes and exemplifies its use. Second, it describes the updates made.
Third, it clarifies the formulas used.
The article comprises a description of the new syntax (section 2.1), illustrative
examples (section 3), a description of the updated menus and dialogs (section 4), details
of the methods used (section 5), a description of how the software has been tested
(section 6), and conclusions (section 7).
2 The artbin command

2.1 Syntax
artbin, pr(numlist) [margin(#)
    [unfavourable | unfavorable | favourable | favorable] [power(#) | n(#)]
    aratios(aratio_list) ltfu(#) alpha(#) onesided trend doses(dose_list)
    condit wald ccorrect local noround force]


artbin calculates the power or total sample size for various tests comparing K
anticipated probabilities. Power is calculated if n() is specified; otherwise, total sam-
ple size is estimated. artbin can be used in designing superiority, noninferiority, and
substantial-superiority trials.
artbin makes comparisons on the scale of difference in probabilities. The results
on other scales, such as odds ratios, will be very similar for superiority trials but po-
tentially very different for noninferiority and substantial-superiority trials (Quartagno
et al. 2020).
In a multigroup trial, artbin is based on a test of the global null hypothesis that
the probabilities are equal in all groups. The alternative hypothesis is that there is a
difference between two or more of the groups.
In a two-group superiority trial, artbin is based on a test of the null hypothesis that
the probabilities in the two groups are equal. The alternative hypothesis is that they
take unequal values, such that the experimental treatment is better than the control
treatment.
In a noninferiority trial, artbin is based on a test of the null hypothesis that the
experimental treatment is worse than the control treatment by at least a prespecified
amount, termed the margin. artbin supports the design of more-complex noninferiority
trials in which π1a and π2a are unequal. Substantial-superiority trials are increasingly
used; here the null hypothesis is that the experimental treatment is better than the
control treatment by the margin at most.
To minimize the risk of error in two-group trials, the user is advised to identify
whether the trial outcome is favorable or unfavorable. By default, artbin infers
favorability status from the pr() and margin() options. If π2a > π1a + margin(), the
outcome is assumed to be favorable; otherwise, it is assumed to be unfavorable.
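The default inference rule amounts to a single comparison. The Python sketch below only illustrates the rule as stated above; the function name infer_outcome_type is hypothetical, and artbin's own checks (for example, when favourable, unfavourable, or force is specified) are not reproduced.

```python
def infer_outcome_type(pi1, pi2, margin=0.0):
    """Return the outcome type implied by pr(pi1 pi2) and margin()."""
    # Rule from the text: favourable if pi2 > pi1 + margin, else unfavourable.
    return "favourable" if pi2 > pi1 + margin else "unfavourable"


print(infer_outcome_type(0.10, 0.05))          # unfavourable
print(infer_outcome_type(0.90, 0.90, -0.05))   # favourable
```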
2.2 Options
pr(#1 ... #K) specifies the anticipated outcome probabilities in the groups that will be
compared. #1 is the anticipated probability in the control group ($\pi_1^a$), and #2, ...,
#K are the anticipated probabilities in the treatment groups ($\pi_2^a, \ldots, \pi_K^a$). pr() is
required.
margin(#) is used with two-group trials and must be specified if a noninferiority or
substantial-superiority trial is being designed. The default is margin(0), denoting
a superiority trial. If the event of interest is unfavorable, the null hypothesis for all
of these designs is π2 − π1 ≥ m, where m is the prespecified margin. The alternative
hypothesis is π2 − π1 < m. m > 0 denotes a noninferiority trial, whereas m < 0
denotes a substantial-superiority trial. On the other hand, if the event of interest
is favorable, the above inequalities are reversed. The null hypothesis for all of these
designs is then π2 − π1 ≤ m, and the alternative hypothesis is π2 − π1 > m. m < 0
denotes a noninferiority trial, while m > 0 denotes a substantial-superiority trial.
The hypothesized margin for the difference in anticipated probabilities, #, must lie
between −1 and 1.
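The relationship between the sign of the margin, the outcome type, and the trial type can be summarized in a few lines. This Python sketch simply restates the rules in the margin() description; the function name classify_design and the returned strings are hypothetical and chosen for illustration.

```python
def classify_design(margin, outcome):
    """Return (trial type, null hypothesis) for a two-group design."""
    if outcome not in ("favourable", "unfavourable"):
        raise ValueError("outcome must be 'favourable' or 'unfavourable'")
    if not -1 < margin < 1:
        raise ValueError("margin must lie between -1 and 1")
    # Unfavourable outcome: H0 is pi2 - pi1 >= m; favourable: H0 is pi2 - pi1 <= m.
    null = "pi2 - pi1 >= m" if outcome == "unfavourable" else "pi2 - pi1 <= m"
    if margin == 0:
        return "superiority", null
    # A positive margin means noninferiority only when the outcome is unfavourable.
    noninferiority = (margin > 0) == (outcome == "unfavourable")
    return ("noninferiority" if noninferiority else "substantial-superiority"), null


print(classify_design(-0.05, "favourable"))  # ('noninferiority', 'pi2 - pi1 <= m')
```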
unfavourable | unfavorable or favourable | favorable are used with two-group trials
to specify whether the outcome is unfavorable or favorable. If either option is used,
artbin checks the assumptions; otherwise, it infers the favorability status. American
and British spellings are both allowed.
power(#) specifies the required power of the trial at the alpha() significance level and
computes the total sample size. power() cannot be used with n(). The default is
power(0.8).
n(#) specifies the total sample size available and computes the corresponding power.
n() cannot be used with power(). The default is to calculate the sample size for
power 0.8.
aratios(aratio_list) specifies the allocation ratios. The allocation ratio for group k
is #k, k = 1, . . . , K; for example, aratios(1 2) means that two participants are
randomized to the experimental group for each one randomized to the control group.
With two groups, aratios(#) is taken to mean aratios(1 #). The default is equal
allocation to all groups.
ltfu(#) assumes a proportional loss to follow-up of #, where # is a number between
0 and 1. The total sample size is divided by 1 − # before rounding. The default is
ltfu(0), meaning no loss to follow-up.
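The adjustment for loss to follow-up is a simple inflation of the unrounded total. The Python sketch below illustrates that arithmetic only; the function name inflate_for_ltfu is hypothetical, and rounding is deliberately left to a later step, as described above.

```python
def inflate_for_ltfu(total_n, ltfu):
    """Divide the unrounded total sample size by 1 - ltfu."""
    if not 0 <= ltfu < 1:
        raise ValueError("ltfu must be at least 0 and less than 1")
    return total_n / (1 - ltfu)


# A design needing 320 assessable participants, with 20% loss to follow-up:
print(round(inflate_for_ltfu(320, 0.2), 6))  # 400.0
```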
alpha(#) specifies that the trial will be analyzed using a significance test with level #.
That is, # is the type 1 error probability. The default is alpha(0.05).
onesided is used for two-group trials and for trend tests in multigroup trials. It spec-
ifies that the significance level given by alpha() is one sided. Otherwise, the value
of alpha() is halved to give a one-sided significance level. Thus, for example,
alpha(0.05) is exactly the same as alpha(0.025) onesided.
artbin always assumes that a two-group trial or a trend test in a multigroup trial
will be analyzed using a one-sided alternative, regardless of whether the alpha level
was specified as one sided or two sided. artbin, therefore, uses a slightly different
definition of power from the power command: when a two-tailed test is performed,
power reports the probability of rejecting the null hypothesis in either direction,
whereas artbin only considers rejecting the null hypothesis in the direction of in-
terest.
artbin assumes that multigroup trials will be analyzed using a two-sided alterna-
tive, so onesided is not allowed with multigroup trials unless trend or doses() is
specified (see below).
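The two points made for onesided can be checked numerically. The Python sketch below is a simplified illustration, not artbin's code: it assumes a test statistic that is standard normal under the null and shifted by a hypothetical standardized effect (shift) with unit variance under the alternative. It shows that a two-sided 0.05 level and a one-sided 0.025 level use the same critical value, and that the power in either direction exceeds the power in the direction of interest by only a very small amount for a well-powered design.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def critical_value(alpha, onesided=False):
    """Critical z value: alpha is halved unless it is already one sided."""
    a = alpha if onesided else alpha / 2
    return _STD_NORMAL.inv_cdf(1 - a)


def power_direction_of_interest(shift, alpha=0.05):
    """Probability of rejecting in the direction of interest only."""
    return _STD_NORMAL.cdf(shift - critical_value(alpha))


def power_either_direction(shift, alpha=0.05):
    """Probability of rejecting in either direction (two-tailed test)."""
    z = critical_value(alpha)
    return _STD_NORMAL.cdf(shift - z) + _STD_NORMAL.cdf(-shift - z)


shift = critical_value(0.05) + _STD_NORMAL.inv_cdf(0.9)  # gives 90% power
print(round(power_direction_of_interest(shift), 6))      # 0.9
print(power_either_direction(shift) - power_direction_of_interest(shift))
```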
trend is used for trials with more than two groups and specifies that the trial will be
analyzed using a linear trend test. The default is a test for any difference between
the groups. See also doses().
doses(dose_list) is used for trials with more than two groups and specifies “doses” or
other quantitative measures for a dose–response (linear trend) test. doses() implies
trend. doses(#1 #2 . . . #r) assigns doses for groups 1, . . . , r. If r < K (the total
number of groups), the dose is assumed equal to #r for groups r + 1, r + 2, . . . , K. If
trend is specified without doses(), then the default is doses(1 2 . . . K). doses()
is not permitted for a two-group trial.
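The rules for filling in doses can be sketched as follows. This Python function is a hypothetical illustration of the defaults described above (doses 1, 2, ..., K when none are given, and the last supplied dose repeated when fewer than K are given); it is not artbin's parsing code.

```python
def resolve_doses(k_groups, doses=None):
    """Return one dose per group, following the defaults described above."""
    if k_groups < 3:
        raise ValueError("doses() is not permitted for a two-group trial")
    if not doses:
        return list(range(1, k_groups + 1))        # default: 1 2 ... K
    if len(doses) > k_groups:
        raise ValueError("more doses than groups")
    # Groups r+1, ..., K take the last supplied dose #r.
    return list(doses) + [doses[-1]] * (k_groups - len(doses))


print(resolve_doses(4))           # [1, 2, 3, 4]
print(resolve_doses(4, [0, 10]))  # [0, 10, 10, 10]
```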
condit specifies that the trial will be analyzed using Peto’s conditional test. This
test conditions on the total number of events observed and is based on Peto’s lo-
cal approximation to the log odds-ratio. This option is also likely to be a good
approximation with other conditional tests. The default is the usual Pearson χ2
test. condit is not available for noninferiority and substantial-superiority trials. condit
cannot be used with wald, because only one test type is allowed. condit implies
local. The ccorrect option is not available with condit.
wald specifies that the trial will be analyzed using the Wald test. The default is the
usual Pearson χ2 test. wald cannot be used with condit, because only one test type
is allowed. The Wald test inherently allows for distant alternatives, so wald and
local cannot be used together.
ccorrect specifies that the trial will be analyzed with a continuity correction. ccorrect
is not available with condit. The default is no continuity correction.
local specifies that the calculation should use the variance of the difference in propor-
tions only under the null. This approximation is valid when the treatment effect is
small. The default uses the variance of the difference in proportions both under the
null and under the alternative hypothesis. The local method is not recommended and
is only included to allow comparisons with other software. The Wald test inherently
allows for distant alternatives, so wald and local cannot be used together.
noround prevents rounding of the calculated sample size in each group up to the nearest
integer. The default is to round.
force can be used with two-group studies to override the program’s inference of the
favorable or unfavorable outcome type. This may be needed, for example, when
designing an observational study with a harmful risk factor; the favorability types
would be reversed and the force option applied.
3 Examples
3.1 Binary outcome and comparison with published sample size
We reproduce the sample-size calculation in Pocock (1983) for a two-group superiority
trial comparing the efficacy of therapeutic doses of Anturan in patients after a myocar-
dial infarction with the placebo standard treatment. The primary outcome was death
from any cause within one year of first treatment. The control (placebo) group was
anticipated to have a 10% probability of death within one year and the Anturan treat-
ment group a 5% probability, with the trial powered at 90%. The patient outcome was
binary: either failure (death in a year) or success (survival). The published sample size
was 578 patients per group (1,156 patients in total).
In the artbin example below, we do not specify in the syntax whether the outcome
is favorable or unfavorable; rather, we let the program infer it. The aim of a clinical
trial is always to improve patient outcome. Therefore, because the experimental-group
anticipated probability (π2a = 0.05) is less than the control-group anticipated probability
(π1a = 0.1), it can be inferred that the outcome is unfavorable (that is, the trial is aiming
to reduce the probability of the event occurring, in this case, death).
. artbin, pr(0.1 0.05) alpha(0.05) power(0.9) wald

ART - ANALYSIS OF RESOURCES FOR TRIALS (binary version 2.0.1 09june2022)
A sample size program by Abdel Babiker, Patrick Royston, Friederike Barthel,
Ella Marley-Zagar and Ian White
MRC Clinical Trials Unit at UCL, London WC1V 6LJ, UK.

Type of trial                          superiority
Number of groups                       2
Favourable/unfavourable outcome        unfavourable
                                       Inferred by the program
Allocation ratio                       equal group sizes
Statistical test assumed               unconditional comparison of 2
                                       binomial proportions
                                       using the wald test
Local or distant                       distant
Continuity correction                  no
Anticipated event probabilities        0.100 0.050
Alpha                                  0.050 (two-sided)
                                       (taken as .025 one-sided)
Power (designed)                       0.900
Total sample size (calculated)         1156
Sample size per group (calculated)     578 578
Expected total number of events        86.70
The artbin output table shows the trial setup information, including the study
design, statistical tests, and methods used. The hypothesis tests are shown with the
calculated sample size and events based on the selected power. A total sample size of
1,156 participants is required, as per the published sample size given by Pocock (1983).
The same result is achieved by the command artbin, pr(0.9 0.95) alpha(0.05)
power(0.9) wald, assuming a favorable outcome (survival) instead. The Wald test
is used instead of the default score test because Pocock used the sample estimate in
the method of estimating the variance of the difference in proportions under the null
hypothesis H0 .
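The Wald figures in this example can be reproduced by hand. The Python sketch below is an independent illustration under stated assumptions, not artbin's implementation: it uses the variance at the anticipated probabilities under both hypotheses, equal allocation, and rounds each group up to a whole participant. The function name wald_two_group is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def wald_two_group(p1, p2, alpha=0.05, power=0.9):
    """Per-group size, total size, and expected events for a Wald-type design."""
    z = NormalDist().inv_cdf
    # Same variance p1(1-p1) + p2(1-p2) is used under H0 and under Ha.
    n = ceil((z(1 - alpha / 2) + z(power)) ** 2
             * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    return n, 2 * n, n * (p1 + p2)


n, total, events = wald_two_group(0.10, 0.05)
print(n, total, round(events, 2))  # 578 1156 86.7
```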
3.2 Binary outcome and comparison with power

We compare the output of artbin with the output of Stata’s power command, which,
like artbin, uses the score test as the default.
. power twoproportions 0.1 0.05, alpha(0.05) power(0.9)

Performing iteration ...
Estimated sample sizes for a two-sample proportions test
Pearson's chi-squared test
H0: p2 = p1 versus Ha: p2 != p1
Study parameters:
alpha = 0.0500
power = 0.9000
delta = -0.0500 (difference)
p1 = 0.1000
p2 = 0.0500
Estimated sample sizes:
N = 1,164
N per group = 582
. artbin, pr(0.1 0.05) alpha(0.05) power(0.9)


Number of groups 2
Favourable/unfavourable outcome unfavourable
using the score test
Both give a total sample size of 1,164.
3.3 One-sided noninferiority trial

Next we show a one-sided noninferiority trial with the onesided option. We anticipate a
90% probability of survival in both the control group and the treatment group, with the
null hypothesis that the treatment group is at least 5 percentage points less effective than the control.
. artbin, pr(0.9 0.9) margin(-0.05) onesided


Type of trial non-inferiority

Number of groups 2
Favourable/unfavourable outcome favourable
Null hypothesis H0: H0: pi2 - pi1 <= -.05
Alternative hypothesis H1: H1: pi2 - pi1 > -.05
Alpha 0.050 (one-sided)
A sample size of 457 is required in each group.
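The figure of 457 per group can be reproduced with a short calculation based on the two-group formulas in section 5. The Python sketch below is an illustration under assumptions, not artbin's code: it takes the constrained estimates to be the values with π2 − π1 = m that maximize the expected binomial log-likelihood at the anticipated probabilities, finds them by bisection, and assumes equal allocation. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def constrained_estimates(pi1, pi2, margin, r1=0.5, r2=0.5):
    """Values (p1, p2) with p2 - p1 = margin maximizing the expected
    binomial log-likelihood (assumed form of the constrained estimates)."""
    def slope(p1):
        p2 = p1 + margin
        return (r1 * (pi1 / p1 - (1 - pi1) / (1 - p1))
                + r2 * (pi2 / p2 - (1 - pi2) / (1 - p2)))

    eps = 1e-12
    lo, hi = max(0.0, -margin) + eps, min(1.0, 1.0 - margin) - eps
    for _ in range(200):          # bisection: slope() is decreasing in p1
        mid = (lo + hi) / 2
        if slope(mid) > 0:
            lo = mid
        else:
            hi = mid
    p1 = (lo + hi) / 2
    return p1, p1 + margin


def score_margin_n(pi1, pi2, margin, alpha=0.05, power=0.8):
    """Per-group size: equal allocation, one-sided alpha, score test (distant)."""
    z = NormalDist().inv_cdf
    p1, p2 = constrained_estimates(pi1, pi2, margin)
    v_n = (p1 * (1 - p1) + p2 * (1 - p2)) / 0.5        # variance under H0
    v_a = (pi1 * (1 - pi1) + pi2 * (1 - pi2)) / 0.5    # variance under Ha
    total = ((z(1 - alpha) * sqrt(v_n) + z(power) * sqrt(v_a))
             / (pi2 - pi1 - margin)) ** 2
    return ceil(total / 2)


print(score_margin_n(0.9, 0.9, -0.05))  # 457
```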
3.4 Superiority trial with multiple groups

Here we demonstrate a superiority trial with more than two groups. Instead of com-
paring each of the treatment groups with the control group, artbin uses a global test
to assess if there is any difference among the groups.
. artbin, pr(0.1 0.2 0.3 0.4) alpha(0.1) power(0.9)



Number of groups 4
Favourable/unfavourable outcome not determined
Anticipated event probabilities 0.100 0.200 0.300 0.400
Sample size per group (calculated) 44 44 44 44
A sample size of 44 is required in each of the four groups.
3.5 Complex noninferiority trial in a real-life setting

Finally, we demonstrate a more complex noninferiority design from the STREAM trial.
The need for the STREAM trial arose from the increase of multidrug-resistant strains
of tuberculosis, especially in countries without robust healthcare systems that were
unable to administer treatment over long periods of time. The STREAM trial evaluated
a shorter, more intensive treatment for multidrug-resistant tuberculosis compared with
the lengthier treatment recommended by the World Health Organization.
A favorable outcome was defined by cultures negative for Mycobacterium tuberculosis
at 132 weeks and at a previous occasion, with no intervening positive culture or
previous unfavorable outcome (Nunn et al. 2019). The sample-size calculation used an
anticipated 0.7 probability of a favorable outcome on control ($\pi_1^a$) and 0.75 on treatment
($\pi_2^a$). Hence, it was assumed that 70% of the participants in the long-regimen group and
75% in the short-regimen group would attain a favorable outcome. A 10-percentage-point
noninferiority margin was considered to be an acceptable difference in efficacy,
given the shorter treatment duration (m = −0.1 defined as π2 − π1). It was assumed that
twice as many patients would be allocated to treatment as to control. The Wald test
was applied because it is often used in noninferiority trials.
. artbin, pr(0.7 0.75) margin(-0.1) power(0.8) aratios(1 2) wald ltfu(0.2)


Type of trial non-inferiority

Number of groups 2
Favourable/unfavourable outcome favourable
Allocation ratio 1:2
using the wald test
Null hypothesis H0: H0: pi2 - pi1 <= -.1
Alternative hypothesis H1: H1: pi2 - pi1 > -.1
Loss to follow up assumed: 20 %
The noninferiority trial required a total sample size of 399 (133 in control and 266
in treatment), assuming 20% of patients were not assessable in the primary analysis. When
the STREAM trial concluded, it estimated that a shorter, more intensive treatment for
multidrug-resistant tuberculosis was only 1% less effective than the lengthier treatment
recommended by the World Health Organization and demonstrated significant evidence
of noninferiority.
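The STREAM sample size can be retraced step by step. The Python sketch below is illustrative and rests on assumptions: it uses the Wald variance from section 5 with allocation fractions 1/3 and 2/3, divides by 1 − 0.2 for loss to follow-up, and then rounds up the size of one allocation unit before multiplying by the ratios. That rounding rule is an assumption chosen because it reproduces the 133 and 266 reported above; artbin's own rounding may be implemented differently. The function name stream_design is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def stream_design(pi1=0.70, pi2=0.75, margin=-0.10, ratios=(1, 2),
                  alpha=0.05, power=0.80, ltfu=0.20):
    """Unrounded total and rounded group sizes for a Wald noninferiority design."""
    z = NormalDist().inv_cdf
    r1, r2 = (r / sum(ratios) for r in ratios)          # allocation fractions
    v_a = pi1 * (1 - pi1) / r1 + pi2 * (1 - pi2) / r2   # Wald: V_n = V_a
    # alpha is two sided here, so alpha / 2 is used for the one-sided test.
    n_total = (z(1 - alpha / 2) + z(power)) ** 2 * v_a / (pi2 - pi1 - margin) ** 2
    n_total /= 1 - ltfu                                  # loss to follow-up
    unit = ceil(n_total / sum(ratios))                   # assumed rounding rule
    return n_total, [unit * r for r in ratios]


n_total, groups = stream_design()
print(round(n_total, 1), groups, sum(groups))  # 397.3 [133, 266] 399
```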
4 Menu and dialogs

All the features in artbin are available from the artbin menu and associated dialogs.
Once the selections have been inputted into the menu box, the associated command line
will be displayed in the Review window. If the user would like to generate a do-file to
reproduce the calculations, a log file can be opened before executing the commands via
the dialog, which will then save the command line.
Once the ART package has been installed in Stata, the artbin dialog menu can
be used. To access the interactive menu, type artmenu on, which will cause a new
item, ART, to appear on the system menu bar under User. To turn this menu off, type
artmenu off. ART consists of three programs, namely,
• survival outcomes (corresponding to artsurv),
• projection of events and power (corresponding to artpep), and
• binary outcomes (corresponding to artbin).
artsurv and artpep are described in Barthel, Royston, and Babiker (2005) and
Royston and Barthel (2010), respectively.
Compared with previous versions, new options such as Margin, Favourable or
Unfavourable, Loss to follow-up, Score test, Wald test, Continuity correction, and Do not
round have now been included within an updated layout design.
Figure 1 illustrates the dialog box for binary outcomes. The artbin dialog box allows
the user to input the parameters for the trial setup. Options are deselected based on
the user’s choices; for example, if the Conditional test (Peto) checkbox is selected, then
the Wald test checkbox will be grayed out.
Figure 1. Example of a completed artbin menu for binary outcomes
The dialog box output is the same as the output in section 3.5 and corresponds to
the inputs shown in the figure 1 menu box. The detailed display enables the user to
check that the trial design has been inputted correctly.
5 Methods and formulas

5.1 Notation
Consider the design of a study to compare $K$ independent groups in terms of a binary
outcome whose probability of occurrence for an individual in group $k$ is $\pi_k$,
$k = 1, 2, \ldots, K$. We refer to group 1 as a control group and groups $2, \ldots, K$ as
experimental groups.
Let $Y_k$ be the number of events in a sample of size $n_k = r_k N$ from a total sample
size $N$, where $r_k$ is the fraction allocated to group $k$ for $k = 1, 2, \ldots, K$. Then $Y_k$ has
the binomial distribution $\mathrm{binom}(n_k, \pi_k)$. Write $\bar\pi = \sum_{k=1}^{K} r_k \pi_k$ as the overall outcome
probability. Let $Y_{\cdot} = \sum_k Y_k$. The estimated outcome probabilities $\hat\pi_k$ and $\hat\pi$ are
$\hat\pi_k = Y_k/(r_k N)$ and $\hat\pi = Y_{\cdot}/N = \sum_{k=1}^{K} r_k \hat\pi_k$.
We consider the general case and then the case K = 2. For each case, we define a
test statistic and derive its distribution under the null and alternative hypotheses (sec-
tion 5.2). We then apply generic methods to derive sample sizes or powers (section 5.3).
5.2 Summary of test statistics and their distributions

Unconditional methods are based on a score vector $U = (U_2, \ldots, U_K)'$, where
$U_k = \hat\pi_k - \hat\pi$. Conditional methods are based on a different score vector $X = (X_2, \ldots, X_K)'$,
where $X_k = Y_k - r_k Y_{\cdot} = r_k N U_k$. Table 1 shows the test statistics and their null and
alternative distributions. See appendix 2 for further details of definitions, such as Q, V,
A, M, and T. All methods are unconditional unless otherwise stated. The approximate
distant method is based on the work of Yuan and Bentler (2010).
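The two score vectors are easy to compute from observed counts. The Python sketch below uses made-up event counts purely for illustration (they are not from any trial) and confirms the identity X_k = r_k N U_k; the function name score_vectors is hypothetical.

```python
def score_vectors(events, group_sizes):
    """Return (U, X, r, N) for groups 2, ..., K from counts Y_k and sizes n_k."""
    total_n = sum(group_sizes)
    r = [n / total_n for n in group_sizes]                # allocation fractions
    pi_hat = [y / n for y, n in zip(events, group_sizes)]
    pi_hat_all = sum(events) / total_n                    # overall estimate
    u = [p - pi_hat_all for p in pi_hat[1:]]              # U_k = pi_hat_k - pi_hat
    x = [y - rk * sum(events) for y, rk in zip(events[1:], r[1:])]
    return u, x, r, total_n


# Hypothetical data: 10, 20, and 30 events in three groups of 100.
u, x, r, n = score_vectors([10, 20, 30], [100, 100, 100])
for uk, xk, rk in zip(u, x, r[1:]):
    print(round(xk, 6), round(rk * n * uk, 6))  # the two columns agree
```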
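Table 1. Summary of test statistics and their distributions

K groups, heterogeneity

| Method | Statistic | Null distribution | Alternative distribution |
|---|---|---|---|
| Score, local | $Q_u = N U' \hat V_u^{-1} U$, where $\hat V_u = N\,\widehat{\mathrm{var}}(U \mid H_0)$ | $\chi^2_{K-1}$ | $NC\chi^2(K-1, \lambda)$, where $\lambda = N \mu' V_u^{-1} \mu$ and $\mu_k = \pi_k^a - \bar\pi^a$ |
| Score, distant (approximate) | same | same | $c\,NC\chi^2(K-1, \gamma)$ (Yuan and Bentler 2010), with equations for $c$, $\gamma$ (see appendix 2) |
| Wald | $Q_w = N U' \hat A^{-1} U$, where $\hat A = N\,\widehat{\mathrm{var}}(U \mid H_a)$ | $\chi^2_{K-1}$ | $NC\chi^2(K-1, \lambda)$, where $\lambda = N \mu' A^{-1} \mu$ |
| Conditional, local | $Q_c = X' V_c^{-1} X / M$, where $M = \hat\pi(1-\hat\pi)N^2/(N-1)$ and $V_c = \mathrm{var}(X \mid H_0)/M$ | $\chi^2_{K-1}$ | $NC\chi^2(K-1, \lambda)$, where $\lambda = M \eta' V_c \eta$ and $\eta_k = \mathrm{logit}\,\pi_k^a - \mathrm{logit}\,\pi_1^a$ |

K groups, trend

| Method | Statistic | Null distribution | Alternative distribution |
|---|---|---|---|
| Score, local | $T_u = c'U$, where $c_k = r_k(d_k - d_1)$ and $d_1, d_2, \ldots, d_K$ are doses for groups $1, 2, \ldots, K$ | $N(0, c'V_u c/N)$ | $N(c'\mu, c'V_u c/N)$ |
| Score, distant | same | same | $N(c'\mu, c'Ac/N)$ |
| Wald | same | $N(0, c'Ac/N)$ | $N(c'\mu, c'Ac/N)$ |
| Conditional, local | $T_c = c'X/M$ | $N(0, c'V_c c/M)$ | $N(c'V_c\eta, c'V_c c/M)$ |

Two groups, superiority or noninferiority

| Method | Statistic | Null distribution | Alternative distribution |
|---|---|---|---|
| All | $T_2 = \hat\delta - m$, where $\hat\delta = \hat\pi_2 - \hat\pi_1$ and $m$ = margin | $N(0, V_n/N)$ | $N(\delta - m, V_a/N)$, where $V_a = \pi_1^a(1-\pi_1^a)/r_1 + \pi_2^a(1-\pi_2^a)/r_2$ |

In the above, $V_n = \tilde\pi_1^a(1-\tilde\pi_1^a)/r_1 + \tilde\pi_2^a(1-\tilde\pi_2^a)/r_2$, where $\tilde\pi_1^a$ and $\tilde\pi_2^a$
are values of $\pi_1^a$ and $\pi_2^a$ modified to conform to $H_0$ in one of the following ways:

| Method | Values used |
|---|---|
| Score, distant | Maximum likelihood estimates of $\pi_1$ and $\pi_2$ constrained to $\delta = m$ |
| Score, local | As score, but replacing $V_a$ with $V_n$ |
| Wald | $\tilde\pi_1^a = \pi_1^a$ and $\tilde\pi_2^a = \pi_2^a$ (so $V_n = V_a$) |
| Conditional, local | Methods for K groups are used (superiority trial only) |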
local
5.3 Summary of methods

5.3.1 K groups, heterogeneity
The following statistics are approximated as $\chi^2_{K-1}$ under the null. Let $x_\alpha(m)$ be the
$(1-\alpha)100$th percentile of the (central) $\chi^2$ distribution with $m$ degrees of freedom. Then,
for a test statistic for which we write $S_N$ to emphasize its dependence on sample size
$N$, power is related to the total sample size $N$ by the equation
$$
\text{power} = \Pr\{S_N > x_\alpha(K-1) \mid H_a\} \qquad (1)
$$
The distributions under the alternative hypothesis are all of the form $cX$, where $c$ is
a constant depending on $N$ and $X$ is a noncentral $\chi^2$ random variable with $K-1$
degrees of freedom and noncentrality parameter $\lambda$ depending on $N$ and the anticipated
probabilities. Then (1) gives the key equation
$$
\text{power} = 1 - F_{K-1,\lambda}\{x_\alpha(K-1)/c\}
$$
where $F_{K-1,\lambda}(x)$ is the cumulative distribution function of the noncentral $\chi^2$ distribution
with $K-1$ degrees of freedom and noncentrality parameter $\lambda$. We can directly
evaluate this for power given $N$. Solving for $N$ given power involves iterative methods
in some cases.
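The key equation can be evaluated with a small amount of code once λ and c are known. The Python sketch below is a self-contained illustration, not artbin's implementation: it computes the central χ2 distribution function from the series for the incomplete gamma function, obtains the noncentral distribution function as a Poisson mixture of central ones, and finds the critical value by bisection. All function names are hypothetical, and the routine is intended only for moderate values of λ. As a check, with K = 2 the result must agree with the normal-based calculation, so a noncentrality of (z_{0.975} + z_{0.9})^2 should give power very close to 0.9.

```python
from math import exp, lgamma, log
from statistics import NormalDist


def chi2_cdf(x, df):
    """Central chi-squared CDF via the incomplete-gamma series."""
    if x <= 0:
        return 0.0
    a, h = df / 2, x / 2
    term = total = 1 / a
    n = 0
    while term > 1e-17 * total:
        n += 1
        term *= h / (a + n)
        total += term
    return min(1.0, total * exp(a * log(h) - h - lgamma(a)))


def chi2_quantile(p, df):
    """Quantile of the central chi-squared distribution, by bisection."""
    lo, hi = 0.0, float(df)
    while chi2_cdf(hi, df) < p:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def noncentral_chi2_cdf(x, df, lam):
    """Noncentral chi-squared CDF as a Poisson mixture of central CDFs."""
    if lam == 0:
        return chi2_cdf(x, df)
    total, weight_sum, j = 0.0, 0.0, 0
    while weight_sum < 1 - 1e-12 and j < 5000:
        weight = exp(-lam / 2 + j * log(lam / 2) - lgamma(j + 1))
        total += weight * chi2_cdf(x, df + 2 * j)
        weight_sum += weight
        j += 1
    return total


def heterogeneity_power(k_groups, lam, alpha=0.05, c=1.0):
    """power = 1 - F_{K-1,lambda}{x_alpha(K-1)/c}."""
    df = k_groups - 1
    return 1 - noncentral_chi2_cdf(chi2_quantile(1 - alpha, df) / c, df, lam)


z = NormalDist().inv_cdf
print(round(heterogeneity_power(2, (z(0.975) + z(0.9)) ** 2), 4))  # 0.9
```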
5.3.2 All other cases
These statistics $S_N$ are all approximated as $N(0, \sigma_0^2/N)$ under $H_0$ and $N(\mu_1, \sigma_1^2/N)$
under $H_a$, where $\sigma_1$ depends on the anticipated probabilities. Let $z_a$ denote the
$(1-a)100$th percentile of the standard normal distribution, where, for a one-sided test,
$a = \alpha$, and for a two-sided test, $a = \alpha/2$. Then (1) gives the key equation
$$
\text{power} = \Pr\left(S_N > z_a\sigma_0/\sqrt{N} \mid H_a\right) = \Phi\left(\frac{\mu_1 - z_a\sigma_0/\sqrt{N}}{\sigma_1/\sqrt{N}}\right)
$$
Rearranging, the total sample size to achieve power $1-\beta$ is
$$
N = \left(\frac{z_a\sigma_0 + z_\beta\sigma_1}{\mu_1}\right)^2
$$
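The two equations of this section translate directly into code. The Python sketch below is illustrative, with hypothetical function names; it takes μ1, σ0, and σ1 as inputs and assumes the direction of interest is coded so that μ1 is positive. As an example it uses the two-group score test for anticipated probabilities 0.1 and 0.05 with equal allocation, for which the constrained estimates under H0 are both the pooled value 0.075, so that σ0² = 4 × 0.075 × 0.925 and σ1² = 2 × (0.1 × 0.9 + 0.05 × 0.95).

```python
from math import ceil, sqrt
from statistics import NormalDist

_N01 = NormalDist()


def total_n(mu1, sigma0, sigma1, alpha=0.05, power=0.8, onesided=False):
    """N = {(z_a * sigma0 + z_beta * sigma1) / mu1}^2."""
    a = alpha if onesided else alpha / 2
    z_a, z_beta = _N01.inv_cdf(1 - a), _N01.inv_cdf(power)
    return ((z_a * sigma0 + z_beta * sigma1) / mu1) ** 2


def power_at(n, mu1, sigma0, sigma1, alpha=0.05, onesided=False):
    """Phi{(mu1 - z_a * sigma0 / sqrt(N)) / (sigma1 / sqrt(N))}."""
    a = alpha if onesided else alpha / 2
    z_a = _N01.inv_cdf(1 - a)
    return _N01.cdf((mu1 * sqrt(n) - z_a * sigma0) / sigma1)


sigma0 = sqrt(4 * 0.075 * 0.925)                 # pooled variance under H0
sigma1 = sqrt(2 * (0.1 * 0.9 + 0.05 * 0.95))     # variance under Ha
n = total_n(0.05, sigma0, sigma1, alpha=0.05, power=0.9)
print(ceil(n / 2))                                # 582 per group
print(round(power_at(n, 0.05, sigma0, sigma1), 6))  # 0.9
```

Evaluating the power at the unrounded N recovers the design power, which confirms that the two equations are inverses of each other.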
6 Software testing
artbin is for use in the design of randomized trials, so we have tested it extensively. The
program was modified by Ella Marley-Zagar and tested by Ella Marley-Zagar, Ian R.
White, Patrick Royston, and Abdel G. Babiker. We report the testing methods below
to verify both the sample-size and the power results. We ran the test scripts with the
default variable type (set type) as float and as double.
1. We compared results for noninferiority trials with those given by Julious and
Owen (2011), Blackwelder (1982), Pocock (2003), and the online calculator Sealed
Envelope (2012). Exact agreement was achieved.
2. We compared results for a superiority binary outcome with those given by Pocock
(1983) and the online calculator Sealed Envelope (2012). Exact agreement was
achieved.
3. We tested several scenarios including continuity correction results given by artbin
and those given by the Stata program power. The results from both programs
were in agreement.
4. We checked the results given by artbin using the margin() option against Julious
and Owen (2011). Exact agreement was achieved.
5. The output of artbin was compared with Cytel’s software EAST, which is a so-
phisticated package able to produce sample-size and power calculations for several
binary outcomes in clinical trial settings. We achieved perfect agreement in all
but a handful of cases where the sample size differed by 1, which we believe is due
to the difference in the way the packages round sample size.
6. For the new syntax options, we tested onesided for a one-sided test and ccorrect
to apply a continuity correction.
7. We tested every permutation of two-group and more than two-group and noninfe-
riority, substantial-superiority, and superiority trials with margin, local or distant,
conditional or unconditional, trend, and Wald test options to check that the results
were as expected and that sample size was increased or decreased accordingly.
8. We checked error messages in several impossible cases to ensure that we obtained
error messages as required.
9. We tested the dialog box menu options to verify that the results were as required.
7 Conclusions
We have written artbin to include new syntax with additional options, including ex-
tensions to the tests and methods offered by previous versions of the software. We have
also refreshed the layout of the dialog box for artbin, with mutually exclusive options
grayed out for clarity. The updated artbin program compares well with Stata’s power
program, as well as other commercially available products such as Cytel’s EAST and the
Sealed Envelope Calculator. One of the main features of artbin that sets it apart from
the other available software in Stata is the range of trial types, statistical tests, and
methods that it offers for sample-size calculation. Notably, Stata’s power can provide
sample size for superiority trials only.
As noted in section 2.2, artbin reports power as the probability of rejecting the
null hypothesis in the direction of interest, whereas power reports the probability of
rejecting the null hypothesis in either direction if a two-tailed test is performed. We
believe the former is more appropriate for a clinical trial. Technically, this procedure is
conservative, but the difference matters only for unrealistically large alpha.
The majority of noninferiority trials are designed so that π1a = π2a . However, artbin
allows more flexibility where π1a and π2a can differ, as in section 3.5. The noninferiority
margin is expressed on the risk-difference scale, and the results would be very different
for other scales (Quartagno et al. 2020). All calculations in artbin are based on the
approximation that the difference in proportions is normally distributed (or for the
conditional case that the score statistic is normally distributed). This approximation
may fail with very small sample sizes, in which case the continuity correction should
be used. We suggest using the usual rule for the Pearson χ2 test, namely, to mistrust
the results when any expected cell count is lower than about 5. Concerned users should
check the power by simulation.
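A rough screen for the expected-cell-count rule can be coded in a few lines. The Python sketch below is a simplification, with a hypothetical function name: it uses the anticipated probabilities in each group rather than pooled expectations under the null hypothesis, and the threshold of 5 is the rule of thumb mentioned above.

```python
def smallest_expected_cell(per_group_n, probabilities):
    """Smallest expected count of events or nonevents across the groups."""
    cells = []
    for n, p in zip(per_group_n, probabilities):
        cells += [n * p, n * (1 - p)]      # expected events and nonevents
    return min(cells)


# Design from section 3.1: comfortably above 5.
print(round(smallest_expected_cell([578, 578], [0.10, 0.05]), 1))  # 28.9
# A hypothetical small design with 30 per group: below 5, so mistrust it.
print(round(smallest_expected_cell([30, 30], [0.10, 0.05]), 1))    # 1.5
```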
We have not so far offered advice on which method to use. In our experience, analysts
often use the score test for superiority trials and the Wald test for noninferiority trials.
For small trials, conditional tests are often used. With small differences in probabilities,
all tests give similar results. We recommend avoiding the Wald test when there are
large differences in probabilities, and we would never use the local option except when
comparing results from other programs.
Furthermore, the design of multigroup trials in artbin is based on testing the global
null hypothesis that there is no difference between any of the groups, rather than on
comparing each group with the control. Comparisons with the control can, however, be
designed by applying the two-group case to each comparison; if the familywise error rate
is to be controlled, this can be done by dividing alpha by the number of comparisons.
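As a sketch of this approach, the commands below design each of three comparisons with the control as a separate two-group trial. The anticipated probabilities, the familywise alpha of 0.1, and the power of 0.9 are illustrative assumptions; only the documented pr(), alpha(), and power() options are used.

```stata
* Three pairwise comparisons with control, familywise alpha = 0.1 (illustrative)
local ncomp = 3
local alpha_each = 0.1/`ncomp'     // alpha divided by the number of comparisons
artbin, pr(0.1 0.2) alpha(`alpha_each') power(0.9)
artbin, pr(0.1 0.3) alpha(`alpha_each') power(0.9)
artbin, pr(0.1 0.4) alpha(`alpha_each') power(0.9)
```

The trial would then need the largest sample size per group that any of these comparisons requires, which here is the comparison with the smallest anticipated difference.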
artbin has been created to assist the design of clinical trials, but it can also be used
in the design of observational studies to explore a protective or harmful factor. The trial
and outcome types may need to be reinterpreted; for example, for a harmful risk factor
in an observational study, the favorable or unfavorable outcome types would be reversed.
This would be an example of when the option force would be used. An observational
study design to demonstrate a protective factor could be designed in exactly the same
way as a trial, but the term superiority might be replaced by benefit. This is further
described in the newly available artcat, a Stata program to calculate sample size or
power for a two-group trial with an ordered categorical outcome (White et al. 2023).
A useful future extension will be for artbin to handle the conditional test for
noninferiority or substantial-superiority trials.
8 Acknowledgments
This work was supported by the Medical Research Council Unit Programme number
MC_UU_00004/09. We thank Henry Bern and Tim Morris for their very helpful
comments and for testing the program.
9 Programs and supplemental materials

To install a snapshot of the corresponding software files as they existed at the time of
publication of this article, type
. net sj 23-1
. net install st0013_3 (to install program files, if available)
. net get st0013_3 (to install ancillary files, if available)
The artbin command is also available on the Statistical Software Components
archive and can be installed directly in Stata with the command
. ssc install art
All the code we used for testing and the output testing files are included in the
package. The files are also available along with the program itself on the GitHub
repository https://github.com/UCL/artbin.
10 References
Attanasio, O., A. D. Kugler, and C. Meghi. 2011. Subsidizing vocational training for
disadvantaged youth in Colombia: Evidence from a randomized trial. American
Economic Journal: Applied Economics 3: 188–220. https://doi.org/10.1257/app.3.3.188.
Barthel, F. M.-S., A. Babiker, P. Royston, and M. K. B. Parmar. 2006. Evaluation of
sample size and power for multi-arm survival trials allowing for non-uniform accrual,
non-proportional hazards, loss to follow-up and cross-over. Statistics in Medicine 25:
2521–2542. https://doi.org/10.1002/sim.2517.
Barthel, F. M.-S., P. Royston, and A. Babiker. 2005. A menu-driven facility for complex
sample size calculation in randomized controlled trials with a survival or a
binary outcome: Update. Stata Journal 5: 123–129.
https://doi.org/10.1177/1536867X0500500114.
Blackwelder, W. C. 1982. “Proving the null hypothesis” in clinical trials. Controlled
Clinical Trials 3: 345–353. https://doi.org/10.1016/0197-2456(82)90024-1.
Box, G. E. P. 1954. Some theorems on quadratic forms applied in the study of analysis
of variance problems, I. Effect of inequality of variance in the one-way classification.
Annals of Mathematical Statistics 25: 290–302. https://doi.org/10.1214/aoms/1177728786.
Braga, A. A., D. L. Weisburd, E. J. Waring, L. G. Mazerolle, W. Spelman, and
F. Gajewski. 1999. Problem-oriented policing in violent crime places: A randomized
controlled experiment. Criminology 37: 541–580.
https://doi.org/10.1111/j.1745-9125.1999.tb00496.x.
Clark, T., U. Berger, and U. Mansmann. 2013. Sample size determinations in original
research protocols for randomised clinical trials submitted to UK research ethics
committees: Review. BMJ 346: f1135. https://doi.org/10.1136/bmj.f1135.
Farrington, C., and G. Manning. 1990. Test statistics and sample size formulae for
comparative binomial trials with null hypothesis of non-zero risk difference or
non-unity relative risk. Statistics in Medicine 9: 1447–1454.
https://doi.org/10.1002/sim.4780091208.
Fleiss, J. L., A. Tytun, and H. K. Ury. 1980. A simple approximation for calculating
sample sizes for comparing independent proportions. Biometrics
36: 343–346. https://doi.org/10.2307/2529990.
Julious, S. A., and M. J. Campbell. 2012. Tutorial in biostatistics: Sample sizes for
parallel group clinical trials with binary data. Statistics in Medicine 31: 2904–2936.
https://doi.org/10.1002/sim.5381.
Julious, S. A., and R. J. Owen. 2011. A comparison of methods for sample size estimation
for non-inferiority studies with binary outcomes. Statistical Methods in Medical
Research 20: 595–612. https://doi.org/10.1177/0962280210378945.
Krause, P., T. R. Fleming, I. Longini, A. M. Henao-Restrepo, and R. Peto. 2020. COVID-19
vaccine trials should seek worthwhile efficacy. Lancet 396: 741–743.
https://doi.org/10.1016/S0140-6736(20)31821-3.
Mathai, A., and S. Provost. 1992. Quadratic Forms in Random Variables: Theory and
Applications. New York: Dekker.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. London:
Chapman & Hall/CRC.
Nunn, A. J., P. P. J. Phillips, S. K. Meredith, C.-Y. Chiang, F. Conradie, D. Dalai,
A. van Deun, et al. 2019. A trial of a shorter regimen for rifampin-resistant tuberculosis.
New England Journal of Medicine 380: 1201–1213.
https://doi.org/10.1056/NEJMoa1811867.
Pocock, S. J. 1983. Clinical Trials: A Practical Approach. Chichester, U.K.: Wiley.
Pocock, S. J. 2003. The pros and cons of noninferiority trials. Fundamental and Clinical
Pharmacology 17: 483–490. https://doi.org/10.1046/j.1472-8206.2003.00162.x.
Quartagno, M., A. S. Walker, A. G. Babiker, R. M. Turner, M. K. B. Parmar, A. Copas,
and I. R. White. 2020. Handling an uncertain control group event risk in
non-inferiority trials: Non-inferiority frontiers and the power-stabilising transformation.
Trials 21: 145. https://doi.org/10.1186/s13063-020-4070-4.
Quigley, J. M., J. C. Thompson, N. J. Halfpenny, and D. A. Scott. 2019. Critical
appraisal of nonrandomized studies—A review of recommended and commonly used
tools. Journal of Evaluation in Clinical Practice 25: 44–52.
https://doi.org/10.1111/jep.12889.
Rencher, A. C., and G. B. Schaalje. 2008. Linear Models in Statistics. 2nd ed. Hoboken,
NJ: Wiley.
Royston, P., and F. M.-S. Barthel. 2010. Projection of power and events in clinical trials
with a time-to-event outcome. Stata Journal 10: 386–394.
https://doi.org/10.1177/1536867X1001000306.
Satterthwaite, F. E. 1941. Synthesis of variance. Psychometrika 6: 309–316.
https://doi.org/10.1007/BF02288586.
Sealed Envelope. 2012. Power calculator for binary outcome non-inferiority trial.
https://www.sealedenvelope.com/power/binary-noninferior/.
Welch, B. L. 1938. The significance of the difference between two means when the
population variances are unequal. Biometrika 29: 350–362.
https://doi.org/10.2307/2332010.
White, I. R., E. Marley-Zagar, T. P. Morris, M. K. B. Parmar, P. Royston, and A. G.
Babiker. 2023. artcat: Sample-size calculation for an ordered categorical outcome.
Stata Journal 23: 3–23. https://doi.org/10.1177/1536867X231161934.
Yuan, K.-H., and P. M. Bentler. 2010. Two simple approximations to the distributions
of quadratic forms. British Journal of Mathematical and Statistical Psychology 63:
273–291. https://doi.org/10.1348/000711009X449771.
About the authors

Ella Marley-Zagar is a senior research associate and medical statistician in methodological
software at the MRC Clinical Trials Unit in London, U.K. Her interests include developing new
software and research into health and the environment, particularly issues affecting lower- and
middle-income countries.
Ian White is a professor of statistical methods for medicine at the MRC Clinical Trials Unit in
London, U.K., where he coleads programs of design of clinical trials, analysis of clinical trials,
and meta-analysis. His research interests include study design, handling missing data and
noncompliance in clinical trials, statistical models for meta-analysis, and simulation studies.
He is the author of other Stata commands, including mvmeta, network, and simsum.
Patrick Royston is a medical statistician with more than 40 years of experience and a strong
interest in biostatistical methods and in statistical computing and algorithms. He works largely
in methodological issues in the design and analysis of clinical trials and observational studies.
He is currently focusing on alternative outcome measures and tests of treatment effects in trials
with a time-to-event outcome and nonproportional hazards, on parametric modeling of survival
data, and on novel clinical trial designs.
Sophie Barthel is currently a functional manager of the real world solutions group at PRA/ICON
PLC. Her work includes consultancy in clinical research in the areas of clinical trials and
real world data. She is a published author of international research papers in statistics and
eating disorders and has presented at many international conferences, including several invited
presentations.
Mahesh Parmar is a professor of medical statistics and epidemiology and the director of the
MRC Clinical Trials Unit at University College London and the Institute of Clinical Trials
and Methodology at University College London. The unit he directs is at the forefront of
resolving internationally important questions, particularly in infectious diseases, cancer, and
more recently neurodegenerative diseases, and it also aims to deliver swifter and more effective
translation of scientific research into patient benefits by carrying out challenging and
innovative studies and by developing and implementing methodological advances in study design,
conduct, and analysis. Examples of his methodological contributions include the development
and implementation of the MAMS platform and DURATIONS designs.
Abdel Babiker is a professor of epidemiology and medical statistics at the MRC Clinical Trials
Unit at University College London. He works on clinical trials in infectious diseases, including
HIV, influenza, and COVID-19, and associated methodology.
Appendix 1: Description of what has changed

Program structure
artbin calls the subroutine art2bin for all two-group trials, which also allows for
substantial-superiority trials. Previously, art2bin was called only for noninferiority
trials in artbin; now it is called for all two-group trials. art2bin can be used as a
standalone command, but we do not recommend this.
New syntax
Some improvements have been made to artbin. The user will need to alter previous
coding using artbin to accommodate the following changes.
The syntax for artbin has been updated to include a margin() option for two-group
trials. For a noninferiority or substantial-superiority trial, the program will
use pr(p1 p2) and the new option margin(). For example, in the previous version
(artbin version 1.1.2), the syntax artbin, pr(.2 .3) ni(1) would now be specified
as artbin, pr(.2 .2) margin(.1). The option ni() is now redundant.
Previously, local was taken as the default in superiority trials. Now it is distant;
the distant() option has been replaced by local in the syntax. Previous syntax (up
to version 1.1.2) will need to be altered so that artbin, pr(.1 .2) distant(1) will
now be artbin, pr(.1 .2) and artbin, pr(.1 .2) distant(0) will now be artbin,
pr(.1 .2) local, for example.
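The syntax changes described so far can be laid out side by side. The commands below are the ones quoted in the text; the old forms appear only as comments because they apply to versions up to 1.1.2.

```stata
* Noninferiority: ni() is replaced by margin()
* old (up to version 1.1.2):  artbin, pr(.2 .3) ni(1)
artbin, pr(.2 .2) margin(.1)

* Superiority: distant alternatives are now the default
* old (up to version 1.1.2):  artbin, pr(.1 .2) distant(1)
artbin, pr(.1 .2)

* Superiority with local alternatives: distant(0) is replaced by local
* old (up to version 1.1.2):  artbin, pr(.1 .2) distant(0)
artbin, pr(.1 .2) local
```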
The user may identify whether the outcome is favorable or unfavorable in the context
of a two-group trial. With this information plus the margin, the program will then
determine the type of trial (that is, noninferiority, substantial superiority, or
superiority). If the user specifies favorable or unfavorable, the program will check the
assumptions. If not, the program will infer the outcome type. The force option can be used to
override the program’s inference of the favorability status, for example, in the design of
observational studies.
The wald option has also been included for the Wald test as an alternative to the
default score test.
Sample size per group is now reported, and rounding up to the nearest integer is
performed per group. A noround option has been included for the case when the user
does not want artbin to round the calculated sample size up to the nearest integer. A
loss to follow-up option is now available.
The option condit always implies the local option because there is no conditional
distant option available in artbin. If the conditional option is selected, then local will
be used (instead of the default distant).
The allocation ratio reflects the fact that sample size is now rounded upward in each
group rather than overall, and the expected number of events is calculated using the
rounded sample size (unless the noround option for calculated sample size is used).
Earlier versions of artbin required several yes or no options to be specified numerically,
for example, onesided(1) or onesided(0). In updating the syntax, we have
enabled the more standard options, for example, onesided and ccorrect, but the numerical
version of the syntax is retained if the user wishes to use it.
The number of groups is taken as the number of anticipated probabilities in all cases,
and the ngroups() option is now redundant. The required option pr() now takes a
numlist instead of a string.
Changes have been made to the output table so that results are presented in the
same format whether or not art2bin was called. Now included in the description table are
whether the trial is noninferiority, substantial superiority, or superiority; the trial
outcome type; the statistical test assumed (including score or Wald); whether local or
distant alternatives were used and the hypothesis tests; and whether the continuity
correction was used. Minor formatting changes were also made to the existing allocation ratio, alpha,
linear trend output, and version numbering output. Sample size per group is reported,
and the returned values have been streamlined to include only results as opposed to
user-inputted options.
The text output has been changed from p0 and p1 to π1 and π2. Therefore, the
control-group anticipated outcome probability for noninferiority trials is π1. The
corresponding hypothesis tests (included in the output table) are
H0: π2 − π1 >= m (or <= m)
Ha: π2 − π1 < m (or > m)
The program will now produce error or warning messages for disallowed or uncoded
combinations of options, namely,
• noninferiority or substantial-superiority design with conditional test or trend,
• conditional test and nonlocal alternatives,
• conditional test and Wald test,
• Wald test and local alternatives, and
• continuity correction and the conditional case.
Also, an error message will be produced for > 2 groups if the user specifies fewer
numbers in aratios() than in pr().
Appendix 2: Details of methods

Comparison of K anticipated probabilities
The unconditional tests are based on the score vector $U = (U_2, \ldots, U_K)'$, where
$U_k = \hat\pi_k - \hat\pi$.
Define within-group variances $s_k = \pi_k(1 - \pi_k)$ within group $k$ and
$\bar s = \sum_{k=1}^K r_k s_k$ overall, and define total variance $s = \bar\pi(1 - \bar\pi)$.
Under the null hypothesis $H_0: \pi_1 = \pi_2 = \cdots = \pi_K$, we have $E(U_k \mid H_0) = 0$ and
$N \operatorname{cov}(U_k, U_l \mid H_0) = v_{kl} = s(\delta_{kl}/r_k - 1)$. The global alternative
hypothesis is $H_a: \pi_k \neq \pi_l$ for some $k, l$. Under the anticipated probabilities
$\pi_k = \pi_k^a$ for all $k$, we have $E(U_k \mid H_a) = \mu_k = \pi_k^a - \bar\pi^a$ and
$N \operatorname{cov}(U_k, U_l \mid H_a) = a_{kl} = s_k(\delta_{kl}/r_k - 1) - s_l + \bar s$.
Let $\mu = (\mu_2, \ldots, \mu_K)'$; $V_u = (v_{kl})_{k,l=2,\ldots,K}$; $A = (a_{kl})_{k,l=2,\ldots,K}$;
and $\hat V_u$ and $\hat A$ be the sample estimates of $V_u$ and $A$ obtained by replacing
$\{\pi_k; k = 1, \ldots, K\}$ by their sample estimates.
We consider first the unconditional score tests for heterogeneity and trend and then
the equivalent Wald and conditional tests.
Unconditional score test for K groups
The score test statistic is $Q_u = N U' \hat V_u^{-1} U$. Direct expansion of the quadratic form
shows that $Q_u$ is equal to the Pearson statistic
$$Q_u = \sum_{k=1}^K \left(Y_k - r_k N \hat\pi\right)^2 \Big/ \left\{ r_k N \hat\pi \left(1 - \hat\pi\right) \right\}$$
Asymptotically, under $H_0$, $Q_u \sim \chi^2_{K-1}$, a central $\chi^2$ distribution with $K - 1$ degrees
of freedom. Denoting the $(1 - \alpha)100$th percentile of the (central) $\chi^2$ distribution with
$m$ degrees of freedom by $x_\alpha(m)$, power is related to the total sample size $N$ by the
equation
$$\text{power} = \Pr\{Q_u > x_\alpha(K - 1) \mid H_a\}$$
There is no analytic solution to this equation for $K > 2$. We consider two ways
to approximate the asymptotic distribution of $Q_u$ under $H_a$ in terms of a noncentral
$\chi^2$ distribution with $K - 1$ degrees of freedom and noncentrality parameter $\lambda$, whose
cumulative distribution function we denote as $F_{K-1,\lambda}(x)$.
Local alternative method. If $\max |\pi_i^a - \pi_j^a|$ is small, the asymptotic distribution of
$Q_u$ may be approximated by a noncentral $\chi^2$ with $K - 1$ degrees of freedom and
noncentrality parameter $\lambda = N \mu' V_u^{-1} \mu = N \sum_k \mu_k^2 r_k / s$.
Then the key equation is
$$\text{power} = 1 - F_{K-1,\lambda}\{x_\alpha(K - 1)\} \qquad (2)$$
which we solve for power given N or for N given power.
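To make equation (2) concrete, the sketch below evaluates it by hand with Stata's built-in invchi2() and nchi2() functions. It is not artbin's code. The four anticipated probabilities, equal allocation, alpha of 0.1, and total sample size of 176 are illustrative assumptions, and the sum in the noncentrality parameter is taken over all K groups.

```stata
* Power under the local alternative method, equation (2) (illustrative values)
scalar K     = 4
scalar N     = 176                       // assumed total sample size
scalar alpha = 0.1
scalar pibar = (0.1 + 0.2 + 0.3 + 0.4)/4 // mean anticipated probability, r_k = 1/4
scalar s     = pibar*(1 - pibar)         // total variance
* sum over groups of r_k * mu_k^2, with mu_k = pi_k - pibar
scalar ssq = ((0.1 - pibar)^2 + (0.2 - pibar)^2 + (0.3 - pibar)^2 + (0.4 - pibar)^2)/4
scalar lambda = N*ssq/s                  // noncentrality parameter
scalar crit   = invchi2(K - 1, 1 - alpha) // x_alpha(K - 1)
scalar pw     = 1 - nchi2(K - 1, lambda, crit)
display "noncentrality parameter = " lambda
display "power (local method)    = " pw
```

Solving for N given power works the other way round: npnchi2() returns the noncentrality parameter that gives the target power, and N follows by dividing it by ssq/s.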
Approximate distant method. We instead approximate the distribution of $Q_u$ by
that of $cX$, where $c$ is a constant and $X$ is a noncentral $\chi^2$ random variable with
$K - 1$ degrees of freedom and noncentrality parameter $\gamma$; $c$ and $\gamma$ both depend
on $N$. Such an approximation for the two-group case was originally proposed by
Welch (1938) and further studied by Satterthwaite (1941) and Box (1954). See
also Yuan and Bentler (2010). The constant multiple $c$ and the noncentrality
parameter $\gamma$ are calculated by equating the first two moments of $Q_u$ and $cX$ using
well-known formulas for the mean and variance of quadratic forms of normal
variables (Mathai and Provost 1992; Rencher and Schaalje 2008):
$$E(Q_u) = \operatorname{tr}(V_u^{-1} A) + N \mu' V_u^{-1} \mu = c(K - 1 + \gamma)$$
$$\operatorname{var}(Q_u) = 2 \operatorname{tr}\left\{ \left(V_u^{-1} A\right)^2 \right\} + 4 N \mu' V_u^{-1} A V_u^{-1} \mu = 2 c^2 (K - 1 + 2\gamma)$$
We then modify (2) and solve the equation
$$\text{power} = 1 - F_{K-1,\gamma}\{x_\alpha(K - 1)/c\}$$
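The two moment equations can be solved for $c$ and $\gamma$ in closed form. The derivation below is one way to do so and is offered as a sketch; the shorthand $m_1$ and $m_2$ is introduced here for illustration, and artbin's implementation may proceed differently.

```latex
% Shorthand (assumed here): m_1 = E(Q_u), m_2 = var(Q_u)
\begin{align*}
  m_1 &= c\,(K - 1 + \gamma)
        \quad\Longrightarrow\quad \gamma = \frac{m_1}{c} - (K - 1), \\
  m_2 &= 2c^2\,(K - 1 + 2\gamma)
       = 4 m_1 c - 2 (K - 1) c^2, \\
  0   &= (K - 1)\,c^2 - 2 m_1 c + \tfrac{1}{2} m_2, \\
  c   &= \frac{m_1 - \sqrt{m_1^2 - (K - 1)\, m_2 / 2}}{K - 1}.
\end{align*}
% The smaller root is taken because gamma >= 0 requires c <= m_1/(K - 1).
% In the central case, m_1 = c(K - 1) and m_2 = 2c^2(K - 1), so the square
% root vanishes and gamma = 0.
```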
Wald test. The Wald test statistic is $Q_w = N U' \hat A^{-1} U$, and the formulas for power
and sample size are like those for the score test but with the covariance matrix
$V_u$ replaced by $A$. Thus, the asymptotic distribution of $Q_w$ is noncentral $\chi^2$ with
$K - 1$ degrees of freedom and noncentrality parameter $\lambda = N \mu' A^{-1} \mu$.
Unconditional score test for trend
For dose–response, the test for trend with dose scores $d_1, \ldots, d_K$ is based on the statistic
$T_u = \sum_{k=1}^K r_k d_k U_k = \sum_{k=1}^K c_k U_k$, where $c_k = r_k(d_k - b)$ and $b$ is an
arbitrary constant (because $\sum_{k=1}^K r_k U_k = 0$). Taking $b = d_1$, we have
$T_u = \sum_{k=2}^K c_k U_k = c'U$, where $c = (c_2, \ldots, c_K)'$. The mean and variance of $T_u$
under the null and alternative hypotheses are
$$E(T_u \mid H_0) = 0; \qquad \operatorname{var}(T_u \mid H_0) = c' V_u c / N$$
$$E(T_u \mid H_a) = c'\mu; \qquad \operatorname{var}(T_u \mid H_a) = c' A c / N$$
Let $z_a$ denote the $(1 - a)100$th percentile of the standard normal distribution. For a
one-sided test, let $a = \alpha$, and for a two-sided test, let $a = \alpha/2$. The total sample size
to achieve power $1 - \beta$ for a distant test is
$$N = \left( \frac{z_a \sqrt{c' V_u c} + z_\beta \sqrt{c' A c}}{c'\mu} \right)^2$$
For a local test, $A$ is replaced by $V_u$. Conversely, for a Wald test, $V_u$ is replaced by $A$.
Conditional test
Some analyses condition on the margins of the contingency table of outcome by treatment
group, for example, Fisher’s exact test. For such analyses, a conditional
calculation is preferred. As noted in the main text, this uses a different score vector
$X = (X_2, \ldots, X_K)'$, where $X_k = Y_k - r_k Y_\cdot$ and $Y_\cdot = \sum_{k=1}^K Y_k$ is the total
number of events.
Let $\eta_k = \log\{\pi_k/(1 - \pi_k)\} - \log\{\pi_1/(1 - \pi_1)\}$ denote the log odds-ratio for the
occurrence of the event in group $k$ relative to group 1, $\eta = (\eta_2, \ldots, \eta_K)'$. Conditional
on $Y_\cdot = y_\cdot$, $Y = (Y_2, \ldots, Y_K)'$ has a multivariate noncentral hypergeometric distribution
with support
$$D = \left\{ (y_2, \ldots, y_K) : 0 \le y_k \le n_k,\; 0 \le y_\cdot - \sum_{k=2}^K y_k \le n_1 \right\}$$
and probability function
$$f(y_2, \ldots, y_K) = \frac{1}{P} \prod_{k=1}^K \binom{n_k}{y_k} \exp(y_k \eta_k) \qquad (3)$$
where $y_1 = y_\cdot - \sum_{k=2}^K y_k$, $\eta_1 = 0$, and
$$P = \sum_{(y_2, \ldots, y_K) \in D} \prod_{k=1}^K \binom{n_k}{y_k} \exp(y_k \eta_k)$$
Denote by $e(\eta)$ and $C(\eta)$ the conditional mean and covariance matrix of $Y$. Differentiating
the log of $f(y_2, \ldots, y_K)$ in (3) with respect to $\eta$ yields the score (gradient vector
of the log-likelihood function) for $\eta$
$$S(\eta) \overset{\text{def}}{=} \partial \log\{f(y)\}/\partial \eta' = y - E(Y \mid y_\cdot, \eta)$$
the observed minus the (conditionally) expected number of events in groups $2, \ldots, K$
given $y_\cdot, \eta$. Under the null hypothesis $H_0: \eta = 0$, and conditional on $y_\cdot$, $Y$ has a
(central) hypergeometric distribution with the elements of the mean vector and covariance
matrix
$$e_k(0) \overset{\text{def}}{=} E(Y_k \mid y_\cdot, \eta = 0) = r_k y_\cdot$$
$$C(0)_{kl} \overset{\text{def}}{=} \operatorname{cov}(Y_k, Y_l \mid y_\cdot, \eta = 0) = M v_{kl} \qquad (4)$$
where
$$M = y_\cdot (N - y_\cdot)/(N - 1) \qquad (5)$$
and $v_{kl} = r_k(\delta_{kl} - r_l)$ (McCullagh and Nelder 1989, chap. 7).
The score statistic $Q_c = S(0)' C(0)^{-1} S(0)$ is a quadratic form based on the score
vector $S(0)$ and its covariance matrix $C(0)$ under the null hypothesis. Denote the $k$th
element of $S(0)$ by $X_k = Y_k - r_k y_\cdot$, $X = (X_2, \ldots, X_K)'$, and let $V_c$ be the
$(K - 1) \times (K - 1)$ matrix with elements $v_{kl}$. Using (4), the score statistic can be written as
$$Q_c = (X/\sqrt{M})' V_c^{-1} X/\sqrt{M} = X' V_c^{-1} X / M$$
Under $H_0$, the asymptotic distribution of $Q_c$ is (central) $\chi^2$ with $K - 1$ degrees
of freedom. However, there is no simple form for its asymptotic distribution under a
general alternative hypothesis $H_a: \eta \neq 0$. We use a local approach. Under a local
alternative [$\eta_k \sim O(1/\sqrt{N})$ for $k = 2, \ldots, K$], $S(\eta)$ can be approximated by a linear
function using a first-order Taylor expansion about $\eta = 0$,
$$S(\eta) \doteq S(0) + \dot S(0)\eta \qquad (6)$$
where $\dot S(\eta) = \partial S(\eta)/\partial \eta' = \partial^2 \log\{f(y)\}/\partial \eta \partial \eta'$ is the matrix of
second partial derivatives of the log likelihood with respect to $\eta$. Note that
$E\{S(\eta)\} = 0$ and $\operatorname{cov}\{S(\eta)\} = -E\{\dot S(\eta)\}$. Taking the expectation of both
sides of (6), we have
$$E(X \mid \eta) = E\{S(0) \mid \eta\} = M V_c \eta$$
Now let the anticipated value of $\eta$ be $\eta^a$ with
$\eta_k^a = \log\{\pi_k^a/(1 - \pi_k^a)\} - \log\{\pi_1^a/(1 - \pi_1^a)\}$ for all $k$. Under a local
alternative, the asymptotic distribution of $Q_c$ is noncentral $\chi^2$ with $K - 1$ degrees of
freedom and noncentrality parameter
$$\lambda = M q(\eta^a)$$
where $q(\eta) = \eta' V_c \eta$. We therefore have this equation relating power to $M$ and hence
to $N$:
$$\text{power} = 1 - F_{K-1,\lambda}\{x_\alpha(K - 1)\} \qquad (7)$$
Given N , we can compute M from (5) and hence compute power from (7). To
compute N from power, we first use (7) to compute λ. We then solve for N as follows.
Asymptotically, $M = T(N - T)/(N - 1)$, where $T = E(Y_\cdot) = N \bar\pi^a$ is the expected total
number of events. It follows that
$$\lambda = T(N - T) q(\eta)/(N - 1) = T(T/\bar\pi^a - T) q(\eta)/(T/\bar\pi^a - 1) \qquad (8)$$
Equation (8) is a quadratic equation in $T$ that can be expressed as
$$(1 - \bar\pi^a) q(\eta) T^2 - \lambda T + \lambda \bar\pi^a = 0$$
The smaller solution is inappropriate, and so
$$T = \left\{ \lambda + \sqrt{\lambda^2 - 4 q(\eta) \lambda \bar\pi^a (1 - \bar\pi^a)} \right\} \Big/ \left\{ 2 (1 - \bar\pi^a) q(\eta) \right\}$$
Finally, the total sample size $N = T/\bar\pi^a$.
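The steps from power to N can be traced numerically. The sketch below is a hand calculation with Stata's built-in functions, not artbin's code. It assumes a two-group design with anticipated probabilities 0.1 and 0.05, equal allocation, two-sided alpha of 0.05, and power of 0.9, so that $V_c$ reduces to the single element $r_2(1 - r_2)$.

```stata
* Conditional (local) method: sample size from power (illustrative values)
scalar K     = 2
scalar alpha = 0.05
scalar pw    = 0.9
scalar r2    = 0.5                          // allocation fraction, group 2
scalar pibar = 0.5*0.1 + 0.5*0.05           // mean anticipated probability
scalar eta2  = logit(0.05) - logit(0.1)     // anticipated log odds-ratio
scalar q     = r2*(1 - r2)*eta2^2           // q(eta) = eta' Vc eta
scalar crit  = invchi2(K - 1, 1 - alpha)    // x_alpha(K - 1)
* noncentrality giving the target power: nchi2(df, lambda, crit) = 1 - power
scalar lambda = npnchi2(K - 1, crit, 1 - pw)
* larger root of the quadratic in T, the expected total number of events
scalar T = (lambda + sqrt(lambda^2 - 4*q*lambda*pibar*(1 - pibar)))/(2*(1 - pibar)*q)
scalar N = T/pibar
display "expected total events T = " T
display "total sample size N     = " N
```

The result is unrounded; artbin additionally rounds up within each group unless noround is specified.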
Conditional test for trend
For dose–response, the test for trend with dose scores $d_1, \ldots, d_K$ is based on the statistic
$T_c = c'X/\sqrt{M} = \sum_{k=2}^K c_k X_k/\sqrt{M}$, where as before $c = (c_2, \ldots, c_K)'$;
$c_k = r_k(d_k - d_1)$; and $M = y_\cdot(N - y_\cdot)/(N - 1)$. The mean and variance of $T_c$ under
the null and alternative hypotheses are
$$E(T_c \mid H_0) = 0; \qquad \operatorname{var}(T_c \mid H_0) = c' V_c c$$
$$E(T_c \mid H_a) = \sqrt{M}\, c' V_c \eta; \qquad \operatorname{var}(T_c \mid H_a) = c' V_c c$$
The total sample size to achieve power $1 - \beta$ is obtained by computing
$$M = \left( \frac{z_a \sqrt{c' V_c c} + z_\beta \sqrt{c' V_c c}}{c' V_c \eta} \right)^2$$
equating $M$ to its asymptotic value
$$E(M) = E(Y_\cdot)\{N - E(Y_\cdot)\}/(N - 1)$$
and noting that $E(Y_\cdot) = N \bar\pi^a$ as in the derivation of (8).
Comparing two treatment groups: Noninferiority and substantial superiority
Two-arm studies to assess superiority of an experimental treatment use the formulas
given above for K groups. In studies designed to assess noninferiority or substantial
superiority of an experimental treatment (group 2) relative to a control treatment
(group 1), the aim is to test whether the outcome in two treatment groups differs by
more than a prespecified amount, and the single parameter of interest is $\delta = \pi_2 - \pi_1$. If
the binary outcome is unfavorable, the null hypothesis for testing noninferiority takes
the form $H_0: \delta \le m$, where $m$ is a prespecified margin and the alternative hypothesis is
$H_a: \delta > m$. The null hypothesis is tested at its boundary $H_0: \delta = m$. As above, let $Y_i$
be the number of events in group $i$, $\hat\pi_i = Y_i/n_i$, and $n_i = r_i N$, for $i = 1, 2$.
We consider test statistics of the form $T_* = \hat\delta - m = \hat\pi_2 - \hat\pi_1 - m$, whose distribution
under the null hypothesis is approximately $N(0, V_n/N)$, for various definitions of the
variance $V_n$ (discussed below). The anticipated distribution of $T_*$ under $H_a$ is
$N(\delta - m, V_a/N)$, where $V_a = \pi_1^a(1 - \pi_1^a)/r_1 + \pi_2^a(1 - \pi_2^a)/r_2$. The sample size for a
two-sided test at level $\alpha$ (one-sided test at level $\alpha/2$), power $1 - \beta$, is
$$N = \left( z_{\alpha/2} \sqrt{V_n} + z_\beta \sqrt{V_a} \right)^2 / (\delta - m)^2$$
It remains to specify the variance $V_n$, using the form
$$V_n = \tilde\pi_1^a (1 - \tilde\pi_1^a)/r_1 + \tilde\pi_2^a (1 - \tilde\pi_2^a)/r_2$$
where $\tilde\pi_1^a$ and $\tilde\pi_2^a$ are the values $\pi_1^a$ and $\pi_2^a$ modified so that
$\tilde\pi_2^a - \tilde\pi_1^a = m$. They may be computed in several ways (Farrington and Manning 1990):
• Score test (distant): $\tilde\pi_1^a$ and $\tilde\pi_2^a$ are maximum likelihood estimates of $\pi_1$ and
$\pi_2$ constrained to $\delta = m$.
• Score test with local approximation: like the score test, but $V_a$ is set to equal
$V_n$. Unlike in the case of a superiority trial, this approximation is not a simpler
calculation than the more appropriate distant calculation, so it should not be used.
• Wald test: $\tilde\pi_1^a = \pi_1^a$ and $\tilde\pi_2^a = \pi_2^a$; equivalently, $V_n = V_a$.
• Score test variant: $\tilde\pi_1^a$ and $\tilde\pi_2^a$ are estimates of $\pi_1^a$ and $\pi_2^a$ constrained to
$\delta = m$ and $r_1 \tilde\pi_1^a + r_2 \tilde\pi_2^a = r_1 \pi_1^a + r_2 \pi_2^a$. These constraints amount to
fixing the margins, like the conditional test; however, the score test variant is not a
conditional method, because it is based on the risk difference, whereas the conditional
test is based on the odds ratio.
The score test variant is available (but not recommended) by setting the null variance
method using the undocumented option nvmethod(2), where nvmethod(1)
corresponds to the Wald test and nvmethod(3) corresponds to the score test. The
nvmethod() option was used more widely in earlier versions of artbin.
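The sample-size formula and two of the null-variance choices can be worked through by hand. The sketch below is illustrative and is not artbin's code. It assumes anticipated probabilities of 0.9 in both groups, a margin of −0.05, equal allocation, two-sided alpha of 0.05, and power of 0.9. For the score test variant, the two constraints are solved in closed form as $\tilde\pi_1^a = \bar\pi - r_2 m$ and $\tilde\pi_2^a = \bar\pi + r_1 m$ with $\bar\pi = r_1\pi_1^a + r_2\pi_2^a$, which follows directly from the constraints stated above. The constrained maximum likelihood estimates needed for the default score test are not reproduced here.

```stata
* Noninferiority sample size: Wald test and score test variant (illustrative)
scalar p1 = 0.9                     // anticipated control probability
scalar p2 = 0.9                     // anticipated experimental probability
scalar m  = -0.05                   // noninferiority margin
scalar r1 = 0.5
scalar r2 = 0.5
scalar za = invnormal(1 - 0.05/2)   // two-sided test at level 0.05
scalar zb = invnormal(0.9)          // power 0.9
scalar Va = p1*(1 - p1)/r1 + p2*(1 - p2)/r2

* Wald test: null variance equals the alternative variance
scalar N_wald = (za*sqrt(Va) + zb*sqrt(Va))^2/((p2 - p1) - m)^2

* Score test variant: shift probabilities to differ by m, keeping the margin fixed
scalar pbar = r1*p1 + r2*p2
scalar q1 = pbar - r2*m
scalar q2 = pbar + r1*m
scalar Vn = q1*(1 - q1)/r1 + q2*(1 - q2)/r2
scalar N_variant = (za*sqrt(Vn) + zb*sqrt(Va))^2/((p2 - p1) - m)^2

display "total N, Wald test          = " N_wald
display "total N, score test variant = " N_variant
```

Both totals are unrounded and exclude any continuity correction or allowance for loss to follow-up.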
Continuity correction
The continuity-corrected sample size is estimated by computing the unadjusted sample
size in each group and then inflating these by the factor
$$\frac{1}{4} \left( 1 + \sqrt{1 + \frac{2c}{N_{un}}} \right)^2$$
where $N_{un}$ is the total unadjusted sample size and $c = 1/(r_1 r_2 |\delta - m|)$ (Fleiss, Tytun,
and Ury 1980).
The continuity-corrected power is estimated by deflating the given sample size $N_{adj}$
by a factor of
$$1 - \frac{c}{N_{adj}} \left( 1 - \frac{c}{4 N_{adj}} \right)$$
and then using the standard method on the deflated sample size.
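The two factors are inverses of each other, which a short numerical check makes visible. The values below (equal allocation, $|\delta - m| = 0.05$, and an unadjusted total of 1,000) are assumptions chosen for illustration.

```stata
* Continuity correction: inflate, then deflate back (illustrative values)
scalar r1  = 0.5
scalar r2  = 0.5
scalar dm  = 0.05                   // |delta - m|
scalar Nun = 1000                   // assumed unadjusted total sample size
scalar c   = 1/(r1*r2*dm)
* inflation factor applied to the unadjusted sample size
scalar Nadj  = Nun*(1/4)*(1 + sqrt(1 + 2*c/Nun))^2
* deflation factor applied to the adjusted sample size
scalar Nback = Nadj*(1 - (c/Nadj)*(1 - c/(4*Nadj)))
display "adjusted N        = " Nadj
display "deflated back to  = " Nback
```

Deflating the adjusted value recovers the unadjusted total because the deflation factor equals $\{1 - c/(2N_{adj})\}^2$, the algebraic inverse of the inflation step.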
