Formulas
Appendix A
Contents
Glossaries
Acronyms
A.1 Introduction, descriptive statistics, R and data visualization
This appendix chapter holds a collection of formulas. All the relevant equations from
definitions, methods and theorems are included, along with the associated R functions.
They are listed in the same order as in the book, except for the distributions, which are
listed together.
1.4 Sample mean
The mean of a sample.
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
R: mean(x)
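1.5 Sample median
The value that divides a sample in two halves with an equal number of observations in each.
$$Q_2 = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{for odd } n \\ \dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n+2}{2}\right)}}{2} & \text{for even } n \end{cases}$$
R: median(x)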
1.7 Sample quantile
The value that divides a sample such that a proportion p of the observations are less than the value. The 0.5 quantile is the median.
$$q_p = \begin{cases} \dfrac{x_{(np)} + x_{(np+1)}}{2} & \text{for } pn \text{ integer} \\ x_{(\lceil np \rceil)} & \text{for } pn \text{ non-integer} \end{cases}$$
R: quantile(x, p, type=2)
1.8 Sample quartiles
The quartiles are the five quantiles dividing the sample in four parts, such that each part holds an equal number of observations.
$$\begin{aligned}
Q_0 &= q_0 = \text{"minimum"} \\
Q_1 &= q_{0.25} = \text{"lower quartile"} \\
Q_2 &= q_{0.5} = \text{"median"} \\
Q_3 &= q_{0.75} = \text{"upper quartile"} \\
Q_4 &= q_1 = \text{"maximum"}
\end{aligned}$$
R: quantile(x, probs, type=2), where probs=p
1.10 Sample variance
The sum of squared differences from the mean divided by n − 1.
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
R: var(x)
1.11 Sample standard deviation
The square root of the sample variance.
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
R: sd(x)
1.18 Sample covariance
Measure of the linear strength of the relation between two samples.
$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
R: cov(x,y)
1.19 Sample correlation
Measure of the linear strength of the relation between two samples, between -1 and 1.
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{s_{xy}}{s_x \cdot s_y}$$
R: cor(x,y)
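As a rough check of the definitions in this section, the following R sketch computes the summary statistics directly from their formulas and compares them with the built-in functions. The data vectors x and y and the helper sample_quantile are hypothetical, introduced here only for illustration.

```r
# Hypothetical example data, chosen only for illustration
x <- c(2, 4, 4, 5, 7, 9, 10, 12)
y <- c(1, 3, 2, 5, 6, 8, 9, 13)
n <- length(x)

x_bar <- sum(x) / n                                   # 1.4 sample mean
s2    <- sum((x - x_bar)^2) / (n - 1)                 # 1.10 sample variance
s     <- sqrt(s2)                                     # 1.11 sample standard deviation
s_xy  <- sum((x - x_bar) * (y - mean(y))) / (n - 1)   # 1.18 sample covariance
r     <- s_xy / (s * sd(y))                           # 1.19 sample correlation

# 1.7 sample quantile (hypothetical helper, not a base R function)
sample_quantile <- function(x, p) {
  xs <- sort(x)
  n  <- length(xs)
  np <- n * p
  if (isTRUE(all.equal(np, round(np)))) {
    k <- round(np)
    # Average of x_(np) and x_(np+1); indices are clamped so p = 0 and p = 1 work
    (xs[max(k, 1)] + xs[min(k + 1, n)]) / 2
  } else {
    xs[ceiling(np)]
  }
}

# Each comparison with the built-in function should give TRUE
all.equal(x_bar, mean(x))
all.equal(s2, var(x))
all.equal(s, sd(x))
all.equal(s_xy, cov(x, y))
all.equal(r, cor(x, y))
all.equal(sample_quantile(x, 0.25), unname(quantile(x, 0.25, type = 2)))
```

The type=2 argument matters here: the default quantile type in R interpolates differently and will not in general agree with definition 1.7.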
A.2 Probability and simulation
2.58 Covariance
The covariance between two random variables X and Y.
$$\mathrm{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$$
A.2.1 Distributions
All the included distributions are listed here, together with some important theorems and definitions
related specifically to each distribution.
2.20 Binomial distribution
n is the number of independent draws and p is the probability of a success in each draw. The binomial pdf describes the probability of x successes.
$$f(x; n, p) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad \text{where } \binom{n}{x} = \frac{n!}{x!(n-x)!}$$
R: dbinom(x, size, prob), pbinom(q, size, prob), qbinom(p, size, prob), rbinom(n, size, prob), where size=n, prob=p
2.21 Mean and variance of a binomial distributed random variable
$$\mu = np, \qquad \sigma^2 = np(1-p)$$
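A minimal R sketch of the binomial formulas: it evaluates the pdf from the expression in 2.20, compares it with dbinom, and recovers the mean and variance in 2.21 by summing over all outcomes. The values n = 10 and p = 0.3 are hypothetical.

```r
# Hypothetical parameter values, chosen only for illustration
n <- 10
p <- 0.3
x <- 0:n

# 2.20: pdf written out from the formula
f <- choose(n, x) * p^x * (1 - p)^(n - x)
all.equal(f, dbinom(x, size = n, prob = p))   # should be TRUE

# 2.21: mean and variance computed from the pdf
mu     <- sum(x * f)
sigma2 <- sum((x - mu)^2 * f)
all.equal(mu, n * p)                # should be TRUE
all.equal(sigma2, n * p * (1 - p))  # should be TRUE
```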
2.43 Transformation of a normal distributed random variable X into a standardized normal random variable
$$Z = \frac{X - \mu}{\sigma}$$
2.46 Log-normal distribution
α is the mean and β² is the variance of the normal distribution obtained when taking the natural logarithm of X.
$$f(x) = \frac{1}{x \sqrt{2\pi}\, \beta} e^{-\frac{(\ln x - \alpha)^2}{2\beta^2}}$$
R: dlnorm(x,meanlog,sdlog), plnorm(q,meanlog,sdlog), qlnorm(p,meanlog,sdlog), rlnorm(n,meanlog,sdlog), where meanlog=α, sdlog=β
2.47 Mean and variance of a log-normal distributed random variable
$$\mu = e^{\alpha + \beta^2/2}, \qquad \sigma^2 = e^{2\alpha + \beta^2} \left(e^{\beta^2} - 1\right)$$
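The log-normal formulas can be checked numerically. The sketch below writes the pdf from 2.46 as an R function, compares it with dlnorm, and compares the closed-form mean and variance in 2.47 with numerical integration. The parameter values and the function name f_lnorm are hypothetical.

```r
# Hypothetical parameter values, chosen only for illustration
alpha <- 0.5   # meanlog
beta  <- 0.4   # sdlog

# 2.46: pdf written out from the formula (hypothetical helper)
f_lnorm <- function(x) {
  1 / (x * sqrt(2 * pi) * beta) * exp(-(log(x) - alpha)^2 / (2 * beta^2))
}
x <- c(0.5, 1, 2, 5)
all.equal(f_lnorm(x), dlnorm(x, meanlog = alpha, sdlog = beta))  # should be TRUE

# 2.47: closed-form mean and variance
mu     <- exp(alpha + beta^2 / 2)
sigma2 <- exp(2 * alpha + beta^2) * (exp(beta^2) - 1)

# Numerical integration of the same quantities for comparison
mu_num     <- integrate(function(x) x * f_lnorm(x), 0, Inf)$value
sigma2_num <- integrate(function(x) (x - mu)^2 * f_lnorm(x), 0, Inf)$value
c(mu = mu, mu_num = mu_num, sigma2 = sigma2, sigma2_num = sigma2_num)
```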
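2.48 Exponential distribution
λ is the mean rate of events.
$$f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \geq 0 \\ 0 & \text{for } x < 0 \end{cases}$$
R: dexp(x,rate), pexp(q,rate), qexp(p,rate), rexp(n,rate), where rate=λ
2.49 Mean and variance of an exponential distributed random variable
$$\mu = \frac{1}{\lambda}, \qquad \sigma^2 = \frac{1}{\lambda^2}$$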
2.78 χ²-distribution
Γ(·) is the Γ-function and ν is the degrees of freedom.
$$f(x) = \frac{1}{2^{\nu/2}\, \Gamma\left(\frac{\nu}{2}\right)} x^{\frac{\nu}{2} - 1} e^{-\frac{x}{2}}; \quad x \geq 0$$
R: dchisq(x,df), pchisq(q,df), qchisq(p,df), rchisq(n,df), where df=ν
2.87 Relation between normal random variables and χ²-distributed random variables
Z ∼ N(0, 1) and Y ∼ χ²(ν).
$$X = \frac{Z}{\sqrt{Y/\nu}} \sim t(\nu)$$
R: dt(x,df), pt(q,df), qt(p,df), rt(n,df), where df=ν
2.89 For normal distributed random variables X1, ..., Xn, the random variable T follows the t-distribution, where X̄ is the sample mean, µ is the mean of X, n is the sample size and S is the sample standard deviation.
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1)$$
2.93 Mean and variance of a t-distributed variable X
$$\mu = 0; \quad \nu > 1, \qquad \sigma^2 = \frac{\nu}{\nu - 2}; \quad \nu > 2$$
2.95 F-distribution
ν1 and ν2 are the degrees of freedom and B(·, ·) is the Beta function.
$$f_F(x) = \frac{1}{B\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}} x^{\frac{\nu_1}{2} - 1} \left(1 + \frac{\nu_1}{\nu_2} x\right)^{-\frac{\nu_1 + \nu_2}{2}}$$
R: df(x,df1,df2), pf(q,df1,df2), qf(p,df1,df2), rf(n,df1,df2), where df1=ν1, df2=ν2
2.96 The F-distribution appears as the ratio between two independent χ²-distributed random variables with U ∼ χ²(ν1) and V ∼ χ²(ν2).
$$\frac{U/\nu_1}{V/\nu_2} \sim F(\nu_1, \nu_2)$$
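The ratio construction of the F-distribution can be illustrated by simulation. The sketch below draws independent χ²-distributed values, forms the ratio, and compares a simulated probability with pf. The degrees of freedom, the number of draws and the seed are hypothetical choices.

```r
set.seed(1)      # fixed seed so the sketch is reproducible
nu1 <- 4         # hypothetical degrees of freedom
nu2 <- 12
k   <- 100000    # number of simulated ratios

U <- rchisq(k, df = nu1)
V <- rchisq(k, df = nu2)
F_sim <- (U / nu1) / (V / nu2)

# Simulated and theoretical P(F <= 2); the two should be close
p_sim    <- mean(F_sim <= 2)
p_theory <- pf(2, df1 = nu1, df2 = nu2)
c(simulated = p_sim, theoretical = p_theory)
```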
3.3 The distribution of the mean of normal random variables
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
3.5 The distribution of the σ-standardized mean of normal random variables
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N\left(0, 1^2\right)$$
3.5 The distribution of the S-standardized mean of normal random variables
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1)$$
3.7 Standard error of the mean
$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$
3.9 The one-sample confidence interval for µ
$$\bar{x} \pm t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
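3.14 Central Limit Theorem (CLT)
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
3.19 Confidence interval for the variance and standard deviation
$$\sigma^2: \left[\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}};\ \frac{(n-1)s^2}{\chi^2_{\alpha/2}}\right]$$
$$\sigma: \left[\sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}};\ \sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2}}}\right]$$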
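The interval in 3.19 can be computed with qchisq. This is a minimal sketch, assuming the χ² quantiles use n − 1 degrees of freedom; the data vector x and the level α = 0.05 are hypothetical.

```r
# Hypothetical sample, chosen only for illustration
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 6.2)
n <- length(x)
s2 <- var(x)
alpha <- 0.05

# 3.19: confidence interval for the variance (chi-square quantiles with n - 1 df)
ci_var <- c((n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1),
            (n - 1) * s2 / qchisq(alpha / 2, df = n - 1))

# Confidence interval for the standard deviation: square root of the limits
ci_sd <- sqrt(ci_var)

ci_var
ci_sd
```

Note that the larger quantile gives the lower limit, since it appears in the denominator.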
3.33 Confidence interval for µ
$$\bar{x} \pm t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
Acceptance region/CI for $H_0: \mu = \mu_0$
3.36 The level α one-sample t-test
Test: $H_0: \mu = \mu_0$ and $H_1: \mu \neq \mu_0$ by
$$p\text{-value} = 2 \cdot P(T > |t_{\text{obs}}|)$$
Reject: p-value < α or $|t_{\text{obs}}| > t_{1-\alpha/2}$
Accept: Otherwise
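The one-sample t-test and the matching confidence interval can be worked through by hand and compared with t.test. In this sketch the data vector x, the null value mu0 and the level alpha are hypothetical, and the observed statistic is taken to be the S-standardized mean from 3.5 evaluated at µ0.

```r
# Hypothetical sample and hypothesis, chosen only for illustration
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 6.2)
mu0 <- 5
alpha <- 0.05

n  <- length(x)
se <- sd(x) / sqrt(n)                                  # 3.7 standard error

t_obs   <- (mean(x) - mu0) / se                        # observed test statistic
p_value <- 2 * (1 - pt(abs(t_obs), df = n - 1))        # 3.36 p-value
ci      <- mean(x) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se   # 3.9 / 3.33

reject <- p_value < alpha   # same decision as abs(t_obs) > qt(1 - alpha / 2, n - 1)

# The built-in test should give the same numbers
fit <- t.test(x, mu = mu0, conf.level = 1 - alpha)
all.equal(t_obs, unname(fit$statistic))
all.equal(p_value, fit$p.value)
all.equal(ci, as.numeric(fit$conf.int))
```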
3.63 The one-sample confidence interval (CI) sample size formula
$$n = \left(\frac{z_{1-\alpha/2} \cdot \sigma}{ME}\right)^2$$
3.65 The one-sample sample size formula
$$n = \left(\sigma\, \frac{z_{1-\beta} + z_{1-\alpha/2}}{\mu_0 - \mu_1}\right)^2$$
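Both sample size formulas reduce to a few calls to qnorm. The sketch below uses hypothetical planning values for σ, the margin of error ME, the levels α and β, and the difference µ0 − µ1; the result is rounded up because n must be a whole number.

```r
# Hypothetical planning values, chosen only for illustration
sigma <- 12     # assumed population standard deviation
alpha <- 0.05   # significance level
ME    <- 3      # wanted margin of error
beta  <- 0.2    # 1 - beta is the wanted power
delta <- 4      # mu0 - mu1, the difference to detect

# 3.63: sample size for a given margin of error
n_ci <- (qnorm(1 - alpha / 2) * sigma / ME)^2

# 3.65: sample size for a given power
n_power <- (sigma * (qnorm(1 - beta) + qnorm(1 - alpha / 2)) / delta)^2

# Round up to the next whole observation
ceiling(c(n_ci = n_ci, n_power = n_power))
```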
3.42 The Normal q-q plot with n > 10
Naive approach: $p_i = \frac{i}{n}, \ i = 1, \ldots, n$
Common approach: $p_i = \frac{i - 0.5}{n + 1}, \ i = 1, \ldots, n$
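3.49 The (Welch) two-sample t-test statistic
$$\delta = \mu_1 - \mu_2, \qquad H_0: \delta = \delta_0$$
$$t_{\text{obs}} = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
3.50 The distribution of the (Welch) two-sample statistic
T approximately follows a t-distribution with ν degrees of freedom.
$$T = \frac{(\bar{X}_1 - \bar{X}_2) - \delta_0}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$$
$$\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$
3.51 The level α two-sample t-test
Test: $H_0: \mu_1 - \mu_2 = \delta_0$ and $H_1: \mu_1 - \mu_2 \neq \delta_0$ by
$$p\text{-value} = 2 \cdot P(T > |t_{\text{obs}}|)$$
Reject: p-value < α or $|t_{\text{obs}}| > t_{1-\alpha/2}$
Accept: Otherwise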
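The Welch statistic, its degrees of freedom and the p-value can be computed step by step and compared with t.test, which performs the Welch test when var.equal is left at its default FALSE. The two samples x1 and x2 and the value of delta0 are hypothetical.

```r
# Hypothetical samples, chosen only for illustration
x1 <- c(7.5, 8.4, 8.0, 6.4, 7.9, 8.7, 7.1)
x2 <- c(8.2, 9.0, 9.3, 8.6, 10.1, 9.5, 8.9, 9.8)
delta0 <- 0

n1 <- length(x1)
n2 <- length(x2)
v1 <- var(x1) / n1   # s1^2 / n1
v2 <- var(x2) / n2   # s2^2 / n2

t_obs   <- ((mean(x1) - mean(x2)) - delta0) / sqrt(v1 + v2)          # 3.49
nu      <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))         # 3.50
p_value <- 2 * (1 - pt(abs(t_obs), df = nu))                         # 3.51

# Built-in Welch test for comparison
fit <- t.test(x1, x2, mu = delta0)
all.equal(t_obs, unname(fit$statistic))
all.equal(nu, unname(fit$parameter))
all.equal(p_value, fit$p.value)
```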
3.52 The pooled two-sample estimate of variance
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
3.53 The pooled two-sample t-test statistic
$$\delta = \mu_1 - \mu_2, \qquad H_0: \delta = \delta_0$$
$$t_{\text{obs}} = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{s_p^2/n_1 + s_p^2/n_2}}$$
4.4 Non-linear error propagation by simulation
1. Simulate k outcomes.
2. Calculate the standard deviation by
$$s^{\text{sim}}_{f(X_1, \ldots, X_n)} = \sqrt{\frac{1}{k-1} \sum_{j=1}^{k} (f_j - \bar{f})^2}$$
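The two steps of the simulation approach can be sketched in R as follows. The function f (a product of two measurements), the assumed normal distributions of the inputs, the number of simulations k and the seed are all hypothetical choices made for illustration.

```r
set.seed(1)   # fixed seed so the sketch is reproducible
k <- 10000    # number of simulated outcomes

# Hypothetical example: f(X1, X2) = X1 * X2 with normally distributed inputs
x1 <- rnorm(k, mean = 2, sd = 0.01)
x2 <- rnorm(k, mean = 3, sd = 0.02)

# Step 1: simulate k outcomes of f
f <- x1 * x2

# Step 2: standard deviation of the simulated outcomes (4.4)
s_sim <- sqrt(sum((f - mean(f))^2) / (k - 1))

s_sim
all.equal(s_sim, sd(f))   # sd() implements the same formula
```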
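5.4 Least squares estimators
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(x_i - \bar{x})}{S_{xx}}$$
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$$
where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$
5.8 Variance of estimators
$$V[\hat{\beta}_0] = \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{S_{xx}}$$
$$V[\hat{\beta}_1] = \frac{\sigma^2}{S_{xx}}$$
$$\mathrm{Cov}[\hat{\beta}_0, \hat{\beta}_1] = -\frac{\bar{x} \sigma^2}{S_{xx}}$$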
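5.12 Test statistics for H0: β0 = 0 and H0: β1 = 0
$$T_{\beta_0} = \frac{\hat{\beta}_0 - \beta_{0,0}}{\hat{\sigma}_{\beta_0}}, \qquad T_{\beta_1} = \frac{\hat{\beta}_1 - \beta_{0,1}}{\hat{\sigma}_{\beta_1}}$$
where $\beta_{0,0}$ and $\beta_{0,1}$ are the parameter values under the null hypotheses (here 0).
5.15 Parameter confidence intervals
$$\hat{\beta}_0 \pm t_{1-\alpha/2}\, \hat{\sigma}_{\beta_0}$$
$$\hat{\beta}_1 \pm t_{1-\alpha/2}\, \hat{\sigma}_{\beta_1}$$
R: confint(fit,level=0.95)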
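5.18 Confidence and prediction interval
Confidence interval for the line:
$$\hat{\beta}_0 + \hat{\beta}_1 x_{\text{new}} \pm t_{1-\alpha/2}\, \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x_{\text{new}} - \bar{x})^2}{S_{xx}}}$$
R: predict(fit, newdata=data.frame(), interval="confidence", level=0.95)
Interval for a new point prediction:
$$\hat{\beta}_0 + \hat{\beta}_1 x_{\text{new}} \pm t_{1-\alpha/2}\, \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_{\text{new}} - \bar{x})^2}{S_{xx}}}$$
R: predict(fit, newdata=data.frame(), interval="prediction", level=0.95)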
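The simple linear regression formulas can be checked against lm, confint and predict. In this sketch the data, the new point x_new and the 95% level are hypothetical, the t quantile is assumed to use n − 2 degrees of freedom, and σ̂ is taken as the square root of RSS/(n − 2).

```r
# Hypothetical data, chosen only for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
n <- length(x)

# 5.4: least squares estimators
Sxx <- sum((x - mean(x))^2)
b1  <- sum((y - mean(y)) * (x - mean(x))) / Sxx
b0  <- mean(y) - b1 * mean(x)

# Residual standard deviation and the standard error of the slope (5.8)
sigma_hat <- sqrt(sum((y - b0 - b1 * x)^2) / (n - 2))
se_b1     <- sigma_hat / sqrt(Sxx)

# 5.15: confidence interval for the slope
tq    <- qt(0.975, df = n - 2)
ci_b1 <- b1 + c(-1, 1) * tq * se_b1

# 5.18: confidence interval for the line and prediction interval at x_new
x_new   <- 4.5
y_new   <- b0 + b1 * x_new
ci_line <- y_new + c(-1, 1) * tq * sigma_hat * sqrt(1 / n + (x_new - mean(x))^2 / Sxx)
pi_new  <- y_new + c(-1, 1) * tq * sigma_hat * sqrt(1 + 1 / n + (x_new - mean(x))^2 / Sxx)

# Built-in functions for comparison
fit <- lm(y ~ x)
confint(fit, level = 0.95)
predict(fit, newdata = data.frame(x = x_new), interval = "confidence", level = 0.95)
predict(fit, newdata = data.frame(x = x_new), interval = "prediction", level = 0.95)
```

The newdata data frame must contain a column with the same name as the explanatory variable used when fitting the model, here x.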
5.23 The matrix formulation of the parameter estimators in the simple linear regression model
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
$$V[\hat{\beta}] = \sigma^2 (X^T X)^{-1}$$
$$\hat{\sigma}^2 = \frac{RSS}{n - 2}$$
6.17 The matrix formulation of the parameter estimators in the multiple linear regression model
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
$$V[\hat{\beta}] = \sigma^2 (X^T X)^{-1}$$
$$\hat{\sigma}^2 = \frac{RSS}{n - (p + 1)}$$
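The matrix formulation translates almost directly into R. The sketch below builds a design matrix with an intercept column and two hypothetical explanatory variables, computes the estimators, and compares them with lm. The estimated covariance matrix is obtained by plugging σ̂² in for σ².

```r
# Hypothetical data with two explanatory variables, chosen only for illustration
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8)
x2 <- c(3, 1, 4, 1, 5, 9, 2, 6)
y  <- c(4.2, 4.1, 8.3, 7.0, 11.9, 17.2, 11.1, 16.3)
n  <- length(y)

X <- cbind(1, x1, x2)   # design matrix: intercept column plus p = 2 variables
p <- ncol(X) - 1

XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y          # (X^T X)^{-1} X^T Y

rss        <- sum((y - X %*% beta_hat)^2)   # residual sum of squares
sigma2_hat <- rss / (n - (p + 1))
V_hat      <- sigma2_hat * XtX_inv          # estimated covariance matrix of beta_hat

# Built-in fit for comparison
fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), unname(coef(fit)))
all.equal(V_hat, vcov(fit), check.attributes = FALSE)
```

Inverting X^T X explicitly keeps the code close to the formula; lm itself uses a QR decomposition, which is numerically more stable.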
7.11 The level α one-sample proportion hypothesis test
Test: $H_0: p = p_0$ vs. $H_1: p \neq p_0$ by
$$p\text{-value} = 2 \cdot P(Z > |z_{\text{obs}}|)$$
where $Z \sim N(0, 1^2)$.
If p-value < α then reject H0, otherwise accept H0.
R: prop.test(x=, n=, correct=FALSE)
7.13 Sample size formula for the CI of a proportion
Guessed p (with prior knowledge):
$$n = p(1 - p) \left(\frac{z_{1-\alpha/2}}{ME}\right)^2$$
Unknown p:
$$n = \frac{1}{4} \left(\frac{z_{1-\alpha/2}}{ME}\right)^2$$
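The proportion test and the sample size formulas can be tried out in R. The expression used for z_obs below is not listed in the entry above; it is the usual standardized sample proportion under H0 and is stated here as an assumption. The counts, p0, the guessed p and ME are hypothetical.

```r
# Hypothetical data: 36 successes out of 100, testing p0 = 0.5
x  <- 36
n  <- 100
p0 <- 0.5

# Assumed form of the observed statistic: standardized sample proportion under H0
p_hat   <- x / n
z_obs   <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- 2 * (1 - pnorm(abs(z_obs)))        # 7.11

# prop.test reports X-squared, which equals z_obs^2 when correct = FALSE
fit <- prop.test(x = x, n = n, p = p0, correct = FALSE)
all.equal(z_obs^2, unname(fit$statistic))
all.equal(p_value, fit$p.value)

# 7.13: sample sizes for a margin of error of 0.03 at alpha = 0.05
ME <- 0.03
z  <- qnorm(1 - 0.05 / 2)
n_guess   <- 0.3 * (1 - 0.3) * (z / ME)^2     # guessed p = 0.3
n_unknown <- 1 / 4 * (z / ME)^2               # unknown p
ceiling(c(n_guess = n_guess, n_unknown = n_unknown))
```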
7.18 The level α two-sample proportions hypothesis test
Test: $H_0: p_1 = p_2$ vs. $H_1: p_1 \neq p_2$ by
$$p\text{-value} = 2 \cdot P(Z > |z_{\text{obs}}|)$$
where $Z \sim N(0, 1^2)$.
If p-value < α then reject H0, otherwise accept H0.
R: prop.test(x=, n=, correct=FALSE)
A.8 Comparing means of multiple groups - ANOVA
$$H_0: \alpha_i = 0; \quad i = 1, 2, \ldots, k$$
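8.10 Post hoc pairwise hypothesis tests
Test: $H_0: \mu_i = \mu_j$ vs. $H_1: \mu_i \neq \mu_j$ by
$$p\text{-value} = 2 \cdot P(T > |t_{\text{obs}}|)$$
where
$$t_{\text{obs}} = \frac{\bar{y}_i - \bar{y}_j}{\sqrt{MSE \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$$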
One-way ANOVA
Two-way ANOVA
Glossaries
cumulative distribution function [Fordelingsfunktion] The cdf is the function which determines the probability of observing an outcome of a random variable below a given value.
Correlation [Korrelation] The sample correlation coefficient is a summary statistic that can be calculated for two (related) sets of observations. It quantifies the (linear) strength of the relation between the two. See also: Covariance.
Covariance [Kovarians] The sample covariance coefficient is a summary statistic that can be calculated for two (related) sets of observations. It quantifies the (linear) strength of the relation between the two. See also: Correlation.
F-distribution [F-fordelingen] The F-distribution appears as the ratio between two independent χ²-distributed random variables.
Inter Quartile Range [Interkvartil bredde] The Inter Quartile Range (IQR) is the middle 50% range of data.
Median [Median, stikprøvemedian] The median of a population or sample (note: the text makes no distinction between population median and sample median).
probability density function The pdf is the function which determines the probability of every possible outcome of a random variable.
Quantile [Fraktil, stikprøvefraktil] The quantiles of a population or sample (note: the text makes no distinction between population quantile and sample quantile).
Quartile [Fraktil, stikprøvefraktil] The quartiles of a population or sample (note: the text makes no distinction between population quartile and sample quartile).
Acronyms