Formulas


Chapter A

Appendix A

Collection of formulas and R commands



Contents

A Collection of formulas and R commands


A.1 Introduction, descriptive statistics, R and data visualization
A.2 Probability and Simulation
A.2.1 Distributions
A.3 Statistics for one and two samples
A.4 Simulation based statistics
A.5 Simple linear regression
A.6 Multiple linear regression
A.7 Inference for proportions
A.8 Comparing means of multiple groups - ANOVA

Glossaries

Acronyms

This appendix chapter holds a collection of formulas. All the relevant equations from definitions, methods and theorems are included, along with the associated R functions. They are included in the same order as in the book, except for the distributions, which are listed together.

A.1 Introduction, descriptive statistics, R and data visualization

Description Formula R command

1.4 Sample mean
The mean of a sample.
Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
R command: mean(x)

1.5 Sample median
The value that divides a sample into two halves with an equal number of observations in each.
Formula: $Q_2 = x_{\left(\frac{n+1}{2}\right)}$ for odd $n$; $Q_2 = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n+2}{2}\right)}}{2}$ for even $n$
R command: median(x)

1.7 Sample quantile
The value that divides a sample such that a proportion p of the observations are less than the value. The 0.5 quantile is the median.
Formula: $q_p = \frac{x_{(np)} + x_{(np+1)}}{2}$ for $pn$ integer; $q_p = x_{(\lceil np \rceil)}$ for $pn$ non-integer
R command: quantile(x, p, type=2)

1.8 Sample quartiles
The quartiles are the five quantiles dividing the sample into four parts, such that each part holds an equal number of observations.
Formula: $Q_0 = q_0$ = "minimum"; $Q_1 = q_{0.25}$ = "lower quartile"; $Q_2 = q_{0.5}$ = "median"; $Q_3 = q_{0.75}$ = "upper quartile"; $Q_4 = q_1$ = "maximum"
R command: quantile(x, probs, type=2), where probs=p

1.10 Sample variance
The sum of squared differences from the mean divided by n − 1.
Formula: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
R command: var(x)

1.11 Sample standard deviation
The square root of the sample variance.
Formula: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
R command: sd(x)

1.12 Sample coefficient of variation
The sample standard deviation seen relative to the sample mean.
Formula: $V = \frac{s}{\bar{x}}$
R command: sd(x)/mean(x)

1.15 Sample Inter Quartile Range
IQR: the middle 50% range of the data.
Formula: $\mathrm{IQR} = Q_3 - Q_1$
R command: IQR(x, type=2)
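As a quick check of the descriptive statistics above, the following sketch evaluates the formulas by hand and compares them with the built-in R functions. The vector x is a made-up sample used only for illustration; it is not data from the book.

```r
# Hypothetical sample, chosen only for illustration
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
n <- length(x)

# Sample mean and sample variance from the formulas
xbar <- sum(x) / n
s2 <- sum((x - xbar)^2) / (n - 1)

# Median for even n: average of the two middle order statistics
xs <- sort(x)
Q2 <- (xs[n / 2] + xs[(n + 2) / 2]) / 2

# Quartiles with type=2, the quantile definition used in this collection
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), type = 2)

# Coefficient of variation and inter quartile range
V <- sd(x) / mean(x)
iqr <- IQR(x, type = 2)

c(mean = xbar, variance = s2, sd = sqrt(s2), median = Q2, CV = V, IQR = iqr)
```

Note that quantile() and IQR() default to type=7, so type=2 has to be given explicitly to match the definitions listed here.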

Description Formula R command

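1.18 Sample covariance
Measure of the strength of the linear relation between two samples.
Formula: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
R command: cov(x,y)

1.19 Sample correlation
Measure of the strength of the linear relation between two samples, between -1 and 1.
Formula: $r = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{s_{xy}}{s_x \cdot s_y}$
R command: cor(x,y)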
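The covariance and correlation formulas can be checked the same way. This is a minimal sketch; the paired vectors x and y are hypothetical values made up for illustration.

```r
# Hypothetical paired observations, for illustration only
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3)
y <- c(2.0, 2.9, 3.8, 5.1, 5.2, 7.0)
n <- length(x)

# Sample covariance from the definition
sxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Sample correlation: the covariance scaled by both standard deviations
r <- sxy / (sd(x) * sd(y))

c(covariance = sxy, correlation = r)
```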

A.2 Probability and Simulation

Description Formula R command

2.6 Probability density function (pdf) for a discrete variable
Fulfills two conditions, $f(x) \geq 0$ and $\sum_{\text{all } x} f(x) = 1$, and gives the probability of one x value.
Formula: $f(x) = P(X = x)$
R command: dnorm, dbinom, dhyper, dpois

2.9 Cumulated distribution function (cdf)
Gives the probability in a range of x values, where $P(a < X \leq b) = F(b) - F(a)$.
Formula: $F(x) = P(X \leq x)$
R command: pnorm, pbinom, phyper, ppois

2.13 Mean of a discrete random variable
Formula: $\mu = E(X) = \sum_{i=1}^{\infty} x_i f(x_i)$

2.16 Variance of a discrete random variable X
Formula: $\sigma^2 = \mathrm{Var}(X) = E[(X - \mu)^2]$

2.32 Pdf of a continuous random variable
A non-negative function for all possible outcomes, with an area of one below the function.
Formula: $P(a < X \leq b) = \int_a^b f(x)\,dx$

2.33 Cdf of a continuous random variable
Non-decreasing, with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
Formula: $F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\,du$

2.34 Mean and variance for a continuous random variable X
Formula: $\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$; $\sigma^2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$

2.54 Mean and variance of a linear function
The mean and variance of a linear function of a random variable X.
Formula: $E(aX + b) = a\,E(X) + b$; $V(aX + b) = a^2\,V(X)$

2.56 Mean and variance of a linear combination
The mean and variance of a linear combination of random variables. The variance formula requires the random variables to be independent.
Formula: $E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 E(X_1) + a_2 E(X_2) + \cdots + a_n E(X_n)$; $V(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1^2 V(X_1) + a_2^2 V(X_2) + \cdots + a_n^2 V(X_n)$

Description Formula R command

2.58 Covariance
The covariance between two random variables X and Y.
Formula: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$
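The definitions in this section can be evaluated directly for a small discrete distribution. The sketch below uses a binomial variable with n = 10 and p = 0.3 (an arbitrary choice for illustration) to check that the pdf sums to one, that P(a < X ≤ b) = F(b) − F(a), and that the linear-function rules for mean and variance hold.

```r
# Illustrative choice: X ~ binomial(n = 10, p = 0.3)
size <- 10
prob <- 0.3
xvals <- 0:size
f <- dbinom(xvals, size, prob)   # pdf values f(x) = P(X = x)

# The pdf sums to one
total <- sum(f)

# P(a < X <= b) from the cdf, and directly by summing the pdf
a <- 2
b <- 5
p_cdf <- pbinom(b, size, prob) - pbinom(a, size, prob)
p_pdf <- sum(dbinom((a + 1):b, size, prob))

# Mean and variance of the discrete variable from the definitions
mu <- sum(xvals * f)
sigma2 <- sum((xvals - mu)^2 * f)

# Linear function 3X + 2: the mean is scaled and shifted, the variance only scaled
mean_lin <- sum((3 * xvals + 2) * f)
var_lin <- sum((3 * xvals + 2 - mean_lin)^2 * f)

c(total = total, p_cdf = p_cdf, p_pdf = p_pdf, mu = mu, sigma2 = sigma2)
```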

A.2.1 Distributions

All the included distributions are listed here, together with some important theorems and definitions related specifically to each distribution.

Description Formula R command

2.20 Binomial distribution
n is the number of independent draws and p is the probability of a success in each draw. The binomial pdf describes the probability of x successes.
Formula: $f(x; n, p) = P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$, where $\binom{n}{x} = \frac{n!}{x!(n - x)!}$
R command: dbinom(x, size, prob), pbinom(q, size, prob), qbinom(p, size, prob), rbinom(n, size, prob), where size=n, prob=p

2.21 Mean and variance of a binomial distributed random variable
Formula: $\mu = np$; $\sigma^2 = np(1 - p)$

2.24 Hypergeometric distribution
n is the number of draws without replacement, a is the number of successes in the population and N is the population size.
Formula: $f(x; n, a, N) = P(X = x) = \frac{\binom{a}{x}\binom{N - a}{n - x}}{\binom{N}{n}}$, where $\binom{a}{b} = \frac{a!}{b!(a - b)!}$
R command: dhyper(x,m,n,k), phyper(q,m,n,k), qhyper(p,m,n,k), rhyper(nn,m,n,k), where m=a, n=N−a, k=n

2.25 Mean and variance of a hypergeometric distributed random variable
Formula: $\mu = n\frac{a}{N}$; $\sigma^2 = n\,\frac{a(N - a)}{N^2}\,\frac{N - n}{N - 1}$

2.27 Poisson distribution
λ is the rate (or intensity), i.e. the average number of events per interval. The Poisson pdf describes the probability of x events in an interval.
Formula: $f(x; \lambda) = \frac{\lambda^x}{x!} e^{-\lambda}$
R command: dpois(x,lambda), ppois(q,lambda), qpois(p,lambda), rpois(n,lambda), where lambda=λ

2.28 Mean and variance of a Poisson distributed random variable
Formula: $\mu = \lambda$; $\sigma^2 = \lambda$

2.35 Uniform distribution
α and β define the range of possible outcomes. A random variable following the uniform distribution has equal density at any value within a defined range.
Formula: $f(x; \alpha, \beta) = \begin{cases} 0 & \text{for } x < \alpha \\ \frac{1}{\beta - \alpha} & \text{for } x \in [\alpha, \beta] \\ 0 & \text{for } x > \beta \end{cases}$
$F(x; \alpha, \beta) = \begin{cases} 0 & \text{for } x < \alpha \\ \frac{x - \alpha}{\beta - \alpha} & \text{for } x \in [\alpha, \beta] \\ 1 & \text{for } x > \beta \end{cases}$
R command: dunif(x,min,max), punif(q,min,max), qunif(p,min,max), runif(n,min,max), where min=α, max=β
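The mapping between the symbols in the formulas and the R arguments is easy to get wrong, especially for the hypergeometric distribution. The following sketch evaluates each formula by hand and compares it with the corresponding d/p function; all parameter values are arbitrary and chosen only for illustration.

```r
# Binomial: formula versus dbinom (size = n, prob = p)
n <- 10; p <- 0.3; x <- 4
f_binom <- choose(n, x) * p^x * (1 - p)^(n - x)

# Hypergeometric: 'draws' draws, a successes in a population of size N.
# R's dhyper uses m = a, n = N - a, k = number of draws.
N <- 20; a <- 8; draws <- 5; xh <- 2
f_hyper <- choose(a, xh) * choose(N - a, draws - xh) / choose(N, draws)

# Poisson with rate lambda
lambda <- 2.5; xp <- 3
f_pois <- lambda^xp / factorial(xp) * exp(-lambda)

# Uniform cdf on [alpha, beta], evaluated inside the range
alpha <- 1; beta <- 5
F_unif <- (2 - alpha) / (beta - alpha)

c(binom = f_binom, hyper = f_hyper, pois = f_pois, unif = F_unif)
```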

Description Formula R command

2.36 Mean and variance of a uniform distributed random variable X
Formula: $\mu = \frac{1}{2}(\alpha + \beta)$; $\sigma^2 = \frac{1}{12}(\beta - \alpha)^2$

2.37 Normal distribution
Often also called the Gaussian distribution.
Formula: $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
R command: dnorm(x,mean,sd), pnorm(q,mean,sd), qnorm(p,mean,sd), rnorm(n,mean,sd), where mean=µ, sd=σ

2.38 Mean and variance of a normal distributed random variable
Formula: mean $\mu$; variance $\sigma^2$

2.43 Transformation of a normal distributed random variable X into a standardized normal random variable
Formula: $Z = \frac{X - \mu}{\sigma}$

2.46 Log-normal distribution
α is the mean and β² is the variance of the normal distribution obtained when taking the natural logarithm of X.
Formula: $f(x) = \frac{1}{x\sqrt{2\pi}\,\beta}\, e^{-\frac{(\ln x - \alpha)^2}{2\beta^2}}$
R command: dlnorm(x,meanlog,sdlog), plnorm(q,meanlog,sdlog), qlnorm(p,meanlog,sdlog), rlnorm(n,meanlog,sdlog), where meanlog=α, sdlog=β

2.47 Mean and variance of a log-normal distributed random variable
Formula: $\mu = e^{\alpha + \beta^2/2}$; $\sigma^2 = e^{2\alpha + \beta^2}\left(e^{\beta^2} - 1\right)$

2.48 Exponential distribution
λ is the mean rate of events.
Formula: $f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \geq 0 \\ 0 & \text{for } x < 0 \end{cases}$
R command: dexp(x,rate), pexp(q,rate), qexp(p,rate), rexp(n,rate), where rate=λ

2.49 Mean and variance of an exponential distributed random variable
Formula: $\mu = \frac{1}{\lambda}$; $\sigma^2 = \frac{1}{\lambda^2}$

2.78 χ²-distribution
$\Gamma\left(\frac{\nu}{2}\right)$ is the Γ-function and ν is the degrees of freedom.
Formula: $f(x) = \frac{1}{2^{\nu/2}\,\Gamma\left(\frac{\nu}{2}\right)}\, x^{\frac{\nu}{2} - 1} e^{-\frac{x}{2}}$; $x \geq 0$
R command: dchisq(x,df), pchisq(q,df), qchisq(p,df), rchisq(n,df), where df=ν
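A short sketch of how the continuous distributions above can be checked numerically. It compares a normal probability before and after standardization, evaluates the normal pdf from its formula, and confirms the log-normal and exponential means by numerical integration. The parameter values are arbitrary illustrations.

```r
# Normal: standardizing with Z = (X - mu) / sigma gives the same probability
mu <- 10; sigma <- 2; x <- 12.5
p_direct <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm((x - mu) / sigma)

# Normal pdf from the formula
f_norm <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

# Log-normal: theoretical mean, and a numerical check by integration
alpha <- 0.5; beta <- 0.4
m_lnorm <- exp(alpha + beta^2 / 2)
m_num <- integrate(function(u) u * dlnorm(u, meanlog = alpha, sdlog = beta),
                   lower = 0, upper = Inf)$value

# Exponential: the mean is 1 / lambda
lambda <- 3
m_exp <- integrate(function(u) u * dexp(u, rate = lambda),
                   lower = 0, upper = Inf)$value

c(p_direct = p_direct, p_standard = p_standard, m_lnorm = m_lnorm, m_exp = m_exp)
```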

Description Formula R command

2.81 Given a sample of size n from the normal distributed random variables $X_i$ with variance $\sigma^2$, the sample variance $S^2$ (viewed as a random variable) can be transformed to follow the χ² distribution with ν = n − 1 degrees of freedom.
Formula: $\chi^2 = \frac{(n - 1) S^2}{\sigma^2}$

2.83 Mean and variance of a χ² distributed random variable
Formula: $E(X) = \nu$; $V(X) = 2\nu$

2.86 t-distribution
ν is the degrees of freedom and Γ() is the Gamma function.
Formula: $f_T(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$
R command: dt(x,df), pt(q,df), qt(p,df), rt(n,df), where df=ν

2.87 Relation between normal random variables and χ²-distributed random variables
$Z \sim N(0, 1)$ and $Y \sim \chi^2(\nu)$.
Formula: $X = \frac{Z}{\sqrt{Y/\nu}} \sim t(\nu)$

2.89 For normal distributed random variables $X_1, \ldots, X_n$, the random variable T follows the t-distribution, where $\bar{X}$ is the sample mean, µ is the mean of X, n is the sample size and S is the sample standard deviation.
Formula: $T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n - 1)$

2.93 Mean and variance of a t-distributed variable X
Formula: $\mu = 0$; $\nu > 1$. $\sigma^2 = \frac{\nu}{\nu - 2}$; $\nu > 2$

2.95 F-distribution
$\nu_1$ and $\nu_2$ are the degrees of freedom and $B(\cdot, \cdot)$ is the Beta function.
Formula: $f_F(x) = \frac{1}{B\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}} x^{\frac{\nu_1}{2} - 1} \left(1 + \frac{\nu_1}{\nu_2} x\right)^{-\frac{\nu_1 + \nu_2}{2}}$
R command: df(x,df1,df2), pf(q,df1,df2), qf(p,df1,df2), rf(n,df1,df2), where df1=ν1, df2=ν2

2.96 The F-distribution appears as the ratio between two independent χ²-distributed random variables with $U \sim \chi^2(\nu_1)$ and $V \sim \chi^2(\nu_2)$.
Formula: $\frac{U/\nu_1}{V/\nu_2} \sim F(\nu_1, \nu_2)$

Description Formula R command

2.98 $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$, with the means $\mu_1$ and $\mu_2$ and the variances $\sigma_1^2$ and $\sigma_2^2$, are independent and sampled from normal distributions.
Formula: $\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1 - 1, n_2 - 1)$

2.101 Mean and variance of an F-distributed variable X
Formula: $\mu = \frac{\nu_2}{\nu_2 - 2}$; $\nu_2 > 2$. $\sigma^2 = \frac{2\nu_2^2(\nu_1 + \nu_2 - 2)}{\nu_1(\nu_2 - 2)^2(\nu_2 - 4)}$; $\nu_2 > 4$
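The χ², t and F densities are easy to mistype, so a direct numerical check against the built-in density functions can be useful. This is a minimal sketch; the evaluation point and degrees of freedom are arbitrary values chosen for illustration.

```r
nu <- 7    # degrees of freedom for the chi-square and t densities
x <- 1.3   # evaluation point

# Chi-square pdf from the formula
f_chisq <- 1 / (2^(nu / 2) * gamma(nu / 2)) * x^(nu / 2 - 1) * exp(-x / 2)

# t pdf from the formula
f_t <- gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) *
  (1 + x^2 / nu)^(-(nu + 1) / 2)

# F pdf from the formula, with nu1 and nu2 degrees of freedom
nu1 <- 4; nu2 <- 9
f_F <- 1 / beta(nu1 / 2, nu2 / 2) * (nu1 / nu2)^(nu1 / 2) *
  x^(nu1 / 2 - 1) * (1 + nu1 / nu2 * x)^(-(nu1 + nu2) / 2)

c(chisq = f_chisq, t = f_t, F = f_F)
```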

A.3 Statistics for one and two samples

Description Formula R command

3.3 The distribution of the mean of normal random variables
Formula: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

3.5 The distribution of the σ-standardized mean of normal random variables
Formula: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1^2)$

3.5 The distribution of the S-standardized mean of normal random variables
Formula: $T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n - 1)$

3.7 Standard Error of the mean
Formula: $SE_{\bar{x}} = \frac{s}{\sqrt{n}}$

3.9 The one-sample confidence interval for µ
Formula: $\bar{x} \pm t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$

3.14 Central Limit Theorem (CLT)
Formula: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

3.19 Confidence interval for the variance and standard deviation
Formula: $\sigma^2: \left[\frac{(n - 1)s^2}{\chi^2_{1-\alpha/2}};\ \frac{(n - 1)s^2}{\chi^2_{\alpha/2}}\right]$; $\sigma: \left[\sqrt{\frac{(n - 1)s^2}{\chi^2_{1-\alpha/2}}};\ \sqrt{\frac{(n - 1)s^2}{\chi^2_{\alpha/2}}}\right]$

3.22 The p-value
The p-value is the probability of obtaining a test statistic that is at least as extreme as the test statistic that was actually observed. This probability is calculated under the assumption that the null hypothesis is true.
R command: 2*(1-pt(abs(tobs),n-1)) for the two-sided p-value $2 \cdot P(T > |t_{obs}|)$

3.23 The one-sample t-test statistic and p-value
Formula: p-value $= 2 \cdot P(T > |t_{obs}|)$; $t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$; $H_0: \mu = \mu_0$

3.24 The hypothesis test
Rejected: p-value < α. Accepted: otherwise.

3.29 Significant effect
An effect is significant if the p-value < α.

3.31 The critical values: α/2- and 1 − α/2-quantiles of the t-distribution with n − 1 degrees of freedom
Formula: $t_{\alpha/2}$ and $t_{1-\alpha/2}$

3.32 The one-sample hypothesis test by the critical value
Reject: $|t_{obs}| > t_{1-\alpha/2}$. Accept: otherwise.
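The one-sample formulas above fit together as in the following sketch, which computes the test statistic, p-value, critical value and confidence intervals by hand and compares them with t.test(). The sample x and the null value mu0 are hypothetical and used only for illustration.

```r
# Hypothetical sample and null value, for illustration only
x <- c(168, 161, 167, 179, 184, 166, 198, 187, 191, 179)
mu0 <- 170
alpha <- 0.05
n <- length(x)

# Standard error, test statistic and two-sided p-value
se <- sd(x) / sqrt(n)
tobs <- (mean(x) - mu0) / se
pvalue <- 2 * (1 - pt(abs(tobs), df = n - 1))

# Confidence interval for the mean from the t quantile
tcrit <- qt(1 - alpha / 2, df = n - 1)
ci <- mean(x) + c(-1, 1) * tcrit * se

# Decision by critical value; equivalent to comparing the p-value with alpha
reject <- abs(tobs) > tcrit

# Confidence interval for the variance from the chi-square quantiles
ci_var <- (n - 1) * var(x) / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1)

# Built-in equivalent of the test and the interval for the mean
res <- t.test(x, mu = mu0, conf.level = 1 - alpha)
res
```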

Description Formula R command

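3.33 Confidence interval for µ
Formula: $\bar{x} \pm t_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$
The confidence interval corresponds to the acceptance region for $H_0: \mu = \mu_0$.

3.36 The level α one-sample t-test
Test: $H_0: \mu = \mu_0$ and $H_1: \mu \neq \mu_0$ by p-value $= 2 \cdot P(T > |t_{obs}|)$.
Reject: p-value < α or $|t_{obs}| > t_{1-\alpha/2}$. Accept: otherwise.

3.63 The one-sample confidence interval (CI) sample size formula
Formula: $n = \left(\frac{z_{1-\alpha/2} \cdot \sigma}{ME}\right)^2$

3.65 The one-sample sample size formula
Formula: $n = \left(\sigma\,\frac{z_{1-\beta} + z_{1-\alpha/2}}{\mu_0 - \mu_1}\right)^2$

3.42 The Normal q-q plot with n > 10
Naive approach: $p_i = \frac{i}{n}$, $i = 1, \ldots, n$.
Commonly used approach: $p_i = \frac{i - 0.5}{n + 1}$, $i = 1, \ldots, n$.

3.49 The (Welch) two-sample t-test statistic
Formula: $\delta = \mu_1 - \mu_2$; $H_0: \delta = \delta_0$; $t_{obs} = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$

3.50 The distribution of the (Welch) two-sample statistic
Formula: $T = \frac{(\bar{X}_1 - \bar{X}_2) - \delta_0}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$, with degrees of freedom $\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$

3.51 The level α two-sample t-test
Test: $H_0: \mu_1 - \mu_2 = \delta_0$ and $H_1: \mu_1 - \mu_2 \neq \delta_0$ by p-value $= 2 \cdot P(T > |t_{obs}|)$.
Reject: p-value < α or $|t_{obs}| > t_{1-\alpha/2}$. Accept: otherwise.

3.52 The pooled two-sample estimate of variance
Formula: $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

3.53 The pooled two-sample t-test statistic
Formula: $\delta = \mu_1 - \mu_2$; $H_0: \delta = \delta_0$; $t_{obs} = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{s_p^2/n_1 + s_p^2/n_2}}$

3.54 The distribution of the pooled two-sample t-test statistic
Formula: $T = \frac{(\bar{X}_1 - \bar{X}_2) - \delta_0}{\sqrt{S_p^2/n_1 + S_p^2/n_2}}$, which follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom under $H_0$ when the two variances are equal.

3.47 The two-sample confidence interval for $\mu_1 - \mu_2$
Formula: $\bar{x} - \bar{y} \pm t_{1-\alpha/2} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, with $\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$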
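The two-sample formulas can be traced step by step as in the sketch below, which computes the Welch statistic, its degrees of freedom, the confidence interval and the pooled statistic, and compares them with t.test(). The two samples are hypothetical values invented for illustration, and δ0 is taken to be 0.

```r
# Hypothetical samples, for illustration only
x1 <- c(7.53, 7.48, 8.08, 8.09, 10.15, 8.40, 10.88, 6.13, 7.90)
x2 <- c(9.21, 11.51, 12.79, 11.85, 9.97, 8.79, 9.69, 9.68, 9.19)
n1 <- length(x1); n2 <- length(x2)

# Welch statistic and degrees of freedom (delta0 = 0)
v1 <- var(x1) / n1
v2 <- var(x2) / n2
tobs <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)
nu <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
pvalue <- 2 * (1 - pt(abs(tobs), df = nu))

# Two-sample confidence interval for mu1 - mu2
ci <- mean(x1) - mean(x2) + c(-1, 1) * qt(0.975, df = nu) * sqrt(v1 + v2)

# Pooled variance estimate and pooled statistic
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
tpool <- (mean(x1) - mean(x2)) / sqrt(sp2 / n1 + sp2 / n2)

# Built-in equivalents: Welch is the default, var.equal=TRUE gives the pooled test
res <- t.test(x1, x2)
res_pool <- t.test(x1, x2, var.equal = TRUE)
```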

A.4 Simulation based statistics

Description Formula R command

4.3 The non-linear approximative error propagation rule
Formula: $\sigma^2_{f(X_1, \ldots, X_n)} = \sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\right)^2 \sigma_i^2$

4.4 Non-linear error propagation by simulation
1. Simulate k outcomes.
2. Calculate the standard deviation by $s^{sim}_{f(X_1, \ldots, X_n)} = \sqrt{\frac{1}{k - 1}\sum_{j=1}^{k} (f_j - \bar{f})^2}$

4.7 Confidence interval for any feature θ by parametric bootstrap
1. Simulate k samples.
2. Calculate the statistic $\hat{\theta}^*$ for each sample.
3. Calculate CI: $\left[q^*_{100(\alpha/2)\%},\ q^*_{100(1-\alpha/2)\%}\right]$

4.10 Two-sample confidence interval for any feature comparison $\theta_1 - \theta_2$ by parametric bootstrap
1. Simulate k sets of 2 samples.
2. Calculate the statistic $\hat{\theta}^*_{xk} - \hat{\theta}^*_{yk}$ for each set.
3. Calculate CI: $\left[q^*_{100(\alpha/2)\%},\ q^*_{100(1-\alpha/2)%}\right]$
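The two error propagation approaches can be compared on a small example. The sketch below assumes a function f(x1, x2) = x1 · x2 with two independent, normally distributed inputs. The function, the means and the standard deviations are assumptions made for illustration, not values from the book.

```r
set.seed(1)

# Assumed inputs: means and standard deviations of two independent measurements
mu <- c(2, 3)
sigma <- c(0.02, 0.03)
f <- function(x1, x2) x1 * x2

# Approximative rule: partial derivatives of f evaluated at the means
dfdx <- c(mu[2], mu[1])
s_approx <- sqrt(sum(dfdx^2 * sigma^2))

# Simulation: 1. simulate k outcomes, 2. take their standard deviation
k <- 10000
fj <- f(rnorm(k, mu[1], sigma[1]), rnorm(k, mu[2], sigma[2]))
s_sim <- sd(fj)

c(approx = s_approx, simulated = s_sim)
```

With small relative errors like these the two results should agree closely. The simulation approach does not rely on the linearization, so it is the one to trust when they differ noticeably.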

A.5 Simple linear regression

Description Formula R command

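5.4 Least square estimators
Formula: $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(x_i - \bar{x})}{S_{xx}}$; $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$, where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$

5.8 Variance of estimators
Formula: $V[\hat{\beta}_0] = \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{S_{xx}}$; $V[\hat{\beta}_1] = \frac{\sigma^2}{S_{xx}}$; $\mathrm{Cov}[\hat{\beta}_0, \hat{\beta}_1] = -\frac{\bar{x}\sigma^2}{S_{xx}}$

5.12 Test statistics for $H_0: \beta_0 = 0$ and $H_0: \beta_1 = 0$
Formula: $T_{\beta_0} = \frac{\hat{\beta}_0 - \beta_{0,0}}{\hat{\sigma}_{\beta_0}}$; $T_{\beta_1} = \frac{\hat{\beta}_1 - \beta_{0,1}}{\hat{\sigma}_{\beta_1}}$

5.14 Level α t-tests for parameter
Test $H_{0,i}: \beta_i = \beta_{0,i}$ vs. $H_{1,i}: \beta_i \neq \beta_{0,i}$ with p-value $= 2 \cdot P(T > |t_{obs,\beta_i}|)$, where $t_{obs,\beta_i} = \frac{\hat{\beta}_i - \beta_{0,i}}{\hat{\sigma}_{\beta_i}}$. If p-value < α then reject $H_0$, otherwise accept $H_0$.
R command: D <- data.frame(x=c(), y=c()); fit <- lm(y~x, data=D); summary(fit)

5.15 Parameter confidence intervals
Formula: $\hat{\beta}_0 \pm t_{1-\alpha/2}\,\hat{\sigma}_{\beta_0}$; $\hat{\beta}_1 \pm t_{1-\alpha/2}\,\hat{\sigma}_{\beta_1}$
R command: confint(fit,level=0.95)

5.18 Confidence and prediction interval
Confidence interval for the line: $\hat{\beta}_0 + \hat{\beta}_1 x_{new} \pm t_{1-\alpha/2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$
R command: predict(fit, newdata=data.frame(), interval="confidence", level=0.95)
Interval for a new point prediction: $\hat{\beta}_0 + \hat{\beta}_1 x_{new} \pm t_{1-\alpha/2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$
R command: predict(fit, newdata=data.frame(), interval="prediction", level=0.95)

5.23 The matrix formulation of the parameter estimators in the simple linear regression model
Formula: $\hat{\beta} = (X^T X)^{-1} X^T Y$; $V[\hat{\beta}] = \sigma^2 (X^T X)^{-1}$; $\hat{\sigma}^2 = \frac{RSS}{n - 2}$

5.25 Coefficient of determination $R^2$
Formula: $r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$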
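To see how the simple linear regression formulas relate to the output of lm(), the following sketch computes the estimates, standard errors, the coefficient of determination and both intervals by hand. The data frame D and the new point xnew are hypothetical values invented for illustration.

```r
# Hypothetical data, for illustration only
D <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1))
x <- D$x; y <- D$y
n <- nrow(D)

# Least squares estimators
Sxx <- sum((x - mean(x))^2)
b1 <- sum((y - mean(y)) * (x - mean(x))) / Sxx
b0 <- mean(y) - b1 * mean(x)

# Residual standard deviation and standard errors of the estimators
RSS <- sum((y - (b0 + b1 * x))^2)
sigma_hat <- sqrt(RSS / (n - 2))
se_b0 <- sigma_hat * sqrt(1 / n + mean(x)^2 / Sxx)
se_b1 <- sigma_hat / sqrt(Sxx)

# Coefficient of determination
r2 <- 1 - RSS / sum((y - mean(y))^2)

# Confidence interval for the line and prediction interval at a new x value
xnew <- 4.5
tq <- qt(0.975, df = n - 2)
h <- 1 / n + (xnew - mean(x))^2 / Sxx
ci_line <- b0 + b1 * xnew + c(-1, 1) * tq * sigma_hat * sqrt(h)
pi_new <- b0 + b1 * xnew + c(-1, 1) * tq * sigma_hat * sqrt(1 + h)

# Built-in equivalents
fit <- lm(y ~ x, data = D)
p_conf <- predict(fit, newdata = data.frame(x = xnew),
                  interval = "confidence", level = 0.95)
p_pred <- predict(fit, newdata = data.frame(x = xnew),
                  interval = "prediction", level = 0.95)
```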

Description Formula R command

5.7 Model validation of assumptions
Check the normality assumption with a q-q plot of the residuals.
R command: qqnorm(fit$residuals); qqline(fit$residuals)
Check for systematic behavior by plotting the residuals $e_i$ as a function of the fitted values $\hat{y}_i$.
R command: plot(fit$fitted.values, fit$residuals)

A.6 Multiple linear regression

Description Formula R command

6.2 Level α t-tests for parameter
Test $H_{0,i}: \beta_i = \beta_{0,i}$ vs. $H_{1,i}: \beta_i \neq \beta_{0,i}$ with p-value $= 2 \cdot P(T > |t_{obs,\beta_i}|)$, where $t_{obs,\beta_i} = \frac{\hat{\beta}_i - \beta_{0,i}}{\hat{\sigma}_{\beta_i}}$. If p-value < α then reject $H_0$, otherwise accept $H_0$.
R command: D <- data.frame(x1=c(), x2=c(), y=c()); fit <- lm(y~x1+x2, data=D); summary(fit)

6.5 Parameter confidence intervals
Formula: $\hat{\beta}_i \pm t_{1-\alpha/2}\,\hat{\sigma}_{\beta_i}$
R command: confint(fit,level=0.95)

6.9 Confidence and prediction interval (in R)
Confidence interval for the line $\hat{\beta}_0 + \hat{\beta}_1 x_{1,new} + \cdots + \hat{\beta}_p x_{p,new}$
R command: predict(fit, newdata=data.frame(), interval="confidence", level=0.95)
Interval for a new point prediction $\hat{\beta}_0 + \hat{\beta}_1 x_{1,new} + \cdots + \hat{\beta}_p x_{p,new} + \varepsilon_{new}$
R command: predict(fit, newdata=data.frame(), interval="prediction", level=0.95)

6.17 The matrix formulation of the parameter estimators in the multiple linear regression model
Formula: $\hat{\beta} = (X^T X)^{-1} X^T Y$; $V[\hat{\beta}] = \sigma^2 (X^T X)^{-1}$; $\hat{\sigma}^2 = \frac{RSS}{n - (p + 1)}$

6.16 Model selection procedure
Backward selection: start with the full model and stepwise remove insignificant terms.
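The matrix formulation can be evaluated directly with R's matrix operators. This is a minimal sketch on hypothetical data (the data frame D is invented for illustration), compared against lm().

```r
# Hypothetical data with two explanatory variables, for illustration only
D <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(2, 1, 4, 3, 6, 5, 8, 7),
  y  = c(3.5, 3.2, 7.9, 6.4, 11.6, 10.1, 15.8, 14.2)
)

# Design matrix with a column of ones for the intercept
X <- cbind(1, D$x1, D$x2)
Y <- D$y
n <- nrow(X)
p <- ncol(X) - 1

# Parameter estimates: (X^T X)^{-1} X^T Y
XtX_inv <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% Y

# Residual variance estimate and standard errors of the estimates
RSS <- sum((Y - X %*% beta_hat)^2)
sigma2_hat <- RSS / (n - (p + 1))
se <- sqrt(diag(sigma2_hat * XtX_inv))

# t statistics and two-sided p-values for H0: beta_i = 0
tobs <- as.numeric(beta_hat) / se
pvalues <- 2 * (1 - pt(abs(tobs), df = n - (p + 1)))

# Built-in equivalent
fit <- lm(y ~ x1 + x2, data = D)
summary(fit)
```

In practice lm() is preferable, since it avoids forming the explicit inverse; the matrix version is shown only to connect the formulas to the fitted output.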

A.7 Inference for proportions

Description Formula R command


7.3 Proportion estimate and confidence interval
Formula: $\hat{p} = \frac{x}{n}$; $\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
R command: prop.test(x=, n=, correct=FALSE)

7.10 Approximate proportion with Z
Formula: $Z = \frac{X - np_0}{\sqrt{np_0(1 - p_0)}} \sim N(0, 1)$

7.11 The level α one-sample proportion hypothesis test
Test: $H_0: p = p_0$ vs. $H_1: p \neq p_0$ by p-value $= 2 \cdot P(Z > |z_{obs}|)$, where $Z \sim N(0, 1^2)$. If p-value < α then reject $H_0$, otherwise accept $H_0$.
R command: prop.test(x=, n=, correct=FALSE)

7.13 Sample size formula for the CI of a proportion
Guessed p (with prior knowledge): $n = p(1 - p)\left(\frac{z_{1-\alpha/2}}{ME}\right)^2$
Unknown p: $n = \frac{1}{4}\left(\frac{z_{1-\alpha/2}}{ME}\right)^2$

7.15 Difference of two proportions estimator $\hat{p}_1 - \hat{p}_2$ and confidence interval for the difference
Formula: $\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$; $(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2} \cdot \hat{\sigma}_{\hat{p}_1 - \hat{p}_2}$

7.18 The level α two-sample proportions hypothesis test
Test: $H_0: p_1 = p_2$ vs. $H_1: p_1 \neq p_2$ by p-value $= 2 \cdot P(Z > |z_{obs}|)$, where $Z \sim N(0, 1^2)$. If p-value < α then reject $H_0$, otherwise accept $H_0$.
R command: prop.test(x=, n=, correct=FALSE)

7.20 The multi-sample proportions χ²-test
Test: $H_0: p_1 = p_2 = \ldots = p_c = p$ by $\chi^2_{obs} = \sum_{i=1}^{2}\sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
R command: chisq.test(X, correct = FALSE)

7.22 The r × c frequency table χ²-test
Test: $H_0: p_{i1} = p_{i2} = \ldots = p_{ic} = p_i$ for all rows $i = 1, 2, \ldots, r$ by $\chi^2_{obs} = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$. Reject if $\chi^2_{obs} > \chi^2_{1-\alpha}\left((r - 1)(c - 1)\right)$, otherwise accept.
R command: chisq.test(X, correct = FALSE)
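The proportion formulas can be worked through as in the sketch below. The counts, the null value p0, the margin of error ME and the frequency table are hypothetical values chosen for illustration. One point worth knowing: prop.test() reports the statistic as X-squared, which equals the squared z statistic when correct=FALSE, and its confidence interval is a score-based interval, so it generally differs slightly from the interval formula given here.

```r
# Hypothetical counts, for illustration only
x <- 36; n <- 100; p0 <- 0.5; alpha <- 0.05

# Proportion estimate and confidence interval from the formula
phat <- x / n
z <- qnorm(1 - alpha / 2)
ci <- phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)

# One-sample test statistic and two-sided p-value
zobs <- (x - n * p0) / sqrt(n * p0 * (1 - p0))
pvalue <- 2 * (1 - pnorm(abs(zobs)))
res <- prop.test(x = x, n = n, p = p0, correct = FALSE)

# Sample size for a margin of error ME when p is unknown
ME <- 0.03
n_unknown <- ceiling(1 / 4 * (z / ME)^2)

# Frequency table: expected counts and chi-square statistic by hand
tab <- matrix(c(20, 30, 25, 25, 35, 15), nrow = 2)
e <- outer(rowSums(tab), colSums(tab)) / sum(tab)
chi2 <- sum((tab - e)^2 / e)
res_chi <- chisq.test(tab, correct = FALSE)
```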

A.8 Comparing means of multiple groups - ANOVA

Description Formula R command

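8.2 One-way ANOVA variation decomposition
Formula: $\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{SSE} + \underbrace{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2}_{SS(Tr)}$

8.4 One-way within group variability
Formula: $MSE = \frac{SSE}{n - k} = \frac{(n_1 - 1)s_1^2 + \cdots + (n_k - 1)s_k^2}{n - k}$, where $s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$

8.6 One-way test for difference in mean for k groups
$H_0: \alpha_i = 0$; $i = 1, 2, \ldots, k$
Formula: $F = \frac{SS(Tr)/(k - 1)}{SSE/(n - k)}$, which under $H_0$ follows an F-distribution with k − 1 and n − k degrees of freedom.
R command: anova(lm(y~treatm))

8.9 Post hoc pairwise confidence intervals
Formula: $\bar{y}_i - \bar{y}_j \pm t_{1-\alpha/2}\sqrt{\frac{SSE}{n - k}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$
If all $M = k(k - 1)/2$ combinations are considered, then use $\alpha_{Bonferroni} = \alpha/M$.

8.10 Post hoc pairwise hypothesis tests
Test: $H_0: \mu_i = \mu_j$ vs. $H_1: \mu_i \neq \mu_j$ by p-value $= 2 \cdot P(T > |t_{obs}|)$, where $t_{obs} = \frac{\bar{y}_i - \bar{y}_j}{\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$
Test $M = k(k - 1)/2$ times, but each time with $\alpha_{Bonferroni} = \alpha/M$.

8.13 Least Significant Difference (LSD) values
Formula: $LSD_\alpha = t_{1-\alpha/2}\sqrt{2 \cdot MSE/m}$, where m is the common number of observations in each group.

8.20 Two-way ANOVA variation decomposition
Formula: $\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{l} (y_{ij} - \hat{\mu})^2}_{SST} = \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{l} (y_{ij} - \hat{\alpha}_i - \hat{\beta}_j - \hat{\mu})^2}_{SSE} + \underbrace{l \cdot \sum_{i=1}^{k} \hat{\alpha}_i^2}_{SS(Tr)} + \underbrace{k \cdot \sum_{j=1}^{l} \hat{\beta}_j^2}_{SS(Bl)}$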

Description Formula R command


8.22 Test for difference in means in two-way ANOVA grouped in treatments and in blocks
$H_{0,Tr}: \alpha_i = 0$, $i = 1, 2, \ldots, k$
Formula: $F_{Tr} = \frac{SS(Tr)/(k - 1)}{SSE/((k - 1)(l - 1))}$
$H_{0,Bl}: \beta_j = 0$, $j = 1, 2, \ldots, l$
Formula: $F_{Bl} = \frac{SS(Bl)/(l - 1)}{SSE/((k - 1)(l - 1))}$
R command: fit<-lm(y~treatm+block); anova(fit)

One-way ANOVA

Source of variation | Degrees of freedom | Sums of squares | Mean sum of squares | Test-statistic F | p-value
Treatment | k − 1 | SS(Tr) | MS(Tr) = SS(Tr)/(k − 1) | Fobs = MS(Tr)/MSE | P(F > Fobs)
Residual | n − k | SSE | MSE = SSE/(n − k) | |
Total | n − 1 | SST | | |

Two-way ANOVA

Source of variation | Degrees of freedom | Sums of squares | Mean sums of squares | Test statistic F | p-value
Treatment | k − 1 | SS(Tr) | MS(Tr) = SS(Tr)/(k − 1) | FTr = MS(Tr)/MSE | P(F > FTr)
Block | l − 1 | SS(Bl) | MS(Bl) = SS(Bl)/(l − 1) | FBl = MS(Bl)/MSE | P(F > FBl)
Residual | (l − 1)(k − 1) | SSE | MSE = SSE/((k − 1)(l − 1)) | |
Total | n − 1 | SST | | |
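The one-way ANOVA table can be rebuilt from the sums of squares as in the following sketch, which also computes one Bonferroni-corrected post hoc confidence interval. The observations y and the grouping factor treatm are hypothetical values invented for illustration.

```r
# Hypothetical data: three treatment groups with four observations each
y <- c(2.8, 3.6, 3.4, 2.3, 5.5, 6.3, 6.1, 5.7, 5.8, 8.3, 6.9, 6.1)
treatm <- factor(rep(c("A", "B", "C"), each = 4))
n <- length(y)
k <- nlevels(treatm)

# Overall mean, group means and group sizes
ybar <- mean(y)
ybar_i <- tapply(y, treatm, mean)
n_i <- tapply(y, treatm, length)

# Variation decomposition: SST = SSE + SS(Tr)
SST <- sum((y - ybar)^2)
SSE <- sum((y - ave(y, treatm))^2)   # ave() repeats each group mean per observation
SSTr <- sum(n_i * (ybar_i - ybar)^2)

# F test for difference in means
MSE <- SSE / (n - k)
Fobs <- (SSTr / (k - 1)) / MSE
pvalue <- 1 - pf(Fobs, df1 = k - 1, df2 = n - k)

# Post hoc confidence interval for A versus B with Bonferroni correction
M <- k * (k - 1) / 2
alpha_bonf <- 0.05 / M
diff_AB <- unname(ybar_i["A"] - ybar_i["B"])
half <- qt(1 - alpha_bonf / 2, df = n - k) *
  sqrt(MSE * unname(1 / n_i["A"] + 1 / n_i["B"]))
ci_AB <- diff_AB + c(-1, 1) * half

# Built-in equivalent
tab <- anova(lm(y ~ treatm))
tab
```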



Glossaries

cumulated distribution function [Fordelingsfunktion] The cdf is the function which determines the probability of observing an outcome of a random variable at or below a given value

Continuous random variable [Kontinuert stokastisk variabel] If an outcome of an experiment takes a continuous value, for example a distance, a temperature or a weight, then it is represented by a continuous random variable

Correlation [Korrelation] The sample correlation coefficient is a summary statistic that can be calculated for two (related) sets of observations. It quantifies the (linear) strength of the relation between the two. See also: Covariance

Covariance [Kovarians] The sample covariance coefficient is a summary statistic that can be calculated for two (related) sets of observations. It quantifies the (linear) strength of the relation between the two. See also: Correlation

F-distribution [F-fordelingen] The F-distribution appears as the ratio between two independent χ²-distributed random variables

Inter Quartile Range [Interkvartil bredde] The Inter Quartile Range (IQR) is the middle 50% range of data

Median [Median, stikprøvemedian] The median of a population or sample (note: the text does not distinguish between the population median and the sample median)

probability density function The pdf is the function which determines the probability of every possible outcome of a random variable

Quantile [Fraktil, stikprøvefraktil] The quantiles of a population or sample (note: the text does not distinguish between the population quantile and the sample quantile)

Quartile [Fraktil, stikprøvefraktil] The quartiles of a population or sample (note: the text does not distinguish between the population quartile and the sample quartile)

Sample variance [Empirisk varians, stikprøvevarians]

Sample mean [Stikprøvegennemsnit] The average of a sample

Standard deviation [Standard afvigelse]



Acronyms

ANOVA Analysis of Variance

cdf cumulated distribution function

CI confidence interval

CLT Central Limit Theorem

IQR Inter Quartile Range

LSD Least Significant Difference

pdf probability density function
