SIMFIT Manual


SIMFIT

Simulation, fitting, statistics, and plotting.


http://www.simfit.man.ac.uk
[Cover figures: a Scatchard plot for the 2 2 isoform (y/x (M⁻¹) against y; T = 21°C, [Ca++] = 1.3×10⁻⁷ M; 1-site and 2-site model fits); a Bray-Curtis similarity dendrogram (percentage similarity); "Fitting a Convolution Integral f*g" with f(t) = exp(−t) and g(t) = 2t exp(−t) plotted against time t; "Deconvolution of 3 Gaussians"; orbits for a system of differential equations (y(1) against y(2)); and trinomial parameter 95% confidence regions (py against px).]

Reference Manual: Version 6.0.20

Contents
1 Overview
  1.1 Installation
  1.2 Documentation
  1.3 Plotting

2 First time user's guide
  2.1 The main menu
  2.2 The task bar
  2.3 The file selection control
    2.3.1 Multiple file selection
      2.3.1.1 The project technique
      2.3.1.2 Checking and archiving project files
  2.4 First time user's guide to data handling
    2.4.1 The format for input data files
    2.4.2 File extensions and folders
    2.4.3 Advice concerning data files
    2.4.4 Advice concerning curve fitting files
    2.4.5 Example 1: Making a curve fitting file
    2.4.6 Example 2: Editing a curve fitting file
    2.4.7 Example 3: Making a library file
    2.4.8 Example 4: Making a vector/matrix file
    2.4.9 Example 5: Editing a vector/matrix file
    2.4.10 Example 6: Saving data-base/spread-sheet tables to files
  2.5 First time user's guide to graph plotting
    2.5.1 The SIMFIT simple graphical interface
    2.5.2 The SIMFIT advanced graphical interface
    2.5.3 PostScript, GSview/Ghostscript and SIMFIT
    2.5.4 Example 1: Creating a simple graph
    2.5.5 Example 2: Error bars
    2.5.6 Example 3: Histograms and cumulative distributions
    2.5.7 Example 4: Double graphs with two scales
    2.5.8 Example 5: Bar charts
    2.5.9 Example 6: Pie charts
    2.5.10 Example 7: Surfaces, contours and 3D bar charts
  2.6 First time user's guide to curve fitting
    2.6.1 User friendly curve fitting programs
    2.6.2 IFAIL and IOSTAT error messages
    2.6.3 Example 1: Exponential functions
    2.6.4 Example 2: Nonlinear growth and survival curves
    2.6.5 Example 3: Enzyme kinetic and ligand binding data
  2.7 First time user's guide to simulation
    2.7.1 Why fit simulated data?
    2.7.2 Programs makdat and adderr
    2.7.3 Example 1: Simulating y = f(x)
    2.7.4 Example 2: Simulating z = f(x, y)
    2.7.5 Example 3: Simulating experimental error
    2.7.6 Example 4: Simulating differential equations
    2.7.7 Example 5: Simulating user-defined equations

3 Data analysis techniques
  3.1 Types of data and measurement scales
  3.2 Principles involved when fitting models to data
    3.2.1 Limitations when fitting models
    3.2.2 Fitting linear models
    3.2.3 Fitting generalized linear models
    3.2.4 Fitting nonlinear models
    3.2.5 Fitting survival models
    3.2.6 Distribution of statistics from regression
      3.2.6.1 The chi-square test for goodness of fit
      3.2.6.2 The t test for parameter redundancy
      3.2.6.3 The F test for model discrimination
      3.2.6.4 Analysis of residuals
      3.2.6.5 How good is the fit?
      3.2.6.6 Using graphical deconvolution to assess goodness of fit
      3.2.6.7 Testing for differences between two parameter estimates
      3.2.6.8 Testing for differences between several parameter estimates
  3.3 Linear regression
  3.4 Robust regression
  3.5 Regression on ranks
  3.6 Generalized linear models (GLM)
    3.6.1 GLM examples
    3.6.2 The SIMFIT simplified Generalized Linear Models interface
    3.6.3 Logistic regression
    3.6.4 Conditional binary logistic regression with stratified data
  3.7 Nonlinear regression
    3.7.1 Exponentials
      3.7.1.1 How to interpret parameter estimates
      3.7.1.2 How to interpret goodness of fit
      3.7.1.3 How to interpret model discrimination results
    3.7.2 High/low affinity sites
    3.7.3 Cooperative ligand binding
    3.7.4 Michaelis-Menten kinetics
    3.7.5 Positive rational functions
    3.7.6 Isotope displacement kinetics
    3.7.7 Nonlinear growth curves
    3.7.8 Nonlinear survival curves
    3.7.9 Advanced curve fitting
      3.7.9.1 Fitting multi-function models using qnfit
    3.7.10 Differential equations
  3.8 Calibration and Bioassay
    3.8.1 Calibration curves
      3.8.1.1 Turning points in calibration curves
      3.8.1.2 Calibration using linfit and polnom
      3.8.1.3 Calibration using calcurve
      3.8.1.4 Calibration using qnfit
    3.8.2 Dose response curves, EC50, IC50, ED50, and LD50
    3.8.3 95% confidence regions in inverse prediction
  3.9 Statistics
    3.9.1 Tests
    3.9.2 Multiple tests
    3.9.3 Data exploration
      3.9.3.1 Exhaustive analysis: arbitrary vector
      3.9.3.2 Exhaustive analysis: arbitrary matrix
      3.9.3.3 Exhaustive analysis: multivariate normal matrix
      3.9.3.4 t tests on groups across rows of a matrix
      3.9.3.5 Nonparametric tests across rows of a matrix
      3.9.3.6 All possible pairwise tests (n vectors or a library file)
    3.9.4 Statistical tests
      3.9.4.1 1-sample t test
      3.9.4.2 1-sample Kolmogorov-Smirnov test
      3.9.4.3 1-sample Shapiro-Wilks test for normality
      3.9.4.4 1-sample Dispersion and Fisher exact Poisson tests
      3.9.4.5 2-sample unpaired t and variance ratio tests
      3.9.4.6 2-sample paired t test
      3.9.4.7 2-sample Kolmogorov-Smirnov test
      3.9.4.8 2-sample Wilcoxon-Mann-Whitney U test
      3.9.4.9 2-sample Wilcoxon signed-ranks test
      3.9.4.10 Chi-square test on observed and expected frequencies
      3.9.4.11 Chi-square and Fisher-exact contingency table tests
      3.9.4.12 McNemar test
      3.9.4.13 Cochran Q repeated measures test on a matrix of 0,1 values
      3.9.4.14 The binomial test
      3.9.4.15 The sign test
      3.9.4.16 The run test
      3.9.4.17 The F test for excess variance
    3.9.5 Nonparametric tests using rstest
      3.9.5.1 Runs up and down test for randomness
      3.9.5.2 Median test
      3.9.5.3 Mood's test and David's test for equal dispersion
      3.9.5.4 Kendall coefficient of concordance
    3.9.6 Analysis of variance
      3.9.6.1 ANOVA (1): 1-way and Kruskal-Wallis (n samples or library file)
      3.9.6.2 ANOVA (1): Tukey Q test (n samples or library file)
      3.9.6.3 ANOVA (1): Plotting 1-way data
      3.9.6.4 ANOVA (2): 2-way and the Friedman test (one matrix)
      3.9.6.5 ANOVA (3): 3-way and Latin Square design (one matrix)
      3.9.6.6 ANOVA (4): Groups and subgroups (one matrix)
      3.9.6.7 ANOVA (5): Factorial design (one matrix)
      3.9.6.8 ANOVA (6): Repeated measures (one matrix)
    3.9.7 Analysis of proportions
      3.9.7.1 Dichotomous data
      3.9.7.2 Confidence limits for analysis of two proportions
      3.9.7.3 Meta analysis
      3.9.7.4 Bioassay, estimating percentiles
      3.9.7.5 Trichotomous data
    3.9.8 Multivariate statistics
      3.9.8.1 Correlation: parametric (Pearson product moment)
      3.9.8.2 Correlation: nonparametric (Kendall tau and Spearman rank)
      3.9.8.3 Correlation: partial
      3.9.8.4 Correlation: canonical
      3.9.8.5 Cluster analysis: multivariate dendrograms
      3.9.8.6 Cluster analysis: classical metric scaling
      3.9.8.7 Cluster analysis: non-metric (ordinal) scaling
      3.9.8.8 Cluster analysis: K-means
      3.9.8.9 Principal components analysis
      3.9.8.10 Procrustes analysis
      3.9.8.11 Varimax and Quartimax rotation
      3.9.8.12 Multivariate analysis of variance (MANOVA)
      3.9.8.13 Comparing groups: canonical variates (discriminant functions)
      3.9.8.14 Comparing groups: Mahalanobis distances (discriminant analysis)
      3.9.8.15 Comparing groups: Assigning new observations
      3.9.8.16 Factor analysis
      3.9.8.17 Biplots
    3.9.9 Time series
      3.9.9.1 Time series data smoothing
      3.9.9.2 Time series lags and autocorrelations
      3.9.9.3 Autoregressive integrated moving average models (ARIMA)
    3.9.10 Survival analysis
      3.9.10.1 Fitting one set of survival times
      3.9.10.2 Comparing two sets of survival times
      3.9.10.3 Survival analysis using generalized linear models
      3.9.10.4 The exponential survival model
      3.9.10.5 The Weibull survival model
      3.9.10.6 The extreme value survival model
      3.9.10.7 The Cox proportional hazards model
      3.9.10.8 Comprehensive Cox regression
    3.9.11 Statistical calculations
      3.9.11.1 Statistical power and sample size
      3.9.11.2 Power calculations for 1 binomial sample
      3.9.11.3 Power calculations for 2 binomial samples
      3.9.11.4 Power calculations for 1 normal sample
      3.9.11.5 Power calculations for 2 normal samples
      3.9.11.6 Power calculations for k normal samples
      3.9.11.7 Power calculations for 1 and 2 variances
      3.9.11.8 Power calculations for 1 and 2 correlations
      3.9.11.9 Power calculations for a chi-square test
      3.9.11.10 Parameter confidence limits
      3.9.11.11 Confidence limits for a Poisson parameter
      3.9.11.12 Confidence limits for a binomial parameter
      3.9.11.13 Confidence limits for a normal mean and variance
      3.9.11.14 Confidence limits for a correlation coefficient
      3.9.11.15 Confidence limits for trinomial parameters
      3.9.11.16 Robust analysis of one sample
      3.9.11.17 Robust analysis of two samples
      3.9.11.18 Indices of diversity
      3.9.11.19 Standard and non-central distributions
      3.9.11.20 Cooperativity analysis
      3.9.11.21 Generating random numbers, permutations and Latin squares
      3.9.11.22 Kernel density estimation
    3.9.12 Numerical analysis
      3.9.12.1 Zeros of a polynomial of degree n − 1
      3.9.12.2 Determinants, inverses, eigenvalues, and eigenvectors
      3.9.12.3 Singular value decomposition
      3.9.12.4 LU factorization of a matrix, norms and condition numbers
      3.9.12.5 QR factorization of a matrix
      3.9.12.6 Cholesky factorization of a positive-definite symmetric matrix
      3.9.12.7 Matrix multiplication
      3.9.12.8 Evaluation of quadratic forms
      3.9.12.9 Solving Ax = b (full rank)
      3.9.12.10 Solving Ax = b (L1, L2, L∞ norms)
      3.9.12.11 The symmetric eigenvalue problem
  3.10 Areas, slopes, lag times and asymptotes
    3.10.1 Models used by program inrate
    3.10.2 Estimating initial rates using inrate
    3.10.3 Lag times and steady states using inrate
    3.10.4 Model-free fitting using compare
    3.10.5 Estimating averages and AUC using deterministic equations
    3.10.6 Estimating AUC using average
  3.11 Spline smoothing
    3.11.1 Fixed knots
    3.11.2 Automatic knots
    3.11.3 Cross validation
    3.11.4 Using splines

4 Graph plotting techniques
  4.1 Graphical objects
    4.1.1 Symbols
    4.1.2 Lines: standard types
    4.1.3 Lines: extending to boundaries
    4.1.4 Text
    4.1.5 Fonts, character sizes and line thicknesses
    4.1.6 Arrows
    4.1.7 Example of plotting without data: Venn diagram
    4.1.8 Polygons
  4.2 Sizes and shapes
    4.2.1 Alternative axes and labels
    4.2.2 Transformed data
    4.2.3 Alternative sizes, shapes and clipping
    4.2.4 Rotated and re-scaled graphs
    4.2.5 Changed aspect ratios and shear transformations
    4.2.6 Reduced or enlarged graphs
    4.2.7 Split axes
  4.3 Equations
    4.3.1 Maths
    4.3.2 Chemical Formulæ
    4.3.3 Composite graphs
  4.4 Bar charts and pie charts
    4.4.1 Perspective effects
    4.4.2 Advanced barcharts
    4.4.3 Three dimensional barcharts
  4.5 Error bars
    4.5.1 Error bars with barcharts
    4.5.2 Error bars with skyscraper and cylinder plots
    4.5.3 Slanting and multiple error bars
    4.5.4 Calculating error bars interactively
    4.5.5 Binomial parameter error bars
    4.5.6 Log-Odds error bars
    4.5.7 Log-Odds-Ratios error bars
  4.6 Statistical graphs
    4.6.1 Clusters, connections, correlations, and scattergrams
    4.6.2 Bivariate confidence ellipses 1: basic theory
    4.6.3 Bivariate confidence ellipses 2: regions
    4.6.4 Dendrograms 1: standard format
    4.6.5 Dendrograms 2: stretched format
    4.6.6 Dendrograms 3: plotting subgroups
    4.6.7 K-Means clustering 1: UK airports
    4.6.8 K-Means clustering 2: highlighting centroids
    4.6.9 Principal components
    4.6.10 Labelling statistical graphs
    4.6.11 Probability distributions
    4.6.12 Survival analysis
    4.6.13 Goodness of fit to a Poisson distribution
    4.6.14 Trinomial parameter joint confidence regions
    4.6.15 Random walks
    4.6.16 Power as a function of sample size
  4.7 Three dimensional plotting
    4.7.1 Surfaces and contours
    4.7.2 The objective function at solution points
    4.7.3 Sequential sections across best fit surfaces
    4.7.4 Plotting contours for Rosenbrock optimization trajectory
    4.7.5 Three dimensional space curves
    4.7.6 Projecting space curves onto planes
    4.7.7 Three dimensional scatter diagrams
    4.7.8 Two dimensional families of curves
    4.7.9 Three dimensional families of curves
  4.8 Differential equations
    4.8.1 Phase portraits of plane autonomous systems
    4.8.2 Orbits of differential equations
  4.9 Specialized techniques
    4.9.1 Deconvolution 1: Graphical deconvolution of complex models
    4.9.2 Deconvolution 2: Fitting convolution integrals
    4.9.3 Extrapolation
    4.9.4 Segmented models with cross-over points
    4.9.5 Plotting single impulse functions
    4.9.6 Plotting periodic impulse functions
    4.9.7 Flow cytometry
    4.9.8 Subsidiary figures as insets
    4.9.9 Nonlinear growth curves
    4.9.10 Ligand binding species fractions
    4.9.11 Immunoassay and dose-response dilution curves
  4.10 Parametric curves
    4.10.1 r = r(θ) parametric plot 1: Eight leaved rose
    4.10.2 r = r(θ) parametric plot 2: Logarithmic spiral with tangent

A Distributions and special functions
  A.1 Discrete distribution functions
    A.1.1 Bernoulli distribution
    A.1.2 Binomial distribution
    A.1.3 Multinomial distribution
    A.1.4 Geometric distribution
    A.1.5 Negative binomial distribution
    A.1.6 Hypergeometric distribution
    A.1.7 Poisson distribution
  A.2 Continuous distributions
    A.2.1 Uniform distribution
    A.2.2 Normal (or Gaussian) distribution
      A.2.2.1 Example 1. Sums of normal variables
      A.2.2.2 Example 2. Convergence of a binomial to a normal distribution
      A.2.2.3 Example 3. Distribution of a normal sample mean and variance
      A.2.2.4 Example 4. The central limit theorem
    A.2.3 Lognormal distribution
    A.2.4 Bivariate normal distribution
    A.2.5 Multivariate normal distribution
    A.2.6 t distribution
    A.2.7 Cauchy distribution
    A.2.8 Chi-square distribution
    A.2.9 F distribution
    A.2.10 Exponential distribution
    A.2.11 Beta distribution
    A.2.12 Gamma distribution
    A.2.13 Weibull distribution
    A.2.14 Logistic distribution
    A.2.15 Log logistic distribution
  A.3 Non-central distributions
    A.3.1 Non-central beta distribution
    A.3.2 Non-central chi-square distribution
    A.3.3 Non-central F distribution
    A.3.4 Non-central t distribution
  A.4 Variance stabilizing transformations
    A.4.1 Angular transformation
    A.4.2 Square root transformation
    A.4.3 Log transformation
  A.5 Special functions
    A.5.1 Binomial coefficient
    A.5.2 Gamma and incomplete gamma functions
    A.5.3 Beta and incomplete beta functions
    A.5.4 Exponential integrals
    A.5.5 Sine and cosine integrals and Euler's gamma
    A.5.6 Fermi-Dirac integrals
    A.5.7 Debye functions
    A.5.8 Clausen integral
    A.5.9 Spence integral
    A.5.10 Dawson integral
    A.5.11 Fresnel integrals
    A.5.12 Polygamma functions
    A.5.13 Struve functions
    A.5.14 Kummer confluent hypergeometric functions
    A.5.15 Abramovitz functions
    A.5.16 Legendre polynomials
    A.5.17 Bessel, Kelvin, and Airy functions
    A.5.18 Elliptic integrals
    A.5.19 Single impulse functions
      A.5.19.1 Heaviside unit function
      A.5.19.2 Kronecker delta function
      A.5.19.3 Unit impulse function
      A.5.19.4 Unit spike function
      A.5.19.5 Gauss pdf
    A.5.20 Periodic impulse functions
      A.5.20.1 Square wave function
      A.5.20.2 Rectified triangular wave
      A.5.20.3 Morse dot wave function
      A.5.20.4 Sawtooth wave function
      A.5.20.5 Rectified sine wave function
      A.5.20.6 Rectified sine half-wave function
      A.5.20.7 Unit impulse wave function

B User defined models
  B.1 Supplying models as a dynamic link library
  B.2 Supplying models as ASCII text files
    B.2.1 Example 1: a straight line
    B.2.2 Example 2: damped simple harmonic motion
    B.2.3 Example 3: diffusion into a capillary
    B.2.4 Example 4: defining three models at the same time
    B.2.5 Example 5: Lotka-Volterra predator-prey differential equations
    B.2.6 Example 6: supplying initial conditions
    B.2.7 Example 7: transforming differential equations
    B.2.8 Formatting conventions for user defined models
      B.2.8.1 Table of user-defined model commands
      B.2.8.2 Table of synonyms for user-defined model commands
      B.2.8.3 Error handling in user defined models
      B.2.8.4 Notation for functions of more than three variables
      B.2.8.5 The commands put(.) and get(.)
      B.2.8.6 The command get3(.,.,.)
      B.2.8.7 The commands epsabs and epsrel
      B.2.8.8 The commands blim(.) and tlim(.)
    B.2.9 Plotting user defined models
    B.2.10 Finding zeros of user defined models
    B.2.11 Finding zeros of n functions in n variables
    B.2.12 Integrating 1 function of 1 variable
    B.2.13 Integrating n functions of m variables
    B.2.14 Calling sub-models from user-defined models
      B.2.14.1 The command putpar
      B.2.14.2 The command value(.)
      B.2.14.3 The command quad(.)
      B.2.14.4 The command convolute(.,.)
      B.2.14.5 The command root(.)
      B.2.14.6 The command value3(.,.,.)
      B.2.14.7 The command order
      B.2.14.8 The command middle
      B.2.14.9 The syntax for subsidiary models
      B.2.14.10 Rules for using sub-models
      B.2.14.11 Nesting subsidiary models
      B.2.14.12 IFAIL values for D01AJF, D01AEF and C05AZF
      B.2.14.13 Test files illustrating how to call sub-models
    B.2.15 Calling special functions from user-defined models
      B.2.15.1 Table of special function commands
      B.2.15.2 Using the command middle with special functions
      B.2.15.3 Special functions with one argument
      B.2.15.4 Special functions with two arguments
      B.2.15.5 Special functions with three or more arguments
      B.2.15.6 Test files illustrating how to call special functions
    B.2.16 Operations with scalars and vectors
      B.2.16.1 The command store(j)
      B.2.16.2 The command storef(file)
      B.2.16.3 The command poly(x,m,n)
      B.2.16.4 The command cheby(x,m,n)
      B.2.16.5 The commands l1norm(m,n), l2norm(m,n) and linorm(m,n)
      B.2.16.6 The commands sum(m,n) and ssq(m,n)
      B.2.16.7 The command dotprod(l,m,n)
      B.2.16.8 Commands to use mathematical constants
    B.2.17 Integer functions
    B.2.18 Logical functions
    B.2.19 Conditional execution
    B.2.20 Arbitrary functions with arbitrary arguments
    B.2.21 Using usermod with user-defined models
    B.2.22 Locating a zero of one function of one variable
    B.2.23 Locating zeros of n functions of n variables
    B.2.24 Integrating one function of one variable
    B.2.25 Integrating n functions of m variables
    B.2.26 Bound-constrained quasi-Newton optimization

C Library of models
  C.1 Mathematical models [Library: Version 2.0]
  C.2 Functions of one variable
    C.2.1 Differential equations
    C.2.2 Systems of differential equations
    C.2.3 Special models
    C.2.4 Biological models
    C.2.5 Biochemical models
    C.2.6 Chemical models
    C.2.7 Physical models
    C.2.8 Statistical models
    C.2.9 Empirical models
    C.2.10 Mathematical models
  C.3 Functions of two variables
    C.3.1 Polynomials
    C.3.2 Rational functions
    C.3.3 Enzyme kinetics
    C.3.4 Biological
    C.3.5 Physical
    C.3.6 Statistical
  C.4 Functions of three variables
    C.4.1 Polynomials
    C.4.2 Enzyme kinetics
    C.4.3 Biological
    C.4.4 Statistics

D Editing PostScript files
  D.1 The format of SIMFIT PostScript files
    D.1.1 Warning about editing PostScript files
    D.1.2 The percent-hash escape sequence
    D.1.3 Changing line thickness and plot size
    D.1.4 Changing PostScript fonts
    D.1.5 Changing title and legends
    D.1.6 Deleting graphical objects
    D.1.7 Changing line and symbol types
    D.1.8 Adding extra text
    D.1.9 Standard fonts
    D.1.10 Decorative fonts
    D.1.11 Plotting characters outside the keyboard set
    D.1.12 The StandardEncoding Vector
    D.1.13 The ISOLatin1Encoding Vector
    D.1.14 The SymbolEncoding Vector
    D.1.15 The ZapfDingbatsEncoding Vector
    D.1.16 SIMFIT character display codes
  D.2 editps text formatting commands
    D.2.1 Special text formatting commands, e.g. left
    D.2.2 Coordinate text formatting commands, e.g. raise
    D.2.3 Currency text formatting commands, e.g. dollar
    D.2.4 Maths text formatting commands, e.g. divide
    D.2.5 Scientific units text formatting commands, e.g. Angstrom
    D.2.6 Font text formatting commands, e.g. roman
    D.2.7 Poor man's bold text formatting command, e.g. pmb?
    D.2.8 Punctuation text formatting commands, e.g. dagger
    D.2.9 Letters and accents text formatting commands, e.g. Aacute
    D.2.10 Greek text formatting commands, e.g. alpha
    D.2.11 Line and Symbol text formatting commands, e.g. ce
    D.2.12 Examples of text formatting commands
  D.3 PostScript specials
    D.3.1 What specials can do
    D.3.2 The technique for defining specials
    D.3.3 Examples of PostScript specials

E Auxiliary programs
  E.1 Recommended software
    E.1.1 The interface between SIMFIT and GSview/Ghostscript
    E.1.2 The interface between SIMFIT, LaTeX and Dvips
  E.2 SIMFIT, Microsoft Office, and OpenOffice
  E.3 Definitions
    E.3.1 Data tables
    E.3.2 Labeled data tables
    E.3.3 Missing values
    E.3.4 SIMFIT data files
    E.3.5 SIMFIT data files with labels
    E.3.6 Clipboard data
    E.3.7 Files exported from spreadsheet programs
  E.4 Spreadsheet tables
  E.5 Using the clipboard to transfer data into SIMFIT
    E.5.1 Pasting data from the clipboard directly into SIMFIT
    E.5.2 Converting data from the clipboard into a SIMFIT file
  E.6 Using spreadsheet output files to transfer data into SIMFIT
    E.6.1 Space-delimited files (.txt)
    E.6.2 Comma-delimited files (.csv)
    E.6.3 Semicolon-delimited files (.csv)
    E.6.4 Tab-delimited files (.txt)
    E.6.5 Unicode (.txt)
    E.6.6 Web documents (.xml, .html, .htm, .mht, .mhtml)
  E.7 Using simfit6.xls with Excel to create SIMFIT data files
    E.7.1 The functionality of simfit6.xls
    E.7.2 The General Number format
    E.7.3 Using the simfit6.xls macro
      E.7.3.1 Step 1: Open the simfit6.xls workbook
      E.7.3.2 Step 2: Select the data table within the user's workbook
      E.7.3.3 Step 3: Invoke the macro
      E.7.3.4 A tip on entering the pathname for the export file
  E.8 Using transformsim.xls with Excel to create SIMFIT data files
    E.8.1 The functionality of transformsim.xls
      E.8.1.1 A significant difference between simfit6.xls and transformsim.xls
      E.8.1.2 Filling empty cells found in the data table
      E.8.1.3 Performing transformations of the data table
      E.8.1.4 Transposing the SIMFIT table
      E.8.1.5 Inspecting and saving the modified worksheet
      E.8.1.6 The History Log
  E.9 Importing SIMFIT results tables into documents and spreadsheets
    E.9.1 Log files
    E.9.2 Extracting tables
    E.9.3 Printing results
    E.9.4 Importing tables into documents
    E.9.5 Importing a SIMFIT results log file table into Excel
    E.9.6 Converting a SIMFIT results log file table into a Word table (manual method)
    E.9.7 Converting a SIMFIT results log file table into a Word table using the ConvertToTable macro
      E.9.7.1 Step 1: Import the SIMFIT results log file material into Word
      E.9.7.2 Step 2: Click the Show/Hide button on the Standard toolbar
      E.9.7.3 Step 3: Select the lines of text which are to be converted into a Word table
      E.9.7.4 Step 4: Invoke the macro
      E.9.7.5 Space characters in Row and Column labels
      E.9.7.6 Deactivating or Removing the ConvertToTable.dot Global Template
  E.10 Printing and importing SIMFIT graphs into documents
    E.10.1 Windows print quality
    E.10.2 Windows bitmaps and compressed bitmaps
    E.10.3 Windows Enhanced metafiles (.emf)
    E.10.4 PostScript print quality
    E.10.5 PostScript graphics files (.eps)
    E.10.6 Ghostscript generated files
    E.10.7 Using Encapsulated PostScript (.eps) files directly

F The SIMFIT package
  F.1 SIMFIT program files
    F.1.1 Dynamic Link Libraries
    F.1.2 Executables
  F.2 SIMFIT data files
    F.2.1 Example 1: a vector
    F.2.2 Example 2: a matrix
    F.2.3 Example 3: an integer matrix
    F.2.4 Example 4: appending labels
    F.2.5 Example 5: using begin ... end to add labels
    F.2.6 Example 6: various uses of begin ... end
    F.2.7 Example 7: starting estimates and parameter limits
  F.3 SIMFIT auxiliary files
    F.3.1 Test files (Data)
    F.3.2 Library files (Data)
    F.3.3 Test files (Models)
    F.3.4 Miscellaneous data files
    F.3.5 Parameter limits files
    F.3.6 Error message files
    F.3.7 PostScript example files
    F.3.8 SIMFIT configuration files
    F.3.9 Graphics configuration files
    F.3.10 Default files
    F.3.11 Temporary files
    F.3.12 NAG library files (contents of list.nag)
  F.4 Acknowledgements

List of Tables
2.1 Data for a double graph
3.1 Multilinear regression
3.2 Robust regression
3.3 Regression on ranks
3.4 GLM example 1: normal errors
3.5 GLM example 2: binomial errors
3.6 GLM example 3: Poisson errors
3.7 GLM contingency table analysis: 1
3.8 GLM contingency table analysis: 2
3.9 GLM example 4: gamma errors
3.10 Dummy indicators for categorical variables
3.11 Binary logistic regression
3.12 Conditional binary logistic regression
3.13 Fitting two exponentials: 1. parameter estimates
3.14 Fitting two exponentials: 2. correlation matrix
3.15 Fitting two exponentials: 3. goodness of fit statistics
3.16 Fitting two exponentials: 4. model discrimination statistics
3.17 Fitting nonlinear growth models
3.18 Exhaustive analysis of an arbitrary vector
3.19 Exhaustive analysis of an arbitrary matrix
3.20 Statistics on paired columns of a matrix
3.21 Hotelling T² test for H0: means = reference
3.22 Hotelling T² test for H0: means are equal
3.23 Covariance matrix symmetry and sphericity tests
3.24 t tests on groups across rows of a matrix
3.25 Nonparametric tests across rows
3.26 All possible comparisons
3.27 One sample t test
3.28 Kolmogorov-Smirnov 1-sample and Shapiro-Wilks tests
3.29 Poisson distribution tests
3.30 Unpaired t test
3.31 Paired t test
3.32 Kolmogorov-Smirnov 2-sample test
3.33 Wilcoxon-Mann-Whitney U test
3.34 Wilcoxon signed-ranks test
3.35 Chi-square test on observed and expected frequencies
3.36 Fisher exact contingency table test
3.37 Chi-square and likelihood ratio contingency table tests: 2 by 2
3.38 Chi-square and likelihood ratio contingency table tests: 2 by 6
3.39 Loglinear contingency table analysis
3.40 Observed and expected frequencies
3.41 McNemar test
3.42 Cochran Q repeated measures test
3.43 Binomial test
3.44 Sign test
3.45 Run test
3.46 F test for excess variance
3.47 Runs up and down test for randomness
3.48 Median test
3.49 Mood-David equal dispersion tests
3.50 Kendall coefficient of concordance: results
3.51 Kendall coefficient of concordance: data
3.52 ANOVA example 1(a): 1-way and the Kruskal-Wallis test
3.53 ANOVA example 1(b): 1-way and the Tukey Q test
3.54 ANOVA example 2: 2-way and the Friedman test
3.55 ANOVA example 3: 3-way and Latin square design
3.56 ANOVA example 4: arbitrary groups and subgroups
3.57 ANOVA example 5: factorial design
3.58 ANOVA example 6: repeated measures
3.59 Analysis of proportions: dichotomous data
3.60 Analysis of proportions: meta analysis
3.61 Analysis of proportions: risk difference
3.62 Analysis of proportion: meta analysis with zero frequencies
3.63 Correlation: Pearson product moment analysis
3.64 Correlation: analysis of selected columns
3.65 Correlation: Kendall-tau and Spearman-rank
3.66 Correlation: partial
3.67 Correlation: partial correlation matrix
3.68 Correlation: canonical
3.69 Cluster analysis: distance matrix
3.70 Cluster analysis: partial clustering for Iris data
3.71 Cluster analysis: metric and non-metric scaling
3.72 Cluster analysis: K-means clustering
3.73 K-means clustering for Iris data
3.74 Principal components analysis
3.75 Procrustes analysis
3.76 Varimax rotation
3.77 MANOVA example 1a. Typical one way MANOVA layout
3.78 MANOVA example 1b. Test for equality of all means
3.79 MANOVA example 1c. The distribution of Wilks' Λ
3.80 MANOVA example 2. Test for equality of selected means
3.81 MANOVA example 3. Test for equality of all covariance matrices
3.82 MANOVA example 4. Profile analysis
3.83 Comparing groups: canonical variates
3.84 Comparing groups: Mahalanobis distances
3.85 Comparing groups: Assigning new observations
3.86 Factor analysis 1: calculating loadings
3.87 Factor analysis 2: calculating factor scores
3.88 Singular values for East Jerusalem Households
3.89 Autocorrelations and Partial Autocorrelations
3.90 Fitting an ARIMA model to time series data
3.91 Survival analysis: one sample
3.92 Survival analysis: two samples
3.93 GLM survival analysis
Robust analysis of one sample . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

103 104 105 105 106 107 109 109 110 110 111 114 115 116 118 119 119 122 126 128 129 130 133 134 136 136 137 138 140 143 144 146 147 149 151 152 154 155 155 156 157 158 159 161 162 164 166 169 172 175 177 178 181 193

List of Tables

xv

3.95 3.96 3.97 3.98 3.99 3.100 3.101 3.102 3.103 3.104 3.105 3.106 3.107 3.108 3.109 3.110 F.1 F.2 F.3 F.4 F.5 F.6 F.7

Robust analysis of two samples . . . . . . . . . . . . . . . . . . Indices of diversity . . . . . . . . . . . . . . . . . . . . . . . . . Latin squares: 4 by 4 random designs . . . . . . . . . . . . . . . Latin squares: higher order random designs . . . . . . . . . . . . Zeros of a polynomial . . . . . . . . . . . . . . . . . . . . . . . Matrix example 1: Determinant, inverse, eigenvalues, eigenvectors Matrix example 2: Singular value decomposition . . . . . . . . . Matrix example 3: LU factorization and condition number . . . . Matrix example 4: QR factorization . . . . . . . . . . . . . . . . Matrix example 5: Cholesky factorization . . . . . . . . . . . . . Matrix example 6: Evaluation of quadratic forms . . . . . . . . . Solving Ax = b: square where A1 exists . . . . . . . . . . . . . Solving Ax = b: overdetermined in 1, 2 and norms . . . . . . . The symmetric eigenvalue problem . . . . . . . . . . . . . . . . Comparing two data sets . . . . . . . . . . . . . . . . . . . . . . Spline calculations . . . . . . . . . . . . . . . . . . . . . . . . . Test le vector.tf1 . . . . . . . . . . Test le matrix.tf1 . . . . . . . . . Test le binomial.tf3 . . . . . . . . Test le cluster.tf1 (original version) Test le piechart.tf1 . . . . . . . . . Test le kmeans.tf1 . . . . . . . . . Test le gauss3.tf1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . .

194 194 198 198 200 201 202 202 204 204 205 206 206 207 211 218 377 378 378 379 379 380 380

List of Figures

1.1 Collage 1
1.2 Collage 2
1.3 Collage 3
2.1 The main SIMFIT menu
2.2 The SIMFIT file selection control
2.3 Five x,y values
2.4 The SIMFIT simple graphical interface
2.5 The SIMFIT advanced graphical interface
2.6 The SIMFIT PostScript driver interface
2.7 The simplot default graph
2.8 The finished plot and Scatchard transform
2.9 A histogram and cumulative distribution
2.10 Plotting a double graph with two scales
2.11 Typical bar chart features
2.12 Typical pie chart features
2.13 Plotting surfaces, contours and 3D-bar charts
2.14 Alternative types of exponential functions
2.15 Using exfit to fit exponentials
2.16 Typical growth curve models
2.17 Using gcfit to fit growth curves
2.18 Original plot and Scatchard transform
2.19 Substrate inhibition plot and semilog transform
2.20 The normal cdf
2.21 Using makdat to calculate a range
2.22 A 3D surface plot
2.23 Adding random error
2.24 The Lotka-Volterra equations and phase plane
2.25 Plotting user supplied equations
3.1 Fitting exponential functions
3.2 Fitting high/low affinity sites
3.3 Fitting positive rational functions
3.4 Isotope displacement kinetics
3.5 Estimating growth curve parameters
3.6 Fitting three equations simultaneously
3.7 Fitting the epidemic differential equations
3.8 A linear calibration curve
3.9 A cubic spline calibration curve
3.10 Plotting LD50 data with error bars
3.11 Plotting vectors
3.12 Plot to diagnose multivariate normality
3.13 Observed and Expected frequencies
3.14 Plotting interactions in Factorial ANOVA
3.15 Plotting analysis of proportions data
3.16 Meta analysis and log odds ratios
3.17 Bivariate density surfaces and contours
3.18 Canonical correlations for two groups
3.19 Dendrograms and multivariate cluster analysis
3.20 Classical metric and non-metric scaling
3.21 K-means clustering: example 1
3.22 K-means clustering: example 2
3.23 Principal component scores and loadings
3.24 Principal components scree diagram
3.25 MANOVA profile analysis
3.26 Comparing groups: canonical variates and confidence regions
3.27 Comparing groups: principal components and canonical variates
3.28 Two dimensional biplot for East Jerusalem Households
3.29 Three dimensional biplot for East Jerusalem Households
3.30 Percentage variance from singular value decomposition
3.31 The T4253H data smoother
3.32 Time series before and after differencing
3.33 Time series autocorrelation and partial autocorrelations
3.34 Fitting an ARIMA model to time series data
3.35 Analyzing one set of survival times
3.36 Analyzing two sets of survival times
3.37 Cox regression survivor functions
3.38 Significance level and power
3.39 Noncentral chi-square distribution
3.40 Kernel density estimation
3.41 Fitting initial rates
3.42 Fitting lag times
3.43 Fitting burst kinetics
3.44 Model free curve fitting
3.45 Trapezoidal method for areas/thresholds
3.46 Splines: equally spaced interior knots
3.47 Splines: user spaced interior knots
3.48 Splines: automatically spaced interior knots
4.1 Symbols, fill styles, sizes and widths
4.2 Lines: standard types
4.3 Lines: extending to boundaries
4.4 Text, maths and accents
4.5 Arrows and boxes
4.6 Venn diagrams
4.7 Polygons
4.8 Axes and labels
4.9 Plotting transformed data
4.10 Sizes, shapes and clipping
4.11 Rotating and re-scaling
4.12 Aspect ratios and shearing effects
4.13 Resizing fonts
4.14 Split axes
4.15 Plotting mathematical equations
4.16 Plotting chemical structures
4.17 Chemical formulas
4.18 Perspective in barcharts, box and whisker plots and piecharts
4.19 Advanced bar chart features
4.20 Three dimensional barcharts
4.21 Error bars 1: barcharts
4.22 Error bars 2: skyscraper and cylinder plots
4.23 Error bars 3: slanting and multiple
4.24 Error bars 4: calculated interactively
4.25 Error bars 4: binomial parameters
4.26 Error bars 5: log odds
4.27 Error bars 6: log odds ratios
4.28 Clusters and connections
4.29 Correlations and scattergrams
4.30 Confidence ellipses for a bivariate normal distribution
4.31 95% confidence regions
4.32 Dendrograms 1: standard format
4.33 Dendrograms 2: stretched format
4.34 Dendrograms 3: plotting subgroups
4.35 Dendrograms 3: plotting subgroups
4.36 K-means clustering for UK airports
4.37 Highlighting K-means cluster centroids
4.38 Principal components
4.39 Labelling statistical graphs
4.40 Probability distributions
4.41 Survival analysis
4.42 Goodness of fit to a Poisson distribution
4.43 Trinomial parameter joint confidence contours
4.44 Random walks
4.45 Power as a function of sample size
4.46 Three dimensional plotting
4.47 The objective function at solution points
4.48 Sequential sections across best fit surfaces
4.49 Contour diagram for Rosenbrock optimization trajectory
4.50 Space curves and projections
4.51 Projecting space curves onto planes
4.52 Three dimensional scatter plot
4.53 Two dimensional families of curves
4.54 Three dimensional families of curves
4.55 Phase portraits of plane autonomous systems
4.56 Orbits of differential equations
4.57 Deconvolution 1: Graphical deconvolution of complex models
4.58 Deconvolution 2: Fitting convolution integrals
4.59 Extrapolation
4.60 Models with cross over points
4.61 Plotting single impulse functions
4.62 Plotting periodic impulse functions
4.63 Flow cytometry
4.64 Subsidiary figures as insets
4.65 Growth curves
4.66 Ligand binding species fractions
4.67 Immunoassay and dose-response dilution curves
4.68 r = r(θ) parametric plot 1. Eight leaved Rose
4.69 r = r(θ) parametric plot 2. Logarithmic Spiral with Tangent
Part 1

Overview
SIMFIT is a free package for simulation, curve fitting, plotting, statistics, and numerical analysis, supplied in compiled form for end-users, and source form for programmers. It runs in Windows and also Linux (under Wine). The academic version is free for student use, but the professional version has more features, and uses the NAG library DLLs.

Applications
analysis: inverses, eigenvalues, determinants, SVD, zeros, quadrature, optimization
biology: allometry, growth curves, bioassay, flow cytometry
biochemistry: ligand binding studies, cooperativity analysis, metabolic control modelling
biophysics: enzyme kinetics, initial rates, lag times, asymptotes
chemistry: chemical kinetics, complex equilibria
ecology: Bray-Curtis similarity dendrograms, K-means clusters, principal components
epidemiology: population dynamics, parametric and nonparametric survival analysis
immunology: nonlinear calibration with 95% x-prediction confidence limits
mathematics: plotting phase portraits, orbits, 3D curves or surfaces
medicine: power and sample size calculations for clinical trials
pharmacology: dose response curves, estimating LD50 with 95% confidence limits
pharmacy: pharmacokinetics, estimating AUC with 95% confidence limits
physics: simulating and fitting systems of differential equations
physiology: solute transport, estimating diffusion constants
statistics: data exploration, tests, fitting generalized linear models

Summary
SIMFIT consists of some forty programs, each dedicated to a special set of functions such as fitting specialized models, plotting, or performing statistical analysis, but the package is driven from a program manager which also provides options for viewing results, editing files, using the calculator, printing files, etc. SIMFIT has on-line tutorials describing available functions, and test data files are provided so all that first time users need to do to demonstrate a program is to click on a [Demo] button, select an appropriate data set, then observe the analysis. Results are automatically written to log files, which can be saved to disk, or browsed interactively so that selected results can be printed or copied to the clipboard. SIMFIT data sets can be stored as ASCII text files, or transferred by the clipboard into SIMFIT from spreadsheets. Macros are provided (e.g., simfit4.xls) to create files from data in MS Excel, and documents are supplied to explain how to incorporate SIMFIT graphics into word processors such as MS Word, or how to use PostScript fonts for special graphical effects.


SIMFIT has many features such as: wide coverage, great versatility, fast execution speed, maximum likelihood estimation, automatic fitting by user-friendly programs, constrained weighted nonlinear regression using systems of equations in several variables, or the ability to handle large data sets. Students doing statistics for the first time will find it very easy to get started with data analysis, such as doing t or chi-square tests, and advanced users can supply user defined mathematical models, such as systems of differential equations and Jacobians, or sets of nonlinear equations in several independent variables for simulation and fitting. SIMFIT also supports statistical power calculations and many numerical analysis techniques, such as nonlinear optimization, finding zeros of n functions in n variables, integrating n functions of m variables, calculating determinants, eigenvalues, singular values, matrix arithmetic, etc.

1.1 Installation
The latest details for installation and configuration will be found in the files install.txt and configure.txt which are distributed with the package. A summary follows.

The SIMFIT installation program simfit_setup.exe
This can be obtained from http://www.simfit.man.ac.uk and it contains the whole SIMFIT package together with documentation and test files. You can uninstall any existing SIMFIT installations before installing if you want, but the installation program will simply overwrite any existing SIMFIT files, as long as they do not have the read-only attribute. You should install the package in the default top-level SIMFIT folder, say C:\Program Files\Simfit, by double clicking on the installation program and accepting the default options, unless you have very good reasons not to. The installation program will create binary, demonstration, documentation, results, and user sub-folders.

The SIMFIT driver w_simfit.exe
You can make a desktop shortcut to the SIMFIT driver (w_simfit.exe) in the SIMFIT binary subfolder if you want to drive the package from a desk-top icon.

The SIMFIT top-level folder
There should be no files at all in the SIMFIT top-level folder.

The SIMFIT auxiliary programs
You can specify your own editor, clipboard viewer, and calculator or use the Windows defaults. To read the manual you must install the Adobe Acrobat Reader, and for professional PostScript graphics hardcopy you should also install the GSview/Ghostscript package.

The SIMFIT configuration options
Run the driver then use the [Configure], [Check] and [Apply] buttons until all paths to auxiliary files are correct. The [Check] option will tell you if any files cannot be located and will search your computer for the correct paths and filenames if necessary.

The Spanish language version of SIMFIT can be obtained from http://simfit.usal.es.

1.2 Documentation
There are several sources of documentation and help. Each individual program has a tutorial describing the functions provided by that program, and many of the dedicated controls to the more complex procedures have a menu option to provide help. Also there is a help program which provides immediate access to further information about SIMFIT procedures, test files, and the readme files (which have technical information for more advanced users). However, the main source of information is the reference manual which is provided in PostScript (w_manual.ps), and portable document format (w_manual.pdf). Advice about how to use this manual follows.


The SIMFIT manual is in five parts.

Part 1: This summarizes the functions available, but does not explain how to use SIMFIT. A very brief overview of the package is given, with collages to illustrate the graphical possibilities, and advice to help with installation.

Part 2: This guides the first time user through some frequently used SIMFIT procedures, such as creating data files, performing curve fitting, plotting graphs, or simulating model systems. Anybody interested in exploiting the functionality provided by the SIMFIT package would be well advised to work through the examples given, which touch upon many of the standard procedures, but with minimal theoretical details.

Part 3: This takes each SIMFIT procedure and explains the theory behind the technique, giving worked examples of how to use the test files provided with the package to observe the analysis of correctly formatted data. Before users attempt to analyze their own data, they should read the description of how that particular technique performs with the test data provided. It should be obvious that, if SIMFIT fails to read or analyze a user-supplied data set but succeeds with the test data supplied, then the user-supplied data file is not formatted correctly, so the test files provided should be consulted to understand the formatting required. Suggestions as to how users might employ SIMFIT to analyze their own data are given.

Part 4: This explains how to use the more advanced SIMFIT plotting functions. This is where users must turn in order to find out how to get SIMFIT to create specialized graphs, and anybody interested in this aspect should browse the example plots displayed in part 4.

Part 5: This contains several appendices dealing with advanced features and listing all the programs and files that make up the SIMFIT package. There are sections which outline the library of mathematical and statistical models and display the necessary equations, describe the syntax required to develop user-defined models giving numerous examples, explain how to edit PostScript files to create special graphical effects, list all SIMFIT programs and test files, and discuss interfaces to other software, like GSview/Ghostscript and Microsoft Office.

1.3 Plotting
SIMFIT has a simple interface that enables users to create default plots interactively, but it also has an advanced users interface for the leisurely sculpturing of masterpieces. To get the most out of SIMFIT graphics, users should learn how to save ASCII text coordinate files for selected plots, bundle them up into library files or project archives to facilitate plotting many graphs simultaneously, and create configuration files to act as templates for frequently used plotting styles. SIMFIT can drive any printer and is able to create graphics files in all formats. PostScript users should save the SIMFIT industry standard encapsulated PostScript files (*.eps), while others should save Windows enhanced metafiles (*.emf). In order to give some idea of the type of plots supported by SIMFIT, figures 1.1, 1.2, and 1.3 should be consulted. These demonstrate typical plots as collages created by SIMFIT program editps from SIMFIT PostScript files.


Figure 1.1: Collage 1. [Graphic not reproduced in this text version; the collage panels include: Binding Curve for the α2β2 isoform at 21°C; Scatchard Plot for the α2β2 isoform; Data Smoothing by Cubic Splines; Goodness of Fit to a Normal Distribution; Inhibition Kinetics: v = f([S],[I]); Absorbance, Enzyme Activity and pH; Plotting a Surface and Contours for z = f(x,y); Three Dimensional Bar Chart; Survival Analysis (Kaplan-Meier Estimate and MLE Weibull Curve); Best Fit Line and 95% Limits; SIMFIT 3D plot for z = x² - y²; x(t), y(t), z(t) curve and projection onto y = -1; Binomial Probability Plot for N = 50, p = 0.6; Using CSAFIT for Flow Cytometry Data Smoothing; Using GCFIT to fit Growth Curves; Log Odds Plot; ANOVA power as a function of sample size (k = no. groups, n = no. per group).]


Figure 1.2: Collage 2. [Graphic not reproduced in this text version; the collage panels include: A kinetic study of the oxidation of p-Dimethylaminomethylbenzylamine (reaction scheme, rate equations and the x(t), y(t), z(t) curves); Trinomial Parameter 95% Confidence Regions; Orbits for a System of Differential Equations; Using SIMPLOT to plot a Contour Diagram; Using QNFIT to fit Beta Function pdfs and cdfs; Phase Portrait for Lotka-Volterra Equations; Deconvolution of 3 Gaussians; Illustrating Detached Segments in a Pie Chart; Box and Whisker Plot (Range, Quartiles and Medians); Bar Chart Features; 1-Dimensional Random Walk; 3-Dimensional Random Walk.]


Figure 1.3: Collage 3. [Graphic not reproduced in this text version; the collage panels include: K-Means Clusters; Contours for Rosenbrock Optimization Trajectory; Diffusion From a Plane Source; Slanting and Multiple Error Bars; Simfit Cylinder Plot with Error Bars.]

Part 2

First time users guide


2.1 The main menu
The menu displayed in figure 2.1 will be referred to as the main SIMFIT menu.
Figure 2.1: The main SIMFIT menu. [Graphic not reproduced in this text version. The menu bar reads: File, Edit, View, Fit, Calibrate, Plot, Statistics, Area/Slope, Simulate, Modules, Help, A/Z, Results. Below the banner (A package for Simulation, Curve fitting, Graph plotting, and Statistical Analysis; W. G. Bardsley, University of Manchester, U.K.; http://www.simfit.man.ac.uk) is a task bar with buttons: Manual, FAQ, Recent, Editor, Explorer, Calculator, Configure.]

From this you can select from pop-up menus according to the functions required, and then you can choose which of the forty or so SIMFIT programs to use. When you proceed to run a program you will be in an isolated environment, dedicated to the chosen program. On exit from the chosen program you return to the main SIMFIT menu. If you get lost in program sub-menus and do not know where you are, use the closure cross which is always activated when SIMFIT is in table display mode.


A brief description of the menus and task bar buttons will now be given.

File: This option is selected when you want to create a data file by typing in your own data, or transforming clipboard data or text files with data tables from a spreadsheet. You can also define a set of data files for a library file.

Edit: This option is selected when you want to edit a data file, or create a graphics file from a PostScript file.

View: This option is selected when you want to view any ASCII text files, such as test files, data files, results files, model files, etc. A particularly useful feature is to be able to view lists of files analyzed and files created in the current session. Also, if GSview/Ghostscript or some other PostScript browser has been installed, you can view PostScript files, such as the SIMFIT figures and manuals. Adobe Acrobat can also be used to view *.pdf files.

Fit: From this option you can fit things like exponentials, binding models or growth curves, using dedicated user-friendly programs, or you can fit model-free equations, like polynomials or splines. Advanced users can do comprehensive curve fitting from libraries or user supplied equations.

Calibrate: Choosing this option allows you to perform calibration using lines, polynomials (gentle curves), logistic polynomials (sigmoid curves), cubic splines (complicated curves), or deterministic models (if a precise mathematical form is required). You can also analyze dose response curves for minimum values, half saturation points, half times, IC50, EC50, or LD50 estimates.

Plot: Some explanation is required concerning this option. All the SIMFIT programs that can generate graphical display do so in such a way that a default graph is created, and there are limited options for editing. At this stage you can drive a printer or plotter or make a graphics file, but the output will only be of draft quality. To sculpture a graph to your satisfaction then obtain publication quality hardcopy, here is what to do: either transfer directly to advanced graphics or, for each data set or best-fit curve plotted, save the corresponding coordinates as ASCII text files. When you have such a set of files, you are ready to select the graph plotting option. Read the ASCII coordinate files into program simplot and edit until the graph meets your requirements. Then print it or save it as a graphics file. PostScript files are the best graphics files and, to re-size these, rotate, make collages, overlays, insets and so on, you input them into program editps.

Statistics: SIMFIT will do all the usual statistical tests, but the organization is very different from any statistics package. That is because SIMFIT is designed as a teaching and research tool to investigate statistical problems that arise in curve fitting and mathematical modelling; it is not designed as a tool for routine statistical analysis. Nevertheless, the structure is very logical; there are programs designed around specific distributions, and there is a program to help you find your way around the statistics options and do all the usual tests. So, if you want to know how to do a chi-square or t-test, analyze a contingency table, perform analysis of variance or carry out nonparametric testing, just select program simstat. It tells you about tests by name or by properties, and it also does some very handy miscellaneous tasks, such as exhaustive analysis of a sample, multiple correlations and statistical arithmetic. For many users, the program simstat is the only statistics program they will ever need.

Area/Slope: Many experimental procedures call for the estimation of initial rates, lag times, final asymptotes, minimum or maximum slopes or areas under curves (AUC). A selection of programs to do these things, using alternative methods, is available from this option.

Simulate: If you want to simulate data for a Monte Carlo study, or for graph plotting, program makdat creates exact data from a library of models, or from a user-supplied model. Random error to simulate an experiment can then be added by program adderr. There is program deqsol for simulating and fitting systems of nonlinear differential equations, program makcsa for simulating flow cytometry experiments, and program rannum for generating pseudo random numbers.


Modules: From this menu you can use your own specified editor, explorer, SIMFIT modules or, in fact, any chosen Windows program. There are specialized SIMFIT modules which can be accessed using this option.

Help: From this menu you can run the SIMFIT help program and obtain technical data about the current release.

A/Z: This provides a shortcut to named programs in alphabetical order.

Results: This option allows you to view, print or save the current and ten most recent results files. Note that the default is f$result.txt, but you can configure SIMFIT so that, each time you start a program, you can specify if the results should be stored on a named log file. All SIMFIT results files are formatted ready to be printed out with all tables tabbed and justified correctly for a monospaced font, e.g., Courier. However, at all stages during the running of a SIMFIT program, a default log file is created so that you can always copy selected results to the clipboard for pasting into a word processor. The main menu task bar also has buttons to let you view or print any ASCII text file, such as a data or results file.

2.2 The task bar


At the bottom of the main SIMFIT menu will be found a task bar, which is provided to facilitate the interface between SIMFIT and other programs.

Manual: This option allows you to open the pdf version of the SIMFIT manual in Adobe Acrobat. The pdf manual has book-marks, and extensive hyperlinks between the contents, list of figures, and index, to facilitate on-line use. You should open the SIMFIT manual at the start of a SIMFIT session, then keep the manual open/minimized on the main Windows task bar, so that it is always ready for reference. The manual contains details of all the SIMFIT numerical procedures, statistical test theory, mathematical equations and analytical and plotting facilities, and has more details than the SIMFIT help program.

FAQ: This option allows you to run the frequently asked questions section of the SIMFIT help program which gives useful advice about the SIMFIT procedures but is not so comprehensive as the reference manual, which should be consulted for details, and to see worked examples.

Recent: This option allows you to view, save, print, or copy to the clipboard your recent files analyzed, files created, or SIMFIT results files.

Editor: This option opens your chosen text editor program, which you can specify using the [Configure] option. There are many excellent free text editors, such as emacs, which can be specified, and which are far more powerful than Windows Notepad. You should never edit any SIMFIT test file or model file and, to protect against this, experienced users could decide to make all the SIMFIT test files read-only.

Explorer: This option opens your chosen disk explorer program, which you can specify using the [Configure] option.

Calculator: This option opens your chosen calculator program, which you can specify using the [Configure] option.

Configure: This option starts up the SIMFIT configuration procedure. Use this to configure SIMFIT to your own requirements. Note that, if you select the [Check] option and SIMFIT reports missing files, you can specify files exactly or just provide a search path. The [Apply] option must be used to change the configuration by creating a new configuration file w_simfit.cfg.



Some things you can specify from this option are as follows.
- Switching on or off the displaying of start-up messages, or saving of results files.
- Suppressing or activating warning messages.
- Changing the size of fonts in menus, or the percentage of screen area used for plotting.
- Checking for consistent paths to auxiliary programs, and locating incorrectly specified files.
- Altering the specification of auxiliary programs and modules.
- Adjusting the variable colors on the SIMFIT color palette.
- Setting the default symbols, line-types, colors, fill-styles and labels for plotting.
- Defining mathematical constants that are used frequently for data transformation.

2.3 The file selection control


The SIMFIT file dialogue control is displayed in figure 2.2.

Figure 2.2: The SIMFIT file selection control. [Graphic not reproduced in this text version. The control has top-level File, Edit, View, and Help menus, an Open... edit box showing a full path such as C:\Program Files\Simfit\dem\normal.tf1, and buttons: OK, Browse, Analyzed, Created, Paste, Demo, NAG, Back <<, Next >>, Swap_Type, with a status line such as "Step from Analyzed file list item 1".]

This control helps you to create new files (i.e., in the Save As... mode) or analyze existing files (i.e., in the Open... mode). The top level [File], [Edit], [View], and [Help] menus allow you to select appropriate test files to use for practice or browse to understand the formatting. Below this is an edit box and a set of buttons which will now be described.

File Name: You can type the name of a file into the edit box but, if you do this, you must type in the full path. If you just type in a file name you will get an error message, since SIMFIT will not let you create files in the SIMFIT folder, or in the root, to avoid confusion.

OK: This option indicates that the name in the edit box is the file name required.

Browse: This option simply transfers you to the Windows control but, when you know how to use the SIMFIT file selection control properly, you will almost never use the Windows control.



Analyzed: This history option allows you to choose from a list of the last files that SIMFIT has analyzed, but the list does not contain files recently saved.

Created: This history option allows you to choose from a list of the last files that SIMFIT has created, but the list does not contain files recently analyzed. Of course, many files will first be created then subsequently analyzed, when they would appear in both Analyzed and Created lists.

Paste: This option is only activated when SIMFIT detects ASCII text data on the clipboard and, if you choose it, then SIMFIT will attempt to analyze the clipboard data. If the clipboard data are correctly formatted, SIMFIT will create a temporary file, which you can subsequently save if required. If the data are not properly formatted, however, an error message will be generated. When highlighting data in your spreadsheet to copy to the clipboard, write to a comma delimited ASCII text file, or use a macro like simfit4.xls, you must be very careful to select the columns for analysis so that they all contain exactly the same number of rows.

Demo: This option provides you with a set of test files that have been prepared to allow you to see SIMFIT in action with correctly formatted data. Obviously not all the files displayed are consistent with all the possible program functions. With programs like simstat, where this can happen, you must use the [Help] option to decide which file to select. When you use a SIMFIT program for the first time, you should use this option before analyzing your own data.

NAG: This option provides you with a set of test files that have been prepared to allow you to use SIMFIT to see how to use NAG library routines.

Back: This option allows you to scroll backwards through recent files and edit the filename, if required, before selecting.

Next: This option allows you to scroll forwards through recent files and edit the filename, if required, before selecting.

Swap Type: This option toggles between Created and Analyzed file types.

If you name files sensibly, like results.1, results.2, results.3, and so on, and always give your data short meaningful titles describing the data and including the date, you will find the [Back], [Next], [Created] and [Analyzed] buttons far quicker and more versatile than the [Browse] pipe to Windows.

2.3.1 Multiple file selection


It often happens that users need to select multiple files. Examples could be:
- collecting a set of graphics files together in order to create a composite graph in simplot;
- selecting a family of vector files for analysis of variance or correlation analysis;
- building a consistent package of PostScript files to generate a collage using editps, or
- gathering together results files for fitting several model functions to experimental data in qnfit.

The problem with the Windows multiple file selection protocol is that it does not offer a convenient mechanism for selecting subsets from a pool of related files, nor does it provide the opportunity to select files of restricted type, vector files only, for instance. The SIMFIT library file method is the best technique to submit a selected set of files for repeated analysis, but it is not so versatile if users want to add or subtract files to a basic set interactively, which requires the project technique.



2.3.1.1 The project technique

The SIMFIT project technique has been developed to meet these needs. Where the multiple selection of files is called for, a menu is presented offering users the opportunity to input a library file, or a set of individually chosen files, or to initiate a project. The project technique provides these opportunities:
- choosing individual files by the normal procedure;
- selecting files by multiple file selection using the shift and control keys;
- deleting and restoring/moving individual files;
- suppressing multiple files from the project, or
- harvesting files from a project list.
2.3.1.2 Checking and archiving project files

Before files are accepted into a project, a quick check is undertaken to ensure that files are consistent with the type required. Further, users are able to add files to project lists interactively after files have been created, e.g., after saving ASCII text coordinate files to replot using simplot. The project archives for recent files are as follows:

a_recent.cfg: any type of file
c_recent.cfg: covariance matrix files
f_recent.cfg: curve fitting files
g_recent.cfg: graphics ASCII coordinate files
m_recent.cfg: matrix files for statistics
p_recent.cfg: encapsulated PostScript files
v_recent.cfg: vector files for statistics

Files added to a project archive are kept in the order of addition, which sometimes permits duplication but keeps files grouped conveniently together for multiple selection. Search paths and file types can be set from the normal file selection control and missing files are deleted from the archives.



2.4 First time users guide to data handling


Data must be supplied as tables of numerical data (with no missing values) in ASCII text format, as will be clear by using the [View] button on the main SIMFIT menu to browse the test files. Such files can be created using any text editor but are best made by using the SIMFIT editors, or transferring data from a spreadsheet using the clipboard and maksim, or a macro such as simfit4.xls. First observe the notation used by SIMFIT when creating data tables. Scientific/computer notation is used for real numbers, where E+mn means 10 to the power mn. Examples: 1.23E-02 = 0.0123, 4.56E+00 = 4.56 and 7.89E+04 = 78900.0. This notation confuses non-scientists and inexperienced computer users, but it has the advantage that the numbers of significant figures and orders of magnitude can be seen at a glance. However, correlation coefficients and probabilities are output in decimal notation to four decimal places, which is about the limit for meaningful significance tests, while integers are usually displayed as such. Note that formatting of input data is not so strict: you can use any formatting convention to represent numbers in your own data tables, as SIMFIT converts all data supplied into double precision numbers for internal calculations. For instance: 1, 1.0, 1.0E+00, 10.0E-01, and 0.1E+01 are all equivalent. Either commas or spaces can be used to separate numbers in input lists, but commas must not be used as decimal points in data files. For instance: 1 2 3 and 1,2,3 are equivalent.

2.4.1 The format for input data files


The SIMFIT data file format is as follows.
a) Line 1: informative title for the data set (≤ 80 characters)
b) Line 2: no. rows (m) and no. columns (n) (dimensions of the data set)
c) Lines 3 to m + 2: block of data elements (as an m by n matrix)
d) Line m + 3: number of further text lines (k)
e) Lines m + 4 to m + 3 + k: extra text controlling program operation or describing the data.
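As a concrete illustration of this layout, a minimal file with a 3 by 2 data block and one trailing text line could look as follows (the title and values are invented for illustration; this is not one of the distributed test files):

Illustrative data for a 3 by 2 matrix
    3    2
 1.0  5.0
 2.0  7.0
 3.0  9.0
    1
This extra line describes the experiment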

2.4.2 File extensions and folders


SIMFIT does not add file extensions to file names, nor is it sensitive to the data file extensions. So you should use the extension .txt if you want your text editor or word processor to create, or read SIMFIT files. However, you will be prompted to use accepted file extensions (.eps, .jpg, .bmp) for graphics files, and SIMFIT will refuse to open executable files (.exe, .dll, .bat, .obj, .com), or create files in the root directory (e.g., C:\), or the SIMFIT folder (e.g., C:\Program Files\Simfit). Data files should be given meaningful names, e.g., data.001, data.002, data.003, data.004, first.set, second.set, third.set, fourth.set, etc., so that the names or extensions are convenient for copying/deleting.

2.4.3 Advice concerning data files


a) Use an informative title for your data and include the date of the experiment.
b) The first extra text line controls some programs, e.g., calcurve in expert mode, but most programs ignore the extra text. This is where you enter details of your experiment.
c) If you enter a vector into programs makmat/editmt, do not rearrange into increasing or decreasing order if you wish to do run, paired t or any test depending on natural order.
d) With big data sets, make small files (makfil/makmat), then join them together (editfl/editmt).

2.4.4 Advice concerning curve fitting files


a) Use makfil to make a main master file with x and y values and with all s = 1. Keep replicates in the natural order in which they were made to avoid bias in the run test.
b) Enter all your data, not means of replicates, so the run and sign tests have maximum power and the correct numbers of degrees of freedom are used in statistics. If you do not have sample variances from replicates, try an appropriate multiple of the measured response (7%?) as a standard deviation estimate. Nothing is saved by using means and standard errors of means but, if you do this, the parameter estimates will be alright, but statistics will be biased.


c) To change values, units of measurement, delete points, add new data, change weights, etc., input the main master file into editfl.
d) If you have single measurements (or fewer than five replicates, say), fit the main master file with all s = 1 and compare the result with s = 7%|y|, say, obtained using editfl.
e) If you have sufficient replicates (five or more, say) at each x, input the main master file into editfl and generate a file with s = sample standard deviations. Compare with results from, e.g., smoothed weighting.
f) For files with means and standard deviations, or means with 95% confidence limits for error bars, use editfl on the data, or generate them interactively from replicates using the graphics [Advanced] option.
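For points e) and f), the following hedged Python sketch shows the kind of summary editfl generates from replicates; the numbers are invented for illustration:

```python
# Means and sample standard deviations from replicates, the raw
# material for an error-bar or weighting file.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
replicates = np.array([[0.9, 1.1, 1.0],      # y replicates at each x,
                       [1.8, 2.1, 2.2],      # kept in the order measured
                       [3.1, 2.9, 3.0]])

means = replicates.mean(axis=1)
s = replicates.std(axis=1, ddof=1)           # sample standard deviations
for xi, mi, si in zip(x, means, s):
    print(f"x = {xi:.2f}  mean = {mi:.3f}  s = {si:.3f}")
```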

2.4.5 Example 1: Making a curve fitting file


Select makfil and request to create a file, say fivexy.1st, containing five pairs of x, y values, choosing to set all s = 1. When asked, give the data an informative title such as ...Five x,y values..., and then proceed to type in the following five x, y values (which contain a deliberate mistake).

   x    y
   1    1
   2    2
   3    3
   4    5
   5    4

When finished, request a graph, which will clearly show the mistake as the dotted line in figure 2.3, namely, that the y values are reversed at the last two x values. Of course you could correct the mistake at this stage but, to give you an excuse to use the curve fitting file editor, you should now save the file fivexy.1st in its present form.

Figure 2.3: Five x,y values

2.4.6 Example 2: Editing a curve fitting file


Read the file fivexy.1st that you have just created into editfl, and ask to create a new file, say fivexy.2nd. Then change the values of y at lines 4 and 5 so that the y values are equal to the x values. Now you should see the perfectly straight continuous line of figure 2.3 instead of the bent dotted line. You are now finished but, before you exit, please note some important features of program editfl.
1) This editor takes in an old file which is never altered in any way.
2) After editing to change the data and title, a new file is created with the edited data.
3) In this way your original data are never lost, and you can always delete the original file when you are sure that the editing is correct.
4) There are a vast number of powerful editing options, such as fusing files, changing baselines, scaling into new units, weighting, and creating means and standard errors or error bar files from groups of replicates.

2.4.7 Example 3: Making a library file


Select maklib and make a library file, say mylib.1st, with the title ...Two data sets... and containing the two files you have just made. Browse this file in the SIMFIT file viewer and you will discover that it looks like the following:

Two data sets
fivexy.1st
fivexy.2nd

You could have made this file yourself with the edit or notepad Windows tools, as it is just a title and two filenames. Now, to appreciate the power of what you have just done, select program simplot, choose a standard x, y plot and read in this library file, mylib.1st, to get a plot like figure 2.3.


2.4.8 Example 4: Making a vector/matrix file


Select makmat and request to make a vector file called, for instance, vector.1st, with a title, say ...Some numbers between 0 and 1..., then type in ten numbers between 0 and 1, for example:

0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5

Save this file, then make another one called, for example, vector.2nd with the numbers

0.975, 0.95, 0.925, 0.9, 0.85, 0.8, 0.75, 0.7, 0.6, 0.5

We shall use these two vector files later to practise statistical tests.

2.4.9 Example 5: Editing a vector/matrix file


Read the file called fivexy.1st, which you made previously, into editmt and do the same editing that you did with editfl to correct the mistake. Now you will be able to appreciate the similarities and differences between makfil/editfl and makmat/editmt: makfil/editfl have dedicated functions to handle files with column 1 in increasing order and column 3 positive, while makmat/editmt can handle arbitrary matrices and vectors.

2.4.10 Example 6: Saving data-base/spread-sheet tables to files


Since spread-sheet and data-base programs can write out tables in ASCII text format, it is easy to transform them into SIMFIT style. For instance, read fivexy.2nd into maksim and, after discarding the two header and trailer lines, you can create a data file in SIMFIT format. maksim is much more than just a utility for re-formatting: it can also select sub-sets of data for statistical analysis according to the following rules.
a) Only hard returns on the input file can act as row separators.
b) Non-printing characters, except hard returns, act as column separators.
c) Spaces or commas are interpreted as column separators, and double commas are interpreted as bracketing an empty column.
d) Each row of the input table is regarded as having as many columns as there are words separated by commas or spaces.
e) Commas must not be used as decimal points or as thousands separators in numbers. For example, use 0.5 (not 0,5 or 1/2) for a half, and 1000000 (not 1,000,000) for a million.
f) Single column text strings must be joined and cannot contain spaces or commas. For example, use strings like Male.over.40 (not Male over 40) for a label or cell entry.
g) Simple tables of numbers can be entered directly, but titles, row and column counters, trailing text and the like must be deleted until every row has the same number of columns before selection of sub-matrices can commence.
h) Row and column entries can be selected for inclusion in an output file as long as both the row and column Boolean selection criteria are satisfied. To achieve this it is often best to start by globally suppressing all rows and columns and then including as required, e.g., columns 3 and 4, all rows with Smith in column 1, all rows with values between 40 and 60 in column 2.
In order to understand the functionality provided by maksim you should create some tables using a text editor such as Windows notepad, then copy to the clipboard and read into maksim. There are also two special test files, maksim.tf1 and maksim.tf2, that are designed to exploit and illustrate some of the procedures available in maksim. Note that program maksim does not allow editing; for that you use your text editor. It does have a useful visual interface for browsing smallish ASCII text tabular data, so that you can see at any stage what the sub-matrix of selected data looks like. Like all SIMFIT editors, it will never discard or overwrite your primary data file.
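The splitting rules a)-d) can be mimicked outside SIMFIT. This rough Python sketch (not maksim's actual source) shows the idea, including the empty column produced by a double comma:

```python
# Rows split on hard returns; columns split on spaces, tabs or commas;
# a double comma yields an empty ("") column entry.
import re

raw = "Male.over.40,1.2,,3.4\nFemale.under.40 5.6 7.8 9.0"

table = []
for line in raw.split("\n"):                  # rule a): rows = hard returns
    cells = re.split(r",|\s+", line.strip())  # rules b)-d): separators
    table.append(cells)

for row in table:
    print(row)
```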


2.5 First time users guide to graph plotting


There are three basic graphical operations, as follows.
1) Obtaining a set of coordinates to be plotted. Every SIMFIT program that creates graphs lets you print a default graph, or save ASCII coordinate files, which are tables of coordinates. This phase is easy, but name files systematically.
2) Transferring the graph to a peripheral. If you have a PostScript printer, use the SIMFIT driver, not the Windows driver. If not, drive your printer directly or, for more options, use PostScript output with GSview/Ghostscript.
3) Including graphics files in documents. Bitmap files (*.bmp) are fine for photographs and histology slides, but are inappropriate for scientific graphs, since they are large and give poor resolution when printed. Vector files (e.g., PostScript files) are better, as they are compact, easily edited, and give publication quality hardcopy. Windows word processing packages can import PostScript files but, to view them on screen, you may have to use GSview/Ghostscript to add a preview section. The best graphics files for Windows users with no PostScript facilities are enhanced metafiles (*.emf), not the obsolete Windows metafiles (*.wmf).

2.5.1 The SIMFIT simple graphical interface


The SIMFIT simple graphical interface, displayed in figure 2.4 for data in gcfit.tf2 (best fit logistic curve and asymptote obtained using gcfit), provides the options now described.

Figure 2.4: The SIMFIT simple graphical interface

Help: This provides a short tutorial about SIMFIT graphics.
Edit: This provides only very simple options for editing the graph.
Advanced: This lets you create ASCII coordinate files, which can be added to your project archive for retrospective use, and is the most powerful option in the hands of experienced users. Alternatively, you can transfer directly into simplot for immediate editing.
PS: This creates a temporary PostScript file which can be viewed, saved to file, printed using GSview, or copied directly to a PostScript printer, for the highest possible quality.
Windows: This is provided for users who do not have PostScript printing facilities. Only three of the options should be contemplated: printing as a high resolution bitmap, copying to the clipboard, or saving as an enhanced Windows metafile *.emf (not as *.wmf).
Cancel: This returns you to the executing program for further action.

2.5.2 The SIMFIT advanced graphical interface


The SIMFIT advanced graphical interface, displayed in figure 2.5, results from transferring the data from figure 2.4 directly into simplot, providing further options, as now described.

Figure 2.5: The SIMFIT advanced graphical interface

The advanced graphics top level options


Menu: This provides for editing from a menu without having to redraw the graph after each edit, and is designed to save time for advanced users plotting many large data sets.
Titles: This allows you to edit the plot titles.
Legends: This allows you to edit the plot legends.
Labels: This allows you to change the range of data plotted, and alter the number or type of tick marks and associated labels.
Style: This allows you to alter the aspect ratio of the plot and perform clipping. A graph paper effect can be added to aid the placement of graphical objects, and offsets or frames can be specified.
Data: This allows you to change line or symbol types, add or suppress error bars, edit current data values, add new data sets, or save edited data sets.
Colors: This allows you to specify colors.
Transform: This allows you to specify changes of coordinates. Note that there is a specific title and set of plot legends for each transformation, so it makes sense to choose a transformation before editing the titles or legends.
Configure: This allows you to create configuration files containing all the details from editing the current plot, or you can read in existing configuration files from previous editing to act as templates.

The advanced graphics right hand options


Text: This allows you to select a text string to label features on the graph.
T = #: This indicates which text string is currently selected. Only one string can be selected at any time.
A/L/B: This allows you to select an arrow, line, or box to label features on the graph.
A = #: This indicates which arrow, line, or box is currently selected. Only one arrow, line, or box can be selected at any time.
Object: This allows you to select a graphical object to label features of the graph.
O = #: This indicates which graphical object is currently selected. Only one graphical object can be selected at any time.
Panel: This allows you to specify an information panel linking labels to line types, symbols, fill-styles, etc., to identify the data plotted.
X = #: This is the X coordinate for the current hot spot.
Y = #: This is the Y coordinate for the current hot spot.

The advanced graphics left hand options


The left hand buttons in figure 2.5 allow you to move graphical objects about. The way this works is that the red arrow can be dragged anywhere on the graph, and its tip defines a hot spot with the coordinates just discussed. This hot spot is coupled to the currently selected text, arrow, line, box or graphical object, and also to the left hand buttons. Help and hardcopy are also controlled by left hand buttons. Note that the appropriate right hand buttons must be used to make a specific text string, arrow, line, box, or graphical object the selected one before it can be coupled to the hot spot. Also observe that, to drag a horizontal outline box in order to surround a text string, the head and tail moving buttons are coupled to opposite corners of the horizontal rectangle.

>Txt: Move the selected text (if any) to the hot spot.
>A/L/B^: Move the selected arrow, line or box head (if any) to the hot spot.
>A/L/B_: Move the selected arrow, line or box tail (if any) to the hot spot.
>Obj: Move the selected graphical object (if any) to the hot spot.
>Pnl: Move the information side panel (if any) to the hot spot.
Help: This gives access to a menu of topics on specific subjects.
PS: PostScript hardcopy, as for the PostScript option with simple graphics.
Win: Windows hardcopy, as for the Windows option with simple graphics.
Quit: This prompts you to save a configuration file, then closes down the current graph.

2.5.3 PostScript, GSview/Ghostscript and SIMFIT


SIMFIT creates EPS standard PostScript files in a special format to facilitate introducing maths, or symbols like pointing hands or scissors (ZapfDingbats), retrospectively. Note the following advice.
a) The default display uses TrueType fonts which are not exactly the same dimensions as PostScript fonts, so text strings on the display and in the PostScript hardcopy will have identical starting coordinates, orientation, and color, but slightly differing sizes. Before making PostScript hardcopy, check the PostScript display to make sure that the text strings are not under- or over-sized.
b) To get the full benefit from PostScript, install the GSview/Ghostscript package, which drives all devices from PostScript files.
c) Save ASCII coordinate files from the default plot or transfer directly into simplot.
d) Collect your selected ASCII coordinate files into a set and input them individually or collectively (as a library file or project archive) into program simplot.
e) Edit the graph on screen until it is in the shape and form you want, using the ordinary alphabet and numbers at this stage where you want Greek, subscripts, superscripts, etc.
f) Now edit to change characters into subscripts, Greek, maths, as required, or add accents like acute, grave, tilde, and finally save or print.
g) With the file, you can make slides, posters, incorporate it into documents, create hardcopy of any size or orientation (using program editps), transform it into another format, view it on screen, drive a non-PostScript printer, etc.

The SIMFIT PostScript driver interface


The SIMFIT PostScript driver interface, displayed in figure 2.6, which can be used from either the simple or advanced graphics controls, provides the options now described.

Figure 2.6: The SIMFIT PostScript driver interface

Terminal: This allows you to select the terminal to be used for PostScript printing, usually LPT1. If a non-PostScript printer is attached to this terminal, the ASCII text of the PostScript file will be printed, not the PostScript plot. To prevent this happening, Terminal 0 should be selected to lock out the printing option.
Shape: This allows you to switch between portrait and landscape, but it also offers enhanced options where you can stretch, slide or clip the PostScript output without changing the aspect ratio of the fonts or plotting symbols. This is very useful with crowded graphs such as dendrograms, as explained further on page 248, or with map plotting, as discussed on page 250, where aspect ratios have to be altered so as to be geometrically correct.
X,Y-offset: This allows you to set a fixed offset if the defaults are not satisfactory, but it is better to leave the defaults and edit retrospectively using editps.
Scale axes: This allows you to set a fixed scaling factor if the defaults are not satisfactory, but it is better to leave the defaults and edit retrospectively using editps.
Line width: This allows you to set a fixed line width for all hardcopy, i.e. both Windows and PostScript, if the default is not satisfactory. However, note that the relative line widths are not affected by this setting and, if extreme line widths are selected, SIMFIT will re-set line widths to the defaults on start up.
Font: This allows you to set the default font type, which would normally be Helvetica for clarity, or Helvetica Bold for presentation purposes such as slides.
File: This allows you to create a PostScript file, but it is up to you to add the extension .eps to indicate that the file will be in the encapsulated PostScript format with a BoundingBox.
View: This allows you to visualize the PostScript file using your PostScript viewer, which would normally be GSview/Ghostscript.
Print: This allows you to copy the PostScript file directly to a PostScript printer.
Quit: This returns you to the executing program for further action.

Note that all the parameters controlling PostScript output are written to the file w_ps.cfg, and a detailed discussion of SIMFIT PostScript features will be found in the appendix (page 338).


2.5.4 Example 1: Creating a simple graph


From the main SIMFIT menu select [Plot], then simplot, and choose to create a graph with standard x, y axes. Input the library file w_simfig1.tfl, which identifies simplot.tf1, simplot.tf2, and simplot.tf3, which now display as figure 2.7. Here, simplot has used defaults for the title, legends, plotting symbols, line types and axes. This is how simplot works. Every graphical object, such as a data set, a file with error bars, a best-fit curve, etc., must be contained in a correctly formatted ASCII plotting coordinates text file. Normally these would be created by the SIMFIT programs, and you would make a library file with the objects you want to plot. Then you would choose the shape, style, axes, titles, plotting symbols, line-types, colors, extra text, arrows, accents, math symbols, Greek, subscripts, superscripts and so on and, when the plot is ready, a printer would be driven or a graphics file created. Finally, before quitting the graph, a configuration file would be written for re-use as a template. To see how this works, read in the configuration file w_simfig1.cfg to get figure 2.8, while with w_simfig2.cfg you will get the corresponding Scatchard plot.

Figure 2.7: The simplot default graph

Figure 2.8: The finished plot and Scatchard transform

2.5.5 Example 2: Error bars


Figure 2.8 has three objects: means and error bars in simplot.tf1, with best fit curves for two possible models in simplot.tf2 and simplot.tf3. Actually, you can make a file like simplot.tf1 yourself. The file mmfit.tf4 contains curve fitting data, and you make a file with means and 95% confidence limits from this using program editfl. The procedure used in SIMFIT is always to do curve fitting using all replicates, not means. Then, when a plot is needed, a file with error bars is generated. To see how this is done, choose the [Edit] option from the main menu, read mmfit.tf4 into program editfl, then select the option to create a file with means and error bars for plotting, and create an error bar file like simplot.tf1 from the replicates in mmfit.tf4. This illustrates a very important principle in SIMFIT: you never input data consisting of means from replicates into SIMFIT programs. So, if you calculate means from groups of replicates yourself, you are doing something wrong, as SIMFIT always performs analysis using complete data sets. For instance, means with error bars for plotting can always be calculated on demand from replicates (arranged in nondecreasing order), e.g., using the [Data] option in program simplot.


2.5.6 Example 3: Histograms and cumulative distributions


To illustrate histograms we will use normal.tf1, with fifty random numbers from a normal distribution (μ = 0, σ = 1) generated by program rannum. Choose [Statistics] from the main menu, then simstat, and pick the option to test if a sample is from a normal distribution. Read in normal.tf1, create a histogram with twelve bins between -3.0 and 3.0, then display the plots as in figure 2.9.
Figure 2.9: A histogram and cumulative distribution

A best fit pdf curve can also be created using qnfit and a pdf file, e.g., with error bars calculated using a binomial distribution, as in figure 2.9. Note that, whereas the histogram is easy to interpret but has an ambiguous shape, the cumulative distribution has a fixed shape but is featureless. There is much to be said for showing best fit pdfs and cdfs side by side, as in figure 2.9, since, in general, statistical tests for goodness of fit of a pdf to a distribution should be done on the cumulative distribution, which does not suffer from the ambiguity associated with histograms. However, trends in data are more easily recognized in histograms from large samples than in cumulative distributions, i.e., stair step plots.

2.5.7 Example 4: Double graphs with two scales


Frequently different scales are required, e.g., in column chromatography, with absorbance at 280nm representing protein concentration, at the same time as enzyme activity eluted and the pH gradient. Table 2.1 is typical, where absorbance could require a scale of zero to unity, while enzyme activity uses a scale of zero to eight, and pH could be on a scale of six to eight.

Fraction Number   Absorbance   Enzyme Activity   Buffer pH
       1             0.0            0.1             6.0
       2             0.1            0.3             6.0
       3             1.0            0.2             6.0
       4             0.9            0.6             6.0
       5             0.5            0.1             6.2
       6             0.3            0.8             6.7
       7             0.1            1.5             7.0
       8             0.3            6.3             7.0
       9             0.4            8.0             7.0
      10             0.2            5.5             7.0
      11             0.1            2.0             7.2
      12             0.1            1.5             7.5
      13             0.3            0.5             7.5
      14             0.6            1.0             7.5
      15             0.9            0.5             7.5

Table 2.1: Data for a double graph

If absorbance and activity were plotted on the same scale, the plot would be dominated by activity, so you could change the units of enzyme activity to be compatible with the absorbance scale. However, to illustrate how to create a double graph, a plot with absorbance on the left hand axis, and enzyme activity and pH together on the right hand axis, will be constructed. Obviously this requires three separate objects, i.e., files for program simplot. You could create the following files using program makmat and the data in table 2.1.
File 1: the first column together with the second column (as in plot2.tf1)
File 2: the first column together with the third column (as in plot2.tf2)
File 3: the first column together with the fourth column (as in plot2.tf3)
Select program simplot and choose to make a double graph. Input the first file (absorbance against fraction) scaled to the left hand axis, with the other two scaled to the right hand axis, to get the left panel of figure 2.10. To transform the left plot into the finished product on the right panel in figure 2.10, proceed as follows:
a) Edit the overall plot title and both plot legends.
b) Edit the data ranges, notation and offset on the axes.
c) Edit the three symbol and line types corresponding to the three files.
d) Include an information panel and edit the corresponding keys.
e) Choose HelveticaBold when creating the final PostScript file.

Figure 2.10: Plotting a double graph with two scales

2.5.8 Example 5: Bar charts

Figure 2.11: Typical bar chart features (panels: Binomial Samples (p=0.5, size=20); Box and Whisker Plot)

Figure 2.11 was created by simplot using the Advanced Bar Chart option with barchart.tf1 and barchart.tf5. The first plot illustrates groupings, while the second is a box and whisker plot with ranges, quartiles, and median. To see the possibilities, plot and browse the barchart.tf? test files. An easy way to prepare advanced bar chart files is to read in a matrix then save an advanced bar chart file. You can supply labels on the file, and change the position of the horizontal axis (the x-axis) to create hanging bar effects.

2.5.9 Example 6: Pie charts

Figure 2.12: Typical pie chart features (panels: Pie Chart Fill Styles; Illustrating Detached Segments in a Pie Chart)

Figure 2.12 was produced using the Advanced Pie Chart plotting option in simplot with piechart.tf1 and piechart.tf2. By consulting the w_readme.* files or browsing these test files, the conventions for fill styles, labels, colors, segment offsets, etc. will be obvious. An easy way to create advanced pie chart files is to read in a vector with positive segment sizes then save an advanced pie chart file.

2.5.10 Example 7: Surfaces, contours and 3D bar charts


Figure 2.13: Plotting surfaces, contours and 3D-bar charts (panels: Plotting a Surface and Contours for z = f(x,y); Three Dimensional Bar Chart)

Figure 2.13 illustrates a surface plot made using surface.tf1 with the Surface/Contour option in program simplot, together with a three dimensional bar chart resulting from barcht3d.tf1 (after editing the legends). Surface plotting requires a mathematical expression for z = f(x, y), and the program makdat should be used, since it is too tedious to type in sufficient data to generate a smooth surface. Three dimensional bar chart files do not usually require so much data, so they can easily be typed in using program makmat. The format for surface plotting files will be found in the w_readme.* files. You will find that, once a surface file has been input into simplot, it is possible to plot just the surface, contours only, surface with contours, or a skyscraper plot. There are also many features to rotate the plot, change the axes, edit the legends, choose colors, add special effects, etc. Run all the surface.tf? files to appreciate the possibilities.


2.6 First time users guide to curve fitting


Linear regression is trivial and gives unique solutions, but constrained nonlinear regression is extremely complicated and does not give unique solutions. So you might ask: why bother with nonlinear regression? The answer is that nature is nonlinear, so nonlinear regression is the only approach open to honest investigators. Sometimes it is possible to transform data and use linear techniques, as in generalized linear interactive modelling (GLM), but this just bypasses the central issue: finding a mathematical model derived using established physical laws, and involving constants that have a well defined physical meaning. Logistic regression, for instance, involving fitting a polynomial to transformed data, may seem to work, but the polynomial coefficients have no meaning. Estimating rate constants, on the other hand, allows comparisons to be made of kinetic, transport or growth processes under different treatments, and helps to explain experimental results in terms of processes such as diffusion or chemical reaction theory. Nonlinear regression involves the following steps.

1. Obtaining data for responses yi, i = 1, 2, ..., n at exactly known values of fixed variables xi.
2. Estimating weighting factors wi = 1/si^2, where the si are standard deviations, or smoothed estimates, obtained by analyzing the behaviour of replicates at the fixed xi values if possible, or si = 1 otherwise.
3. Selecting a sensible deterministic model from a set of plausible mathematical models for

   yi = f(xi, Θ) + εi,

   where Θ = θ1, θ2, ..., θk are parameters, and εi are uncorrelated errors with zero mean.
4. Choosing meaningful starting estimates for the unknown parameter vector Θ.
5. Normalizing the data so that internal parameters, objective function and condition number of the Hessian matrix are of order unity (in internal coordinates) at the solution point.
6. Assessing goodness of fit by examining the weighted residuals

   ri = (yi - f(xi, Θ^)) / si,

   where Θ^ is the best fit parameter vector.
7. Investigating parameter redundancy by examining the weighted sum of squared residuals

   WSSQ = Σ (i = 1 to n) ri^2,

   and the estimated parameter variance-covariance matrix.

Curve fitting is controversial, so the SIMFIT philosophy will be stated at this point. Weighting should only be attempted by users who have at least four replicates per design point and are prepared to investigate the relative effects of alternative weighting schemes. Caution is needed when interpreting goodness of fit statistics, and users should demand convincing evidence before concluding that models with more than, say, four parameters are justified. If there are no parameter constraints, a modified Gauss-Newton or Levenberg-Marquardt method can be used, but if constraints are required a sequential quadratic programming or quasi-Newton method should be used. You must have good data over a wide range to define asymptotes etc., fit all replicates, not means, and use sensible models and starting estimates.
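Steps 6 and 7 are easy to reproduce for any candidate model. Here is a minimal Python sketch with an invented model, data and best-fit parameters, purely for illustration:

```python
# Weighted residuals r_i and WSSQ for an assumed model f and
# best-fit parameter vector theta_hat (all values illustrative).
import numpy as np

def f(x, theta):
    return theta[0] * np.exp(-theta[1] * x)      # illustrative model only

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.05, 0.71, 0.28, 0.12])
s = np.array([0.05, 0.04, 0.03, 0.02])           # standard deviations
theta_hat = np.array([2.0, 1.0])

r = (y - f(x, theta_hat)) / s                    # weighted residuals
wssq = np.sum(r**2)                              # weighted sum of squares
ndof = len(y) - len(theta_hat)
print(f"WSSQ = {wssq:.4f}, WSSQ/NDOF = {wssq/ndof:.4f}")
```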


2.6.1 User friendly curve fitting programs


Unfortunately, the xi cannot be fixed exactly, the wi have to be estimated, we are never certain that f(.) is the correct model, experimental errors are not uncorrelated and normally distributed, and WSSQ minimization is not guaranteed to give a unique or sensible solution with nonlinear models. Nevertheless SIMFIT has these linear and nonlinear regression programs that greatly simplify model fitting and parameter estimation.

linfit: linear/multi-linear regression and generalized linear modelling (GLM)
exfit: sum of exponentials, choosing from 6 possible types (unconstrained)
gcfit: growth models (exponential, monomolecular, Richards, Von Bertalanffy, Gompertz, logistic, Preece-Baines) with or without constant terms (unconstrained)
hlfit: sum of high/low affinity ligand binding sites with a constant term (constrained)
mmfit: sum of Michaelis-Menten functions (constrained)
polnom: polynomials (Chebyshev) in sequence of increasing degree (unconstrained)
rffit: positive n:n rational functions (constrained)
sffit: saturation function for positive or negative binding cooperativity (constrained)
csafit: flow cytometry histograms with stretch and shift (constrained)
inrate: Hill-n/Michaelis-Menten/line/quadratic/lag-phase/monomolecular (unconstrained)

The user-friendly nonlinear regression programs calculate starting estimates, scale data into internal coordinates, then attempt to minimize the objective function WSSQ/NDOF, which has expectation 1 with the correct model and weights. However, if incorrect models or weights are used, or WSSQ/NDOF is very much smaller than 1.0E-6 or larger than 1.0E6, the programs may not converge. If you have insufficient replicates to estimate weights and have set s = 1, the programs do unweighted regression, replacing the chi-square test by calculation of the average cv%

cv% = 100 sqrt(WSSQ/NDOF) / ((1/n) Σ (i = 1 to n) |yi|), where NDOF = n - no. of parameters.

These programs must be supplied with all observations, not means of replicates, or else biased statistics will be output. After fitting, options are available to plot residuals, identify outliers, calculate error bars interactively from groups of replicates (arranged in nondecreasing order), etc.
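The average cv% above is straightforward to compute once WSSQ is known; in this Python sketch the unweighted WSSQ and data values are assumed purely for illustration:

```python
# Average cv% for an unweighted fit (all s = 1); WSSQ is assumed here.
import numpy as np

y = np.array([2.05, 0.71, 0.28, 0.12])
wssq = 0.95                          # assumed unweighted WSSQ from a fit
ndof = len(y) - 2                    # NDOF = n - no. of parameters

cv_percent = 100.0 * np.sqrt(wssq / ndof) / np.mean(np.abs(y))
print(f"cv% = {cv_percent:.2f}")
```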

2.6.2 IFAIL and IOSTAT error messages


As curve fitting is iterative, you are likely to encounter error messages of the following kinds. IFAIL errors flag computational failure, and will occur if the data lead to a singularity in evaluating some expression. For instance, the formula y = 1/x will lead to overflow when x becomes so small that y would exceed the largest number allowed on your computer. IFAIL messages look like this:

FATAL : IFAIL = 1 from C05AZF/ZSOLVE.

which means that a fault leading to IFAIL = 1 on exit from C05AZF has occurred in subroutine ZSOLVE. The order of severity of SIMFIT error messages is

ADVICE < CAUTION < WARNING << FATAL

followed by self-explanatory text. If a nonzero IFAIL value is returned and you want to know what it means, you can look it up in the NAG library handbook by simply searching for the routine on the web. For instance, searching for C05AZF should lead you to an appropriate web document, e.g., http://www.nag.co.uk/numeric/fl/manual/C05/C05azf.pdf, where you can find out what C05AZF does and what the IFAIL message indicates. To exit from executing programs use the closure control, which is always activated when SIMFIT is in table displaying mode. IOSTAT errors flag input or output failure, and will occur if a program is unable to read data correctly from files, e.g., because the file is formatted incorrectly, end-of-file has been encountered (leading to negative IOSTAT), or data has become corrupted, e.g., with letters instead of numbers.


2.6.3 Example 1: Exponential functions


Graphs for the exponential function (page 59)

f(t) = A1 exp(-k1 t) + A2 exp(-k2 t) + ... + An exp(-kn t) + C    (2.1)

are shown in figure 2.14. Note that all these curves can be fitted by exfit using equation 2.1, but with different strategies for initial parameter estimates and scaling, depending on curve type.

Figure 2.14: Alternative types of exponential functions (Type 1: exponential decay; Type 2: exponential decay to a baseline; Type 3: exponential growth; Type 4: exponential growth from a baseline; Type 5: up-down exponential; Type 6: down-up exponential)

To practise, choose [Fit] from the main menu, then select exfit and read in exfit.tf4, which has data for two exponentials of type 1. Choose type 1, lowest order 1, highest order 2, and a short random search, then watch. You will see exfit attempt to find starting estimates by analyzing the data for potential parameter values, then refining these by a random search, before proceeding to normalize into internal coordinates, optimize, estimate parameters and compare the fit with one and two exponentials. After fitting, you will see a plot like figure 2.15, with the data and the best fit 1- and 2-exponential functions. For further practise, try fitting exfit.tf5 using type 5, and exfit.tf6 using type 6. There are six types because type 1 is required for independent, unlinked processes, e.g., time-dependent denaturation of two different isoenzymes; type 3 represents the complement, i.e., the amount of inactivated enzymes; while the constant in types 2 and 4 adds a baseline correction. In pharmacokinetics and many chemical schemes the coefficients and time constants are not independent, for instance in consecutive chemical reactions with models 5 and 6 which, of course, must have at least two exponential terms.

Figure 2.15: Using exfit to fit exponentials
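Outside SIMFIT, equation 2.1 with n = 2 can be fitted by any general nonlinear least squares routine. The following hedged Python sketch uses scipy with simulated data; exfit's own random search and internal scaling strategy is considerably more elaborate, and convergence here depends on the starting estimates supplied:

```python
# Fitting a sum of two type 1 exponentials (equation 2.1 with C = 0).
import numpy as np
from scipy.optimize import curve_fit

def two_exp(t, A1, k1, A2, k2):
    return A1 * np.exp(-k1 * t) + A2 * np.exp(-k2 * t)

rng = np.random.default_rng(0)
t = np.linspace(0.05, 1.75, 30)
y = two_exp(t, 1.0, 1.0, 1.0, 10.0) * (1 + 0.05 * rng.standard_normal(30))

popt, pcov = curve_fit(two_exp, t, y, p0=[1.0, 0.5, 1.0, 5.0])
print("A1, k1, A2, k2 =", popt)
```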


2.6.4 Example 2: Nonlinear growth and survival curves


Figure 2.16: Typical growth curve models (Model 1: unlimited exponential growth; Model 2: limited exponential growth; Model 3: sigmoidal growth)

Three growth curves (page 66) are shown in figure 2.16. Model 1 is exponential growth, which is only encountered in the early phase of development; Model 2 is limited exponential growth, concave down to an asymptote, fitted by the monomolecular model (Type 3 of figure 2.14); and several models can fit sigmoidal profiles as for Model 3 in figure 2.16, e.g., the logistic equation 2.2

f(t) = A / (1 + B exp(-kt)).    (2.2)

Select [Fit] from the main menu, then gcfit, to fit growth curves. Input gcfit.tf2, then fit models 1, 2 and 3 in sequence to obtain figure 2.17, i.e., the exponential model gives a very poor fit, the monomolecular model leads to an improved fit, but the logistic is much better. This is the usual sequence of fitting with gcfit, but it does much more. It can fit up to ten models sequentially and gives many statistics, such as maximum growth rate, to assist advanced users. The reason why there are alternative models, such as those of Gompertz, Richards, Von Bertalanffy, Preece and Baines, etc., is that the logistic model is often too restrictive, being symmetrical about the mid point, so generating a biased fit. The table of compared fits displayed by gcfit helps in model selection; however, none of these models can accommodate turning points, and all benefit from sufficient data to define the position of the horizontal asymptote.

Figure 2.17: Using gcfit to fit growth curves
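For comparison, a minimal Python sketch of fitting the logistic model of equation 2.2 directly is given below; gcfit additionally fits and compares a whole sequence of growth models, which this sketch does not attempt. The simulated data and starting estimates are illustrative:

```python
# Fitting the logistic growth model f(t) = A/(1 + B exp(-k t)).
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, A, B, k):
    return A / (1.0 + B * np.exp(-k * t))

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 25)
y = logistic(t, 1.0, 10.0, 1.0) + 0.02 * rng.standard_normal(25)

popt, _ = curve_fit(logistic, t, y, p0=[1.0, 5.0, 0.5])
print("A, B, k =", popt)
```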

Program gcfit also fits survival models using the Weibull distribution, or maximum likelihood to allow for censoring where a sample of random survival times is available. Sometimes a differential equation, such as the Von Bertalanffy allometric equation 2.3

dx/dt = A x^α - B x^β,    (2.3)

or a system of differential equations, has to be fitted using program deqsol. No library of growth and survival models could ever be complete, so you may have to write your own model, as described in the appendix and the w_readme.* files.


2.6.5 Example 3: Enzyme kinetic and ligand binding data


For simple kinetics mmfit is used to fit the Michaelis-Menten equation (page 64)

v = Vmax [S] / (Km + [S]),    (2.4)

while sffit (page 64) fits saturation as a function of ligand concentration according to

y = Z K[x] / (1 + K[x]) + C,    (2.5)

which allows Z to have the fixed value 1 (for fractional saturation), has association rather than dissociation constants, and permits a baseline correction term. To practise fitting equation 2.4, use mmfit with the mmfit.tf? files, while to practise fitting equation 2.5, use sffit with the sffit.tf? files. With accurate data over a wide range, it is often important to see if higher order models are justified, indicating multiple binding sites. For n isoenzymes or independent sites (page 64) the pseudo steady-state rate equation is

v = Vmax(1) [S] / (Km(1) + [S]) + Vmax(2) [S] / (Km(2) + [S]) + ... + Vmax(n) [S] / (Km(n) + [S]),    (2.6)

while n types of low and high affinity binding sites (page 63) requires

y = α1 K1[x] / (1 + K1[x]) + α2 K2[x] / (1 + K2[x]) + ... + αn Kn[x] / (1 + Kn[x]) + C.    (2.7)

Program mmfit fits equation 2.6, while hlfit fits equation 2.7, e.g., to hlfit.tf4, giving a plot like figure 2.8. By simulation you will find that it is only with high quality data that two sites (n = 2) can be differentiated from one (n = 1), and that higher order cases (n > 2) can almost never be justified on statistical grounds. If you suspect two or more classes of sites with differing affinities, prepare data as accurately as possible over as wide a range as is practicable, normalize so that C = 0, and use mmfit to fit equation 2.6 with successive orders n = 1, then n = 2. Take the possibility of multiple sites seriously only if it is supported by the statistical tests performed by mmfit. Often a single macromolecule has n sites which are not independent, but communicate with each other, giving rise to positive or negative cooperative binding, as in this equation

y = Z (φ1[x] + 2 φ2[x]^2 + ... + n φn[x]^n) / (n (1 + φ1[x] + φ2[x]^2 + ... + φn[x]^n)) + C.    (2.8)

Program sffit (page 64) should then be used to fit equation 2.8, where it is preferable to normalize so that C = 0. The scaling factor Z would be fixed at Z = 1 if y(∞) = 1, but otherwise Z can be estimated. Note that in equation 2.6 the Km(i) are dissociation constants, in equation 2.7 the Ki are individual association constants, but in equation 2.8 the φi are overall, not individual, association constants. These conventions are described in the tutorial for program sffit.

Kinetic measurements have zero rate at zero substrate concentration, but this is not always possible with ligand binding experiments. Displacement assays, for instance, always require a baseline to be estimated, which can have serious consequences, as figure 2.18 illustrates. If a baseline is substantial, then the correct procedure is to estimate it independently and subtract it from all values, in order to fit using C = 0. Alternatively, hlfit can be used to estimate C in order to normalize to zero baseline. Figure 2.18 is intended to serve as a warning as to the possible misunderstandings that can arise with a small baseline correction that is overlooked or not corrected properly. It shows plots of

y = (1 - t)x / (1 + x) + t

for the cases with t = 0.05 (positive baseline), t = 0 (no baseline) and t = -0.05 (negative baseline). The plots cannot be distinguished in the original space, but the differences near the origin are exaggerated in Scatchard space, giving the false impression that two binding sites are present. Decisions as to whether one or more binding sites are present should be based on the statistics calculated by SIMFIT programs, not on the plot shapes in transformed axes.

Figure 2.18: Original plot and Scatchard transform

When initial rates are determined accurately over a wide range of substrate concentration, deviations from Michaelis-Menten kinetics are usually observed, in the form of substrate activation or substrate inhibition. The appropriate equation for this situation is the rational function

f(x) = (α0 + α1 x + α2 x^2 + ... + αn x^n) / (β0 + β1 x + β2 x^2 + ... + βn x^n),    (2.9)

where normally β0 = 1, αi ≥ 0 and βj ≥ 0. Such positive rational functions have many applications, ranging from fitting data from activator or inhibitor studies to data smoothing, and the program to use is rffit (page 64). Users should note that rffit is a very advanced program for fitting equation 2.9, and it is imperative to understand the order n of the equation required, and any special structure dictated by fixing the coefficients. For instance, steady state data has v = 0 when [S] = 0, so it would be logical to set α0 = 0. Again, substrate inhibition would normally require αn = 0. Making such constraints considerably facilitates the curve fitting. Practise fitting equation 2.9 with the test files rffit.tf?, reading the file titles to determine any parameter constraints. Figure 2.19 illustrates a substrate inhibition curve and the semilogarithmic transform, which is the best way to view such fits.
Figure 2.19: Substrate inhibition plot and semilog transform
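To see the qualitative difference between equation 2.4 and the 2:2 case of equation 2.9, the following Python sketch evaluates a 1:1 function alongside a 2:2 rational function with α0 = 0 and α2 = 0, the substrate inhibition structure discussed above; all coefficients are illustrative:

```python
# A 1:1 function (Michaelis-Menten) versus a 2:2 rational function
# with alpha0 = alpha2 = 0, so that v -> 0 as S -> infinity
# (substrate inhibition); coefficients are illustrative only.
import numpy as np

def one_one(S, Vmax, Km):
    return Vmax * S / (Km + S)

def two_two(S, a1, b1, b2):
    return a1 * S / (1.0 + b1 * S + b2 * S**2)   # beta0 fixed at 1

S = np.linspace(0.01, 3.0, 8)
print(one_one(S, 1.0, 0.5).round(3))
print(two_two(S, 1.0, 1.0, 1.0).round(3))        # rises then falls
```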


2.7 First time users guide to simulation


Simulation in SIMFIT involves creating exact data and then adding pseudo-random error to simulate experimental error. Exact data can be generated from a library of models or from user-defined models.

2.7.1 Why fit simulated data?


Statistical tests used in nonlinear regression are not exact, and optimization is not guaranteed to locate the best-fit parameters. This depends on the information content of your data and parameter redundancy in the model. To see how reliable your results are, you could perform a sensitivity analysis, to observe how results change as the data set is altered, e.g., by small variations in parameter values. For instance, suppose hlfit concludes that ligand binding data require two binding sites by an F or run test, but only one binding constant is accurately determined, as shown by a t test. You can then use makdat to make an exact data set with the best-fit parameters found by hlfit, use adderr to simulate your experiment, then fit the data using hlfit. If you do this repeatedly, you can collect weighted sums of squares, parameter estimates, t and F values and run test statistics, so judging the reliability of the result with your own data. This is a Monte Carlo method for sensitivity analysis.

2.7.2 Programs makdat and adderr


makdat makes exact f(x), g(x, y) or h(x, y, z) data. You choose from a library of models or supply your own model, then input parameter values and calculate, e.g., y = f(x) for a range Xstart ≤ x ≤ Xstop, either by fixing Xstart and Xstop, or by choosing Ystart = f(Xstart) and Ystop = f(Xstop), then allowing makdat to find appropriate values for Xstart and Xstop. You must provide starting estimates for Xstart and Xstop to use the second method, which means that you must understand the mathematics of the model. With complicated models or differential equations, fix Xstart and Xstop and observe how the graph of y = f(x) changes as you change parameters and/or end points. When you have a good idea where your end points lie, try the option to fix y and calculate Xstart and Xstop. This is needed when f(x) is not monotonic in the range of interest. Output files from program makdat contain exact data, which can then be input into program adderr to add random errors.

2.7.3 Example 1: Simulating y = f(x)


Φ(x) = (1/√(2π)) ∫ from -∞ to x of exp(-t^2/2) dt

Figure 2.20: The normal cdf

The procedure is to select a model equation, set the model parameters to fixed values, decide on the range of x to explore, then plot the graph and save a file if appropriate. For example, run the program makdat, select functions of one variable, pick statistical distributions, choose the normal cdf, decide to have a zero constant term, set the mean p(1) = 0, fix the standard deviation p(2) = 1, input the scaling factor p(3) = 1, and then generate figure 2.20. Observe that there are two distinct ways to choose the range of x values: you can simply input the first and last x values, or you can input the first and last y values and let makdat find the corresponding x values numerically. This requires skill and an understanding of the mathematical behavior of the function chosen. Once you have simulated a model satisfactorily you can save a curve fitting type file, which can be used by program adderr to add random error.
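The same exact data can of course be generated outside SIMFIT; a short Python sketch of the example above (zero constant term, mean p(1) = 0, standard deviation p(2) = 1, scaling factor p(3) = 1):

```python
# Exact data for the normal cdf, mimicking the makdat example above.
import numpy as np
from scipy.stats import norm

x = np.linspace(-3.0, 3.0, 13)
y = norm.cdf(x, loc=0.0, scale=1.0)   # p(1) = 0, p(2) = 1, scaling 1
for xi, yi in zip(x, y):
    print(f"{xi:6.2f}  {yi:.5f}")
```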


To illustrate the process of finding a range of x for simulation when this depends on fixed values of y, that is, to find x = x(y) when there is no simple explicit expression for x(y), consider figure 2.21. Here the problem is to find x = x(y) when y is the solution to the Von Bertalanffy growth differential equation

dy/dx = A y^m - B y^n,

where A > 0, B > 0 and n > m. After setting the parameters A = B = m = 1, n = 2, and the initial condition y0 = 0.001, for instance, program makdat estimated the following results:

Xstart = 0, Xstop = 5, y1 = 0.1 : x1 = 2.3919
Xstart = 0, Xstop = 9, y2 = 0.9 : x2 = 6.7679

providing the roots required to simulate this equation between the limits y1 = 0.1 and y2 = 0.9.

Figure 2.21: Using makdat to calculate a range

Note that, when attempting such root-finding calculations, makdat will attempt to alter the starting estimates if a root cannot be located, by decreasing Xstart and increasing Xstop, but it will not change the sign of these starting estimates. In the event of problems locating roots, there is no substitute for plotting the function to get some idea of the position of the roots, as shown in figure 2.21.
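The root-finding idea can be sketched in Python: integrate the differential equation, then solve y(x) = target numerically. The parameters follow the example above, but the roots obtained depend on the initial condition and tolerances, so this is a sketch of the technique rather than a reproduction of makdat's output:

```python
# Find x = x(y) for dy/dx = A*y**m - B*y**n by integrating the ODE
# and then locating where the solution crosses a target y value.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import brentq

A, B, m, n, y0 = 1.0, 1.0, 1.0, 2.0, 0.001

sol = solve_ivp(lambda x, y: A * y**m - B * y**n, (0.0, 15.0), [y0],
                dense_output=True, rtol=1e-9, atol=1e-12)

def root_for(target):
    # assumes exactly one crossing in the integration interval
    return brentq(lambda x: sol.sol(x)[0] - target, 0.0, 15.0)

print("x(y = 0.1) =", root_for(0.1))
print("x(y = 0.9) =", root_for(0.9))
```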

2.7.4 Example 2: Simulating z = f(x, y)


Simulating a function of two variables is very straightforward and, as an example, we shall generate data for the function

z = x^2 - y^2,

illustrated in figure 2.22. Again use makdat, but this time select a function of two variables and then choose a polynomial of degree two. Do not include the constant term, but choose the set of values p(1) = 0, p(2) = 0, p(3) = 1, p(4) = 0 and p(5) = -1. Now choose the extreme values of x = -1 to x = 1 and y = -1 to y = 1 with 20 divisions, i.e., 400 coordinates in all. Note that you can show the data in figure 2.22 as a surface, a contour, a surface with contours, or a bar chart (skyscraper plot). You should plot the wire frame with a monochrome printer, but the facet or patch designs can be used with a color printer. After simulating a surface you can save the coordinates for re-use by program simplot.

Figure 2.22: A 3D surface plot for z = x^2 - y^2
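Generating the same grid of exact coordinates takes a few lines in most environments; a minimal Python sketch:

```python
# A 20 x 20 grid of exact coordinates for z = x**2 - y**2,
# i.e., 400 (x, y, z) triples as in the makdat example above.
import numpy as np

x = np.linspace(-1.0, 1.0, 20)
y = np.linspace(-1.0, 1.0, 20)
X, Y = np.meshgrid(x, y)
Z = X**2 - Y**2
print(Z.shape)          # (20, 20)
```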

2.7.5 Example 3: Simulating experimental error


The output files from program makdat contain exact data for y = f(x), which is useful for graphs or data simulation. You may, however, want to add random error to exact data to simulate experimental error. To do this, the output file then becomes an input file for program adderr. After adding random error, the input file is left unchanged and a new output file is produced:

Model -> makdat -> Exact data -> adderr -> Simulated data


There are numerous ways to use program adderr, including generating replicates. If in doubt, pick 7% constant relative error with 3-5 replicates, as this mimics many situations. Note: constant relative error cannot be used where y = 0 (which invokes a default value). Read the test file adderr.tf1 into program adderr and explore the various ways to add error. In most experiments a useful model for the variance of observations is

V(y) = σ0^2 + σ1^2 y^2,

so that the error resembles white noise at low response levels, with a transition to constant relative error at high response levels. Constant variance (σ1 = 0) fails to account for the way variance always increases as the signal increases, while constant relative error (σ0 = 0) exaggerates the importance of small response values. However, a useful way to simulate error is to simulate four or five replicates with five to ten percent constant relative error, as this is often fairly realistic. Using program adderr you can also simulate the effect of outliers, or use a variety of error generating probability density functions, such as the Cauchy distribution (page 288), which is often a better model for experimental error. Points for plotting can be spaced by a SIMFIT algorithm to ensure continuity under transformations of axes, but to simulate experiments a geometric or uniform spacing should be chosen. Then exact data simulated by program makdat are perturbed by program adderr. This is the method used to create many SIMFIT test files, e.g., mmfit.tf4 from mmfit.tf3, as in figure 2.23. There are many ways to use program adderr, and care is needed to simulate realistically. If constant relative error is used, it is easy to preserve a mental picture of what is going on, e.g., 10% error conveys a clear meaning. However, this type of error generation exaggerates the importance of small y values, biasing the fit in this direction. Constant variance is equally unrealistic, and over-emphasizes the importance of large y values. Outliers can also be simulated.

Figure 2.23: Adding random error
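A rough Python sketch of this idea (not adderr's actual algorithm) perturbs exact data with noise whose variance follows V(y) = σ0^2 + σ1^2 y^2, generating several replicates per point; all values are illustrative:

```python
# Simulated experimental error with variance V(y) = s0**2 + (s1*y)**2,
# giving roughly 7% relative error at large y, with 5 replicates.
import numpy as np

rng = np.random.default_rng(2)
y_exact = np.array([0.1, 0.5, 1.0, 1.5, 2.0])
s0, s1, n_rep = 0.01, 0.07, 5

sd = np.sqrt(s0**2 + (s1 * y_exact)**2)
replicates = y_exact[:, None] + sd[:, None] * rng.standard_normal((5, n_rep))
print(replicates.round(3))
```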

2.7.6 Example 4: Simulating differential equations


The best way to see how this is done is to run deqsol with the library of models provided. These are supplied with default choices of parameters and ranges, so you can quickly see how to proceed. Try, for instance, the Briggs-Haldane scheme, which has five differential equations for substrate, product and enzyme species. The program can also be used for fitting data. So, to explore the fitting options, choose to simulate/fit a system of two differential equations and select the Lotka-Volterra predator-prey scheme

dy1/dx = p1 y1 - p2 y1 y2
dy2/dx = -p3 y2 + p4 y1 y2.

After you have simulated the system of equations and seen how the phase portrait option works, you can try to fit the data sets in the library file deqsol.tfl. More advanced users will appreciate that a valuable feature of deqsol is that the program can simulate and fit linear combinations of the system of equations as defined by a transformation matrix. This is very valuable in areas like chemical kinetics, where only linear combinations of intermediates can be measured, e.g., by spectroscopy. This is described in the w_readme files and can be explored using the test file deqmat.tf1. Figure 2.24 illustrates the simulation of a typical system, the Lotka-Volterra predator-prey equations. After simulating the Lotka-Volterra equations, you can select fitting and read in the library test file deqsol.tfl with predator-prey data for fitting. Much pleasure and instruction will result from using program deqsol with the library models provided and then, eventually, with your own models and experimental data.

Figure 2.24: The Lotka-Volterra equations and phase plane
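For readers who want to check a deqsol simulation independently, this hedged Python sketch integrates the same Lotka-Volterra scheme with illustrative parameter values and initial conditions:

```python
# Simulating the Lotka-Volterra predator-prey equations shown above.
import numpy as np
from scipy.integrate import solve_ivp

p1, p2, p3, p4 = 1.0, 0.01, 1.0, 0.02

def lotka_volterra(x, y):
    y1, y2 = y                                  # prey, predator
    return [p1 * y1 - p2 * y1 * y2,
            -p3 * y2 + p4 * y1 * y2]

sol = solve_ivp(lotka_volterra, (0.0, 10.0), [100.0, 50.0],
                t_eval=np.linspace(0.0, 10.0, 200))
print(sol.y[:, -1])     # final prey and predator values
# plotting sol.y[0] against sol.y[1] gives the phase portrait
```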

2.7.7 Example 5: Simulating user-defined equations


Figure 2.25 illustrates how to use usermod to simulate a simple system of models. First select to simulate a set of four equations, then read in the test file usermodn.tf1, which defines the four trigonometric functions

f(1) = p1 cos x, f(2) = p2 sin x, f(3) = p3 cos 2x, f(4) = p4 sin 2x.

Figure 2.25: Plotting user supplied equations (cos x, sin x, 0.5 cos 2x, 0.5 sin 2x)
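Evaluating such a user-defined set of functions is easy to reproduce; a minimal Python sketch, with the parameter values suggested by the figure legend (p1 = p2 = 1, p3 = p4 = 0.5):

```python
# The four trigonometric model functions defined in usermodn.tf1.
import numpy as np

p = [1.0, 1.0, 0.5, 0.5]
x = np.linspace(-5.0, 5.0, 101)
f = [p[0] * np.cos(x), p[1] * np.sin(x),
     p[2] * np.cos(2 * x), p[3] * np.sin(2 * x)]
print([fi[:3].round(3) for fi in f])
```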

Part 3

Data analysis techniques


3.1 Types of data and measurement scales
Before attempting to analyze your own experimental results you must be clear as to the nature of your data, as this will dictate the possible types of procedures that can be used. So we begin with the usual textbook classification of data types. The classification of a variable X with scores xA and xB on two objects A and B will usually involve one of the following scales.
1. A nominal scale can only have xA = xB or xA ≠ xB, such as male or female.
2. An ordinal scale also allows xA > xB or xA < xB, for instance bright or dark.
3. An interval scale assumes that a meaningful difference can be defined, so that A can be xA - xB units different from B, as with temperature in degrees Celsius.
4. A ratio scale has a meaningful zero point, so that we can say A is xA/xB superior to B if xA > xB and xB ≠ 0, as with temperature in degrees Kelvin.
To many, these distinctions are too restrictive, and variables on nominal and ordinal scales are just known as categorical or qualitative variables, while variables on interval or ratio scales are known as quantitative variables. Again, variables that can only take distinct values are known as discrete variables, while variables that can have any values are referred to as continuous variables. Binary variables, for instance, can have only one of two values, say 0 or 1, while a categorical variable with k levels will usually be represented as a set of k (0,1) dummy variables, where only one can be nonzero at each category level, as in the sketch below. Alternatively, taking a very broad and simplistic view, we could also classify experiments into those that yield objective measurements using scientific instruments, like spectrophotometers, and those that generate numbers in categories, i.e., counts. Measurements tend to require techniques such as analysis of variance or deterministic model fitting, while counts tend to require analysis of proportions or analysis of contingency tables. Of course, it is very easy to find exceptions to any such classifications, for instance modelling the size of a population using a continuous growth model when the population is a discrete and not a continuous variable, but some common sense is called for here.
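The dummy-variable coding mentioned above can be made concrete with a tiny Python sketch; the levels and sample are invented:

```python
# A categorical variable with k levels as k (0,1) indicator columns,
# with exactly one nonzero entry per observation.
import numpy as np

levels = ["red", "green", "blue"]
sample = ["green", "red", "blue", "green"]

dummies = np.array([[1 if s == lv else 0 for lv in levels] for s in sample])
print(dummies)
```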

3.2 Principles involved when fitting models to data


A frequently occurring situation is where an investigator has a vector of n observations y1, y2, ..., yn, with errors εi, at settings of some independent variable xi which are supposedly known exactly, and wishes to fit a model f(x) to the data in order to estimate m parameters θ1, θ2, ..., θm by minimizing some appropriate function of the residuals ri. If the true model is g(x) and a false model f(x) has been fitted, then the relationship


between the observations, errors and residuals ri would be

   yi = g(xi) + εi
   ri = yi − f(xi)
      = g(xi) − f(xi) + εi.

That is, the residuals would be sums of a model error term plus a random error term as follows:

   Model Error = g(xi) − f(xi)
   Random Error = εi.

If the model error term is appreciable, then fitting is a waste of time, and if the nature of the error term is not taken into account, any parameters estimated are likely to be biased. An important variation on this theme is when the control variables xi are not known with high precision, as they would be in a precisely controlled laboratory experiment, but are themselves random variables as in biological experiments, and so best regarded as covariates rather than independent variables. The principle used most often in data fitting is to choose those parameters that make the observations as likely as possible, that is, to appeal to the principle of maximum likelihood. This is seldom possible to do as the true model is generally unknown, so that an approximate model has to be used, and the statistical nature of the error term is not usually known with certainty. A further complication is that iterative techniques must often be employed to estimate parameters by maximum likelihood, and these depend heavily on starting estimates and frequently locate false local minima rather than the desired global minimum. Again, the values of parameters to be estimated often have a physical meaning, and so are constrained to restricted regions of parameter space, which means that constrained regression has to be used, and this is much more problematical than unconstrained regression.

3.2.1 Limitations when fitting models


It must be emphasized that, when fitting a function of k variables such as

   y = f(x1, x2, ..., xk, θ1, θ2, ..., θm)

to n data points in order to estimate m parameters, a realistic approach must be adopted to avoid over-interpretation, as follows.

- Independent variables
  If the data y are highly accurate measurements, i.e. with high signal to noise ratios (page 183), and the variables x can be fixed with high precision, then it is reasonable to regard x as independent variables and attempt to fit models based upon physical principles. This can only be the case in disciplines such as physics and chemistry where the y would be quantities such as absorption of light or concentrations, and the x could be things like temperatures or times. The model would then be formulated according to the appropriate physical laws, such as the law of mass action, and it would generally be based on differential equations.

- Covariates
  In biological experiments, the data y are usually much more noisy and there may even be random variation in the x variables. Then it would be more appropriate to regard the x as covariates and only fit simple models, like low order rational functions or exponentials. In some cases models such as nonlinear growth models could be fitted in order to estimate physically meaningful parameters, such as the maximum growth rate, or final asymptotic size but, in extreme cases, it may only make sense to fit models like polynomials for data smoothing, where the best-fit parameters are purely empirical and cannot be interpreted in terms of established physical laws.

- Categorical variables
  Where categorical variables are encountered then parallel shift models must be fitted. In this case each variable with l levels is taken to be equivalent to l dummy indicator variables which can be either 0 or 1. However one of these is then suppressed arbitrarily to avoid aliasing and the levels of categorical


  variables are simply interpreted as factors that contribute to the regression constant. Clearly this is a very primitive method of analysis which easily leads to over-interpretation where there are more than a couple of variables and more than two or three categories.

In all cases, the number of observations must greatly exceed the number of parameters that are to be estimated, say for instance by a factor of ten.

3.2.2 Fitting linear models


If the assumed model is of the form

   f(x) = θ1 φ1(x) + θ2 φ2(x) + ... + θm φm(x)

it is linear in the parameters θj, and so can be easily fitted by linear regression if the errors are normally distributed with zero mean and known variances σi², since maximum likelihood in this case is equivalent to minimizing the weighted sum of squares

   WSSQ = Σ_{i=1}^{n} [(yi − f(xi))/σi]²

with respect to the parameters. SIMFIT provides model free fitting by cubic splines, simple linear regression as in

   f(x) = θ1 + θ2 x,

multilinear regression

   f(x) = θ1 x1 + θ2 x2 + ... + θm xm,

polynomial regression

   f(x) = θ1 + θ2 x + θ3 x² + ... + θm x^(m−1),

and also transformed polynomial regression, where new variables are defined by

   X = X(x), Y = Y(y)

and a polynomial is fitted to Y(X). Models like these are used for data smoothing, preliminary investigation, and fitting noisy data over a limited range of independent variable. That is, in situations where developing meaningful scientific models may not be possible or profitable. With linear models, model discrimination is usually restricted to seeing if some reduced parameter set is sufficient to explain the data upon using the F test, t tests are employed to check for parameter redundancy, and goodness of fit tends to be based on chi-square tests on WSSQ and normality of studentized residuals. The great advantage of linear regression is the attractively simple conceptual scheme and ease of computation. The disadvantage is that the models are not based on scientific laws, so that the parameter estimates do not have a physical interpretation. Another serious limitation is that prediction is not possible by extrapolation, e.g., if growth data are fitted using polynomials and the asymptotic size is required.
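For readers who want to check the weighted least squares idea numerically, here is a minimal Python sketch (invented data; numpy assumed) that minimizes WSSQ for the simple linear model f(x) = θ1 + θ2 x by scaling each row of the problem by 1/σi.

    # Minimal sketch: weighted linear least squares minimizing
    # WSSQ = sum(((y - X@theta)/sigma)**2).
    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
    sigma = np.array([0.1, 0.1, 0.2, 0.2, 0.3])    # known std. deviations

    X = np.column_stack([np.ones_like(x), x])      # f(x) = theta1 + theta2*x
    w = 1.0 / sigma
    theta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    WSSQ = np.sum(((y - X @ theta) / sigma) ** 2)  # chi-square, n - m dof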

3.2.3 Fitting generalized linear models


These models are mostly used when the errors do not follow a normal distribution, but the explanatory variables are assumed to occur linearly in a model sub-function. The best known example would be logistic regression, but the technique can also be used to fit survival models. Because the distribution of errors may follow a non-normal distribution, various types of deviance residuals are used in the maximum likelihood objective function. Sometimes these techniques have special advantages, e.g., predicting probabilities of success or failure as a function of covariates after binary logistic regression is certain to yield probability estimates between zero and one, because the model

   −∞ < log(y/(1 − y)) < ∞

implies that 0 < y < 1.


3.2.4 Fitting nonlinear models


Many models fitted to data are constructed using scientific laws, like the law of mass action, and so these will usually be nonlinear and may even be of rather complex form, like systems of nonlinear differential equations, or convolution integrals, and they may have to be expressed in terms of special functions which have to be evaluated by numerical techniques, e.g., inverse probability distributions. Success in this area is heavily dependent on having accurate data over a wide range of the independent variable, and being in possession of good starting estimates. Often, with simple models like low order exponentials

   f(x) = A1 exp(−k1 x) + A2 exp(−k2 x) + ... + Am exp(−km x),

rational functions

   f(x) = V1 x/(K1 + x) + V2 x/(K2 + x) + ... + Vm x/(Km + x),

or growth models

   f(x) = A/(1 + B exp(−kx)),

good starting estimates can be estimated from the data and, where this is possible, SIMFIT has a number of dedicated user-friendly programs that will perform all the necessary scaling. However, for experts requiring advanced fitting techniques a special program qnfit is provided.
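A minimal Python sketch of this kind of fit is shown below (scipy assumed, with synthetic data and hand-picked starting estimates); it is only an illustration, not a substitute for the automatic scaling and starting-estimate generation performed by the dedicated SIMFIT programs.

    # Minimal sketch: fitting a double exponential by nonlinear least squares.
    import numpy as np
    from scipy.optimize import curve_fit

    def f(t, A1, k1, A2, k2):
        return A1 * np.exp(-k1 * t) + A2 * np.exp(-k2 * t)

    t = np.linspace(0.05, 1.6, 30)
    rng = np.random.default_rng(0)
    y = f(t, 1.0, 5.0, 1.0, 1.0) * (1 + 0.05 * rng.standard_normal(t.size))

    p, cov = curve_fit(f, t, y, p0=[1.0, 5.0, 1.0, 1.0])
    se = np.sqrt(np.diag(cov))    # approximate parameter standard errors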

3.2.5 Fitting survival models


There are four main techniques used to analyze survival data.

1. Estimates of proportions of a population surviving as a function of time are available by some technique which does not directly estimate the number surviving in a population of known initial size; rather, proportions surviving are inferred by indirect techniques such as light scattering for bacterial density or enzyme assay for viable organisms. In such instances the estimated proportions are not binomial variables so fitting survival models directly by weighted least squares is justified, especially where destructive sampling has to be used so that autocorrelations are less problematical. Program gcfit is used in mode 2 for this type of fitting (see page 67).

2. A population of individuals is observed and information on the times of censoring (i.e. leaving the group) or failure are recorded, but no covariates are measured. In this case, survival density functions, such as the Weibull model, can be fitted by maximum likelihood, and there are numerous statistical and graphical techniques to test for goodness of fit. Program gcfit is used in mode 3 for this type of fitting (see page 176).

3. When there are covariates as well as survival times and censored data, then survival models can be fitted as generalized linear models. The SIMFIT GLM simplified interface module is used for this type of analysis (see page 180).

4. The Cox proportional hazards model does not attempt to fit a complete model, but a partial model can be fitted by the method of partial likelihood as long as the proportional hazards assumption is justified independently. Actually, after fitting by partial likelihood, a piece-wise hazard function can be estimated and residuals can then be calculated. The SIMFIT GLM simplified interface module is used for this type of analysis (page 181).

3.2.6 Distribution of statistics from regression


After a model has been fitted to data it is important to assess goodness of fit, which can only be done if assumptions are made about the model and the distribution of experimental errors. If a correct linear model is fitted to data, and the errors are independently normally distributed with mean zero and known standard


deviation which is used for weighting, then a number of exact statistical results apply. If there are m parameters and n experimental measurements, the sum of weighted squared residuals WSSQ is a chi-square variable with n − m degrees of freedom, the m ratios

   ti = (θ̂i − θi)/ŝi

involving the exact parameters θi, estimated parameters θ̂i, and estimated standard errors ŝi are t distributed with n − m degrees of freedom and, if fitting a model with m1 parameters results in WSSQ1 but fitting the next model in the hierarchy with m2 parameters gives the weighted sum of squares WSSQ2, then

   F = [(WSSQ1 − WSSQ2)/(m2 − m1)] / [WSSQ2/(n − m2)]

is F distributed with m2 − m1 and n − m2 degrees of freedom. When n ≫ m the weighted residuals will be approximately unit normal variables (μ = 0, σ = 1), their signs will be binomially distributed with parameters n and 0.5, the runs minus 1 given n will be binomially distributed with parameters n − 1 and 0.5, while the runs given the number of positive and negative signs will follow a more complicated distribution (page 105). With nonlinear models and weights estimated from replicates at distinct xi, i.e., not known exactly, statistical tests are no longer exact. SIMFIT programs allow you to simulate results and see how close the statistics are to the exact ones. There are programs to evaluate the probability density (or mass) function and the cumulative distribution function for a chosen distribution, as well as calculating percentage points. In addition, you can use program makmat to make files containing your statistics, and these numbers can then be tested to see if they are consistent with the chosen distribution.
3.2.6.1 The chi-square test for goodness of fit

Let WSSQ = weighted sum of squares and NDOF = no. degrees of freedom (no. points − no. parameters). If all s = 1, WSSQ/NDOF estimates the (constant) variance σ². You can compare it with any independent estimate of the (constant) variance of response y. If you had set s = exact std. dev., WSSQ would be a chi-square variable, and you could consider rejecting a fit if the probability of chi-square exceeding WSSQ (i.e., P(χ² ≥ WSSQ)) is < 0.01 (1% significance level) or < 0.05 (5% significance level). Where standard error estimates are based on 3–5 replicates, you can reasonably decrease the value of WSSQ by 10–20% before considering rejecting a model by this chi-square test.
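The probability in question is a one-line calculation; a minimal Python sketch with scipy and invented values for WSSQ and NDOF:

    # Minimal sketch: chi-square goodness-of-fit probability.
    from scipy.stats import chi2

    WSSQ, NDOF = 41.3, 28
    p = chi2.sf(WSSQ, NDOF)       # P(chi-square >= WSSQ)
    if p < 0.01:
        print("consider rejecting the fit at the 1% significance level")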
3.2.6.2 The t test for parameter redundancy

The number T = (parameter estimate)/(standard error) can be referred to the t distribution to assess any parameter redundancy, where P(t ≥ |T|) = P(t ≤ −|T|) = α/2. Two tail p values are defined as p = α, and parameters are significantly different from 0 if p < 0.01 (1%) (or < 0.05 (5%)). Parameter correlations can be assessed from corresponding elements of the correlation matrix.
3.2.6.3 The F test for model discrimination

The F test just described is very useful for discriminating between models with up to 3 or 4 parameters. For models with more than 4 parameters, calculated F test statistics are no longer approximately F distributed, but they do estimate the extent to which model error is contributing to excess variance from fitting a deficient model. It is unlikely that you will ever have data that is good enough to discriminate between nonlinear models with much more than 5 or 6 parameters in any case.
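A minimal Python sketch of the F statistic defined in section 3.2.6, for two nested fits with invented sums of squares (scipy assumed):

    # Minimal sketch: F test for discriminating two nested models.
    from scipy.stats import f as fdist

    WSSQ1, m1 = 58.2, 2           # poorer model, m1 parameters
    WSSQ2, m2 = 31.7, 4           # richer model, m2 parameters
    n = 30                        # number of observations
    F = ((WSSQ1 - WSSQ2) / (m2 - m1)) / (WSSQ2 / (n - m2))
    p = fdist.sf(F, m2 - m1, n - m2)   # small p supports the richer model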
3.2.6.4 Analysis of residuals

The plot of residuals (or better, weighted residuals) against dependent or independent variable or best-fit response is a traditional (arbitrary) approach that should always be used, but many prefer the normal or half normal plots. The sign test is weak and should be taken rather seriously if rejection is recommended


(P(signs ≤ observed) < 0.01 (or 0.05)). The run test conditional on the sum of positive and negative residuals is similarly weak, but the run test conditional on observed positive and negative residuals is quite reliable, especially if the sample size is fairly large (> 20?). Reject if P(runs ≤ observed) is < 0.01 (1%) (or < 0.05 (5%)).
3.2.6.5 How good is the fit?

If you set s = 1, WSSQ/NDOF should be about the same as the (constant) variance of y. You can consider rejecting the fit if there is poor agreement in a variance ratio test. If s = sample standard deviation of y (which may be the best choice?), then WSSQ is approximately chi-square and should be around NDOF. Relative residuals do not depend on s. They should not be larger than 25%, there should not be too many symbols ***, ****, or ***** in the residuals table and also, the average relative residual should not be much larger than 10%. These, and the R-squared test, are all convenient tests for the magnitude of the difference between your data and the best-fit curve. A graph of the best-fit curve should show the data scattered randomly above and below the fitted curve, and the number of positive and negative residuals should be about the same. The table of residuals should be free from long runs of the same signs, and the plot of weighted residuals against independent variable should be like a sample from a normal distribution with μ = 0 and σ = 1, as judged by the Shapiro-Wilks test, and the normal or half normal plots. The sign, run and Durbin-Watson tests help you to detect any correlations in the residuals.
3.2.6.6 Using graphical deconvolution to assess goodness of fit

Many decisions depend on differentiating nested models, e.g., polynomials, or models in a sequence of increasing order, e.g., sums of Michaelis-Mentens or exponentials, and you should always use the option in qnfit, exfit, mmfit and hlfit (see page 271) to plot the terms in the sum as what can loosely be described as a graphical deconvolution before accepting the results of an F test to support a richer model. The advantage of the graphical deconvolution technique is that you can visually assess the contribution of individual component functions to the overall sum. Many who have concluded that three exponentials or three binding constants were justified on statistical grounds would immediately revise their opinion after inspecting a graphical deconvolution.
3.2.6.7 Testing for differences between two parameter estimates

This can sometimes be a useful simple procedure when you wish to compare two parameters resulting from a regression, e.g., the final size from fitting a growth curve model, or perhaps two parameters that have been derived from regression parameters, e.g., AUC from fitting an exponential model, or LD50 from bioassay. You input the two parameters, the standard error estimates, the total number of experimental observations, and the number of parameters estimated from the regression. A t test (page 91) for equality is then performed using the correction for unequal variances. Such t tests depend on the asymptotic normality of maximum likelihood parameters, and will only be meaningful if the data set is fairly large and the best fit model adequately represents the data. Furthermore, t tests on parameter estimates are especially unreliable because they ignore non-zero covariances in the estimated parameter variance-covariance matrix.
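The following Python sketch shows one common form of such a test, a t test with the Welch unequal-variance correction and the Welch-Satterthwaite degrees of freedom; the numbers are invented, and the exact correction used by SIMFIT may differ in detail.

    # Minimal sketch: comparing two parameter estimates from separate fits.
    from scipy.stats import t as tdist

    p1, se1, n1, m1 = 2.10, 0.15, 30, 3   # estimate, std. error, points, parameters
    p2, se2, n2, m2 = 1.72, 0.11, 25, 3
    T = (p1 - p2) / (se1**2 + se2**2) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se1**2 + se2**2) ** 2 / (se1**4 / (n1 - m1) + se2**4 / (n2 - m2))
    p = 2 * tdist.sf(abs(T), df)          # two tail p value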
3.2.6.8 Testing for differences between several parameter estimates

To take some account of the effect of significant off-diagonal terms in the estimated parameter variance-covariance matrix you will need to calculate a Mahalanobis distance between parameter estimates, e.g., to test if two or more curve fits using the same model but with different data sets support the presence of significant treatment effects. For instance, after fitting the logistic equation to growth data by nonlinear regression, you may wish to see if the growth rates, final asymptotic size, half-time, etc. have been affected by the treatment. Note that, after every curve fit, you can select an option to add the current parameters and covariance matrix to your parameter covariance matrix project archive, and also you have the opportunity to select previous fits


to compare with the current fit. For instance, you may wish to compare two fits with m parameters, θ̂A in the first set with estimated covariance matrix CA and θ̂B in the second set with estimated covariance matrix CB. The parameter comparison procedure will then perform a t test for each pair of parameters, and also calculate the quadratic form

   Q = (θ̂A − θ̂B)^T (CA + CB)^(−1) (θ̂A − θ̂B)

which has an approximate chi-square distribution with m degrees of freedom. You should realize that the rule of thumb test using non-overlapping confidence regions is more conservative than the above t test; parameters can still be significantly different despite a small overlap of confidence windows. This technique must be used with care when the models are sums of functions such as exponentials, Michaelis-Menten terms, High-Low affinity site isotherms, Gaussians, trigonometric terms, and so on. This is because the parameters are only unique up to a permutation. For instance, the terms Ai and ki are linked in the exponential function

   f(t) = Σ_{i=1}^{m} Ai exp(−ki t)

but the order implied by the index i is arbitrary. So, when testing if A1 from fitting a data set is the same as A1 from fitting another data set it is imperative to compare the same terms. The user friendly programs exfit, mmfit, and hlfit attempt to assist this testing procedure by rearranging the results into increasing order of amplitudes Ai but, to be sure, it is best to use qnfit, where starting estimates and parameter constraints can be used from a parameter limits file. That way there is a better chance that parameters and covariance matrices saved to project archives for retrospective testing for equality of parameters will be consistent, i.e. the parameters will be compared in the correct order.
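A minimal Python sketch of the quadratic form Q, with invented parameter vectors and covariance matrices for two fits of a two-parameter model:

    # Minimal sketch: approximate chi-square test on two parameter sets.
    import numpy as np
    from scipy.stats import chi2

    thA = np.array([1.05, 0.48])
    thB = np.array([0.93, 0.55])
    CA = np.array([[0.004, 0.001], [0.001, 0.003]])
    CB = np.array([[0.005, 0.002], [0.002, 0.004]])

    d = thA - thB
    Q = d @ np.linalg.solve(CA + CB, d)
    p = chi2.sf(Q, df=len(d))     # approximate chi-square with m = 2 dof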


3.3 Linear regression


Program linfit fits a multilinear model in the form

   y = θ0 x0 + θ1 x1 + θ2 x2 + ... + θm xm,

where x0 = 1, but you can choose interactively whether or not to include a constant term θ0, you can decide which variables are to be included, and you can use a weighting scheme if this is required. For each regression sub-set, you can observe the parameter estimates and standard errors, R-squared, Mallows Cp (page 62), and ANOVA table, to help you decide which combinations of variables are the most significant. Unlike nonlinear regression, multilinear regression, which is based on the assumptions

   Y = Xθ + ε, E(ε) = 0, Var(ε) = σ²I,

allows us to introduce the hat matrix

   H = X(X^T X)^(−1) X^T,

then define the leverages hii, which can be used to assess influence, and the studentized residuals

   Ri = ri / (σ̂ √(1 − hii)),

which may offer some advantages over ordinary residuals ri for goodness of fit assessment from residuals plots. In the event of weighting being required, Y, X and ε above are simply replaced by W^(1/2)Y, W^(1/2)X, and W^(1/2)ε. Model discrimination is particularly straightforward with multilinear regression, and it is often important to answer questions like these.

- Is a particular parameter well-determined?
- Should a particular parameter be included in the regression?
- Is a particular parameter estimate significantly different between two data sets?
- Does a set of parameter estimates differ significantly between two data sets?
- Are two data sets with the same variables significantly different?

To assess the reliability of any parameter estimate, SIMFIT lists the estimated standard error, the 95% confidence limits, and the two tail p value for a t test (page 91). If the p value for a parameter is appreciably greater than 0.05, the parameter can be regarded as indistinguishable from zero, so you should consider suppressing the corresponding variable from the regression and fitting again. To select a minimum significant set of variables for regression you should perform the F test (page 107) and, to simplify the systematic use of this procedure, SIMFIT allows you to save sets of parameters and objective functions after a regression so that these can be recalled retrospectively for the F test. With large numbers of variables it is very tedious to perform all subsets regression and a simple expedient would be to start by fitting all variables, then consider for elimination any variables with ill-defined parameter estimates. It is often important to see if a particular parameter is significantly different between two data sets, and SIMFIT provides the facility to compare any two parameters in this way, as long as you know the values, standard errors, and numbers of experimental points. This procedure (page 40) is very over-simplified, as it ignores other parameters estimated in the regression, and their associated covariances. A more satisfactory procedure is to compare full sets of parameters estimated from regression using the same model with different data sets, and also using the estimated variance-covariance matrices. SIMFIT provides the facility to store parameters and variance-covariance matrices in this way (page 40), and this is a valuable technique to assess if two data sets are significantly different,


assuming that if two sets of parameter estimates are significantly different, then the data sets are significantly different. Two well documented and instructive data sets will be found in linfit.tf1 and linfit.tf2 but be warned; real life phenomena are nonlinear, and the multilinear model is seldom the correct one. In fact the data in linfit.tf1 have a singular design matrix, which deserves comment. It sometimes happens that data have been collected at settings of the variables that are not independent. This is particularly troublesome with what are known as (badly) designed experiments, e.g., 0 for female, 1 for male and so on. Another common error is where percentages or proportions in categories are taken as variables in such a way that a column in the design matrix is a linear combination of other columns, e.g., because all percentages add up to 100 or all proportions add up to 1. When the SIMFIT regression procedures detect singularity you will be warned that the covariance matrix is singular and that the parameter standard errors should be ignored. However, with multilinear regression, the fitting still takes place using the singular value decomposition (SVD) and parameters and standard errors are printed. However only some parameters will be estimable and you should redesign the experiment so that the covariance matrix is of full rank and all parameters are estimable. If SIMFIT warns you that a covariance matrix cannot be inverted or that SVD has to be used then you should not carry on regardless: the results are likely to be misleading so you should redesign the experiment so that all parameters are estimable. As an example, after fitting the test file linfit.tf2, table 3.1 results.

No. parameters = 5, Rank = 5, No. points = 13, No. deg. freedom = 8
Residual-SSQ = 4.79E+01, Mallows Cp = 5.000E+00, R-squared = 0.9824

Parameter   Value        ..95% conf. limits..      Std.error    p
Constant    6.241E+01   -9.918E+01   2.240E+02    7.007E+01    0.3991 ***
B( 1)       1.551E+00   -1.663E-01   3.269E+00    7.448E-01    0.0708 *
B( 2)       5.102E-01   -1.159E+00   2.179E+00    7.238E-01    0.5009 ***
B( 3)       1.019E-01   -1.638E+00   1.842E+00    7.547E-01    0.8959 ***
B( 4)      -1.441E-01   -1.779E+00   1.491E+00    7.091E-01    0.8441 ***

Number   Y-value      Theory       Residual      Leverage     Studentized
 1       7.850E+01    7.850E+01    4.760E-03     5.503E-01    2.902E-03
 2       7.430E+01    7.279E+01    1.511E+00     3.332E-01    7.566E-01
 3       1.043E+02    1.060E+02   -1.671E+00     5.769E-01   -1.050E+00
 4       8.760E+01    8.933E+01   -1.727E+00     2.952E-01   -8.411E-01
 5       9.590E+01    9.565E+01    2.508E-01     3.576E-01    1.279E-01
 6       1.092E+02    1.053E+02    3.925E+00     1.242E-01    1.715E+00
 7       1.027E+02    1.041E+02   -1.449E+00     3.671E-01   -7.445E-01
 8       7.250E+01    7.567E+01   -3.175E+00     4.085E-01   -1.688E+00
 9       9.310E+01    9.172E+01    1.378E+00     2.943E-01    6.708E-01
10       1.159E+02    1.156E+02    2.815E-01     7.004E-01    2.103E-01
11       8.380E+01    8.181E+01    1.991E+00     4.255E-01    1.074E+00
12       1.133E+02    1.123E+02    9.730E-01     2.630E-01    4.634E-01
13       1.094E+02    1.117E+02   -2.294E+00     3.037E-01   -1.124E+00

ANOVA
Source       NDOF   SSQ          Mean SSQ     F-value      p
Total        12     2.716E+03
Regression    4     2.668E+03    6.670E+02    1.115E+02    0.0000
Residual      8     4.786E+01    5.983E+00

Table 3.1: Multilinear regression

From the table of parameter estimates it is clear that the estimated parameter confidence limits all include zero, and that all the parameter p values


all exceed 0.05. So none of the parameters are particularly well-determined. However, from the Cp value, half normal residuals plot and ANOVA table, with overall p value less than 0.05 for the F value, it appears that a multilinear model does fit the overall data set reasonably well. The fact that the leverages are all of similar size and that none of the studentized residuals are of particularly large absolute magnitude (all less than 2) suggests that none of the data points could be considered as outliers. Note that program linfit also lets you explore generalized linear modelling (GLM), reduced major axis regression (minimizing the sum of areas of triangles formed between the best-fit line and data points), orthogonal regression (minimizing the sum of squared distances between the best-fit line and the data points), and robust regression in the L1 or L∞ norms as alternatives to the usual L2 norm.
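The hat matrix quantities just described are easy to verify directly; a minimal Python sketch with a small invented unweighted data set:

    # Minimal sketch: hat matrix, leverages and studentized residuals.
    import numpy as np

    X = np.column_stack([np.ones(6), [1., 2, 3, 4, 5, 6], [2., 1, 4, 3, 6, 5]])
    y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    h = np.diag(H)                          # leverages h_ii
    r = y - H @ y                           # ordinary residuals
    n, m = X.shape
    s = np.sqrt(r @ r / (n - m))            # residual standard deviation
    R = r / (s * np.sqrt(1 - h))            # studentized residuals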


3.4 Robust regression


Robust techniques are required when in the linear model

   Y = Xθ + ε

the errors ε are not normally distributed. There are several alternative techniques available, arising from a consideration of the objective function to be minimized by the fitting technique. One approach seeks to suppress the well known effect that outliers bias parameter estimates by down-weighting extreme observations, and yet another technique uses regression on ranks. First we consider the p-norm of an n-vector x, which is defined as

   ||x||p = (Σ_{i=1}^{n} |xi|^p)^(1/p),

and the objective functions required for maximum likelihood estimation. There are three cases.

1. The L1 norm. If the errors are bi-exponentially distributed (page 289) then the correct objective function is the sum of the absolute values of all the residuals

   Σ_{i=1}^{n} |Y − Xθ̂|i.

2. The L2 norm. If the errors are normally distributed with known variances σi (page 286) then the correct objective function, just as in weighted least squares, is

   Σ_{i=1}^{n} [(Y − Xθ̂)i/σi]².

3. The L∞ norm. If the errors are uniformly distributed (page 286) then the correct objective function is the largest absolute residual

   max_{i=1,...,n} |Y − Xθ̂|i.

Although SIMFIT provides options for L1 and L∞ fitting by iterative techniques, parameter standard error estimates and analysis of variance tables are not calculated. The techniques can be used to assess how serious the effects of outliers are in any given data set, but otherwise they should be reserved for situations where either a bi-exponential or uniform distribution of errors seems more appropriate than a normal distribution of errors. In actual experiments it is often the case that there are more errors in the tails of the distribution than are consistent with a normal distribution, so that the error distribution is more like a Cauchy distribution (page 288) than a normal distribution. For such circumstances a variety of techniques exist for dampening down the effect of such outliers. One such technique is bounded influence regression, which leads to M-estimates, but it should be obvious that Garbage In = Garbage Out. No automatic technique can extract meaningful results from bad data and, where possible, serious experimentalists should, of course, strive to identify outliers and remove them from the data set, rather than use automatic techniques to suppress the influence of outliers by down-weighting. Robust regression is a technique for when all else fails, outliers cannot be identified independently and deleted, so the experimentalist is forced to analyze noisy data sets where analysis of the error structure is not likely to be meaningful as


sufficient replicates cannot be obtained. An abstract of the NAG G02HAF documentation is now given to clarify the options, and this should be consulted for more details and references. If ri are calculated residuals, wi are estimated weights, ψ(.) and χ(.) are specified functions, φ(.) is the unit normal density and Φ(.) the corresponding distribution function, σ is a parameter to be estimated, and β1, β2 are constants, then there are three main possibilities.

1. Schweppe regression

   Σ_{i=1}^{n} ψ(ri/(σwi)) wi xij = 0, j = 1, 2, ..., m
   Φ(β1) = 0.75
   β2 = (1/n) Σ_{i=1}^{n} wi² ∫ χ(z/wi) φ(z) dz

2. Huber regression

   Σ_{i=1}^{n} ψ(ri/σ) xij = 0, j = 1, 2, ..., m
   Φ(β1) = 0.75
   β2 = ∫ χ(z) φ(z) dz

3. Mallows regression

   Σ_{i=1}^{n} ψ(ri/σ) wi xij = 0, j = 1, 2, ..., m
   (1/n) Σ_{i=1}^{n} Φ(β1/√wi) = 0.75
   β2 = (1/n) Σ_{i=1}^{n} wi ∫ χ(z) φ(z) dz

The estimate for σ can be obtained at each iteration by the median absolute deviation of the residuals

   σ̂ = median_i(|ri|)/β1

or as the solution to

   Σ_{i=1}^{n} χ(ri/(σ̂wi)) wi² = (n − k)β2

where k is the column rank of X. For the iterative weighted least squares regression used for the estimates there are several possibilities for the functions ψ and χ, some requiring additional parameters c, h1, h2, h3, and d.

(a) Unit weights, equivalent to least squares:

   ψ(t) = t, χ(t) = t²/2

(b) Huber's function:

   ψ(t) = max(−c, min(c, t))
   χ(t) = t²/2, |t| ≤ d; d²/2, |t| > d


(c) Hampel's piecewise linear function:

   ψ_{h1,h2,h3}(t) = t, 0 ≤ |t| ≤ h1
                   = h1 sign(t), h1 ≤ |t| ≤ h2
                   = h1 sign(t)(h3 − |t|)/(h3 − h2), h2 ≤ |t| ≤ h3
                   = 0, h3 < |t|
   χ(t) = t²/2, |t| ≤ d; d²/2, |t| > d

(d) Andrews' sine wave function:

   ψ(t) = sin t, −π ≤ t ≤ π; 0, |t| > π
   χ(t) = t²/2, |t| ≤ d; d²/2, |t| > d

(e) Tukey's bi-weight:

   ψ(t) = t(1 − t²)², |t| ≤ 1; 0, |t| > 1
   χ(t) = t²/2, |t| ≤ d; d²/2, |t| > d

Weights wi require the definition of functions u(.) and f(.) as follows.

(i) Krasker-Welsch weights:

   u(t) = g1(c/t)
   g1(t) = t² + (1 − t²)(2Φ(t) − 1) − 2tφ(t)
   f(t) = 1/t

(ii) Maronna's weights:

   u(t) = c/t², |t| > c; 1, |t| ≤ c
   f(t) = √u(t)

Finally, in order to estimate the parameter covariance matrix, two diagonal matrices D and P are required as follows.

1. Average over the ri

   Schweppe:  Di = [(1/n) Σ_{j=1}^{n} ψ′(rj/(σ̂wi))] wi
              Pi = [(1/n) Σ_{j=1}^{n} ψ²(rj/(σ̂wi))] wi²

   Mallows:   Di = [(1/n) Σ_{j=1}^{n} ψ′(rj/σ̂)] wi
              Pi = [(1/n) Σ_{j=1}^{n} ψ²(rj/σ̂)] wi²

2. Replace expected value by observed

   Schweppe:  Di = ψ′(ri/(σ̂wi)) wi
              Pi = ψ²(ri/(σ̂wi)) wi²

   Mallows:   Di = ψ′(ri/σ̂) wi
              Pi = ψ²(ri/σ̂) wi²

Table 3.2 illustrates the use of robust regression using the test file g02haf.tf1. Note that the output lists all the settings used to configure NAG routine G02HAF and, in addition, it also presents the type of results usually associated with standard multilinear regression. Of course these calculations should be interpreted with


G02HAF settings:
INDW > 0, Schweppe with Krasker-Welsch weights
IPSI = 2, Hampel piecewise linear
ISIG > 0, sigma using the chi function
INDC = 0, Replacing expected by observed
H1 = 1.5000E+00, H2 = 3.0000E+00, H3 = 4.5000E+00
CUCV = 3.0000E+00, DCHI = 1.5000E+00, TOL = 5.0000E-05

No. parameters = 3, Rank = 3, No. points = 8, No. deg. freedom = 5
Residual-SSQ = 4.64E-01, Mallows Cp = 3.000E+00, R-squared = 0.9844
Final sigma value = 2.026E-01

Parameter   Value       ..95% conf. limits..     Std.error    p
Constant    4.042E+00   3.944E+00   4.141E+00    3.840E-02    0.0000
B( 1)       1.308E+00   1.238E+00   1.378E+00    2.720E-02    0.0000
B( 2)       7.519E-01   6.719E-01   8.319E-01    3.112E-02    0.0000

Number   Y-value      Theory       Residual      Weighting
1        2.100E+00    1.982E+00    1.179E-01     5.783E-01
2        3.600E+00    3.486E+00    1.141E-01     5.783E-01
3        4.500E+00    4.599E+00   -9.872E-02     5.783E-01
4        6.100E+00    6.103E+00   -2.564E-03     5.783E-01
5        1.300E+00    1.426E+00   -1.256E-01     4.603E-01
6        1.900E+00    2.538E+00   -6.385E-01     4.603E-01
7        6.700E+00    6.659E+00    4.103E-02     4.603E-01
8        5.500E+00    5.546E+00   -4.615E-02     4.603E-01

ANOVA
Source       NDOF   SSQ          Mean SSQ     F-value      p
Total         7     2.966E+01
Regression    2     2.919E+01    1.460E+01    1.573E+02    0.0000
Residual      5     4.639E-01    9.278E-02

Table 3.2: Robust regression

great caution if the data sample has many outliers, or has errors that depart widely from a normal distribution. It should be noted that, in the SIMFIT implementation of this technique, the starting estimates required for the iterations used by the robust estimation are first calculated automatically by a standard multilinear regression. Another point worth noting is that users of all SIMFIT multilinear regression analysis can either supply a matrix with a first column of x1 = 1 and suppress the option to include a constant in the regression, or omit the first column for x1 = 1 from the data file, whereupon SIMFIT will automatically add such a column, and do all the necessary adjustments for degrees of freedom.
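To see informally why suppressing outliers matters, the following Python sketch (invented data; a crude illustration, not the NAG G02HAF M-estimation described above) compares an ordinary least squares (L2) straight-line fit with an L1 fit on a data set containing one gross outlier.

    # Minimal sketch: L1 versus L2 fitting in the presence of an outlier.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([0., 1, 2, 3, 4, 5])
    y = np.array([0.1, 1.0, 2.1, 2.9, 4.2, 15.0])   # last point is an outlier
    X = np.column_stack([np.ones_like(x), x])

    theta_l2, *_ = np.linalg.lstsq(X, y, rcond=None)
    theta_l1 = minimize(lambda th: np.abs(y - X @ th).sum(), theta_l2,
                        method="Nelder-Mead").x
    # theta_l1 stays close to unit slope; theta_l2 is dragged by the outlier.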


3.5 Regression on ranks


It is possible to perform regression where observations are replaced by ranks, as illustrated for test data g08raf.tf1 and g08rbf.tf1 in table 3.3.

File: G08RAF.TF1 (1 sample, 20 observations)
parameters = 2, distribution = logistic
CTS = 8.221E+00, NDOF = 2, P(chi-sq >= CTS) = 0.0164
Score        Estimate      Std.Err.     z-value       p
-1.048E+00   -8.524E-01    1.249E+00    -6.824E-01    0.4950 ***
 6.433E+01    1.139E-01    4.437E-02     2.567E+00    0.0103
CV matrices: upper triangle for scores, lower for parameters
 6.7326E-01   -4.1587E+00
 1.5604E+00    5.3367E+02
 1.2160E-02    1.9686E-03

File: G08RBF.TF1 (1 sample, 40 observations)
parameters = 1, gamma = 1.0e-05
CTS = 2.746E+00, NDOF = 1, P(chi-sq >= CTS) = 0.0975
Score        Estimate     Std.Err.     z-value      p
4.584E+00    5.990E-01    3.615E-01    1.657E+00    0.0975 *
CV matrices: upper triangle for scores, lower for parameters
 7.6526E+00
 1.3067E-01

Table 3.3: Regression on ranks

It is assumed that a monotone increasing differentiable transformation h exists such that

   h(Yi) = xi^T β + εi

for observations Yi given explanatory variables xi, parameters β, and random errors εi, when the following can be estimated, and used to test H0: β = 0.

- X^T a, the score statistic.
- X^T(B − A)X, the estimated variance-covariance matrix of the scores.
- β̂ = MX^T a, the estimated parameters.
- M = (X^T(B − A)X)^(−1), the estimated variance-covariance matrix of β̂.
- CTS = β̂^T M^(−1) β̂, the chi-square test statistic.
- β̂i/√(Mii), the approximate z statistics.

Here M and a depend on the ranks of the observations, while B depends on the distribution of ε, which can be assumed to be normal, logistic, extreme value, or double-exponential, when there is no censoring. However, table 3.3 also displays the results from analyzing g08rbf.tf1, which contains right censored data. That is, some of the measurements were capped, i.e., an upper limit to measurement existed, and all that can be said is that an observation was at least as large as this limit. In such circumstances a parameter γ must be set to model the distribution of errors as a generalized logistic distribution, where as γ tends to zero a skew extreme value is approached, when γ equals one the distribution is symmetric logistic, and when γ tends to infinity the negative exponential distribution is assumed.


3.6 Generalized linear models (GLM)


This technique is intermediate between linear regression, which is trivial and gives uniquely determined parameter estimates but is rarely appropriate, and nonlinear regression, which is very hard and does not usually give unique parameter estimates, but is justified with normal errors and a known model. The SIMFIT generalized linear models interface can be used from gcfit, linfit or simstat as it finds many applications, ranging from bioassay to survival analysis. To understand the motivation for this technique, it is usual to refer to a typical doubling dilution experiment in which diluted solutions from a stock containing infected organisms are plated onto agar in order to count infected plates, and hence estimate the number of organisms in the stock. Suppose that before dilution the stock had N organisms per unit volume, then the number per unit volume after x = 0, 1, ..., m dilutions will follow a Poisson distribution with mean

   μx = N/2^x.

Now the chance of a plate receiving no organisms at dilution x is the first term in the Poisson distribution, that is exp(−μx), so if px is the probability of a plate becoming infected at dilution x, then

   px = 1 − exp(−μx), x = 0, 1, 2, ..., m.

Evidently, where the px have been estimated as proportions from yx infected plates out of nx plated at dilution x, then N can be estimated using

   log[−log(1 − px)] = log N − x log 2

considered as a maximum likelihood fitting problem of the type

   log[−log(1 − px)] = β0 + β1 x

where the errors in estimated proportions px = yx/nx are binomially distributed. So, to fit a generalized linear model, you must have independent evidence to support your choice for an assumed error distribution for the dependent variable Y from the following possibilities:

- normal
- binomial
- Poisson
- gamma

in which it is supposed that the expectation of Y is to be estimated, i.e., E(Y) = μ. The associated pdfs are parameterized as follows.

   normal:   fY = (1/(σ√(2π))) exp(−(y − μ)²/(2σ²))
   binomial: fY = C(N,y) π^y (1 − π)^(N−y)
   Poisson:  fY = μ^y exp(−μ)/y!
   gamma:    fY = (1/Γ(ν)) (νy/μ)^ν exp(−νy/μ) (1/y)
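The doubling dilution argument can be checked numerically; in the following Python sketch (with an assumed stock density N = 1000, an arbitrary choice) the transformed infection probabilities fall on a straight line with slope −log 2 and intercept log N.

    # Minimal sketch: the complementary log-log linearization of a
    # doubling dilution experiment.
    import numpy as np

    N = 1000.0                       # organisms per unit volume in the stock
    x = np.arange(8)                 # number of doubling dilutions
    mu = N / 2.0**x
    p = 1.0 - np.exp(-mu)            # P(plate infected) at dilution x
    ok = (p > 0.0) & (p < 1.0)       # keep dilutions with usable proportions
    z = np.log(-np.log(1.0 - p[ok]))
    slope, intercept = np.polyfit(x[ok], z[ok], 1)
    print(np.exp(intercept), slope / -np.log(2.0))   # ~1000 and ~1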

It is a mistake to make the usual unwarranted assumption that measurements imply a normal distribution, while proportions imply a binomial distribution, and counting processes imply a Poisson distribution, unless the error distribution assumed has been verified for your data. Another very questionable assumption that has


to be made is that a predictor function η exists, which is a linear function of the m covariates, i.e., independent explanatory variables, as in

   η = Σ_{j=1}^{m} βj xj.

Finally, yet another dubious assumption must be made, that a link function g(μ) exists between the expected value of Y and the linear predictor. The choice for g(μ) = η depends on the assumed distribution as follows. For the binomial distribution, where y successes have been observed in N trials, the link options are the logistic, probit or complementary log-log

   logistic: η = log(μ/(N − μ))
   probit: η = Φ^(−1)(μ/N)
   complementary log-log: η = log(−log(1 − μ/N)).

Where observed values can have only one of two values, as with binary or quantal data, it may be wished to perform binary logistic regression. This is just the binomial situation where y takes values of 0 or 1, N is always set equal to 1, and the logistic link is selected. However, for the normal, Poisson and gamma distributions the link options are

   exponent: η = μ^a
   identity: η = μ
   log: η = log(μ)
   square root: η = √μ
   reciprocal: η = 1/μ.

In addition to these possibilities, you can supply weights and install an offset vector along with the data set, the regression can include a constant term if requested, the constant exponent a in the exponent link can be altered, and variables can be selected for inclusion or suppression in an interactive manner. However, note that the same strictures apply as for all regressions: you will be warned if the SVD has to be used due to rank deficiency and you should redesign the experiment until all parameters are estimable and the covariance matrix has full rank, rather than carry on with parameters and standard errors of limited value.

3.6.1 GLM examples


The test files to investigate the GLM functions are:

   glm.tf1: normal error and reciprocal link
   glm.tf2: binomial error and logistic link (logistic regression)
   glm.tf3: Poisson error and log link
   glm.tf4: gamma error and reciprocal link

where the data format for k variables, observations y and weightings s is

   x1, x2, ..., xk, y, s

except for the binomial error which has

   x1, x2, ..., xk, y, N, s

for y successes in N independent Bernoulli trials. Note that the weights w used are actually w = 1/s² if advanced users wish to employ weighting, e.g., using s as the reciprocal of the square root of the number of replicates for replicate weighting, except that when s ≤ 0 the corresponding data points are suppressed. Also,


observe the alternative measures of goodness of fit, such as residuals, leverages and deviances. The residuals ri, sums of squares SSQ and deviances di and overall deviance DEV depend on the error types as indicated in the examples.

GLM example 1: G02GAF, normal errors

Table 3.4 has the results from fitting a reciprocal link with mean but no offsets to glm.tf1. Note that the

No. parameters = 2, Rank = 2, No. points = 5, Deg. freedom = 3
Parameter   Value        ..95% conf. limits..      Std.error    p
Constant   -2.387E-02   -3.272E-02   -1.503E-02    2.779E-03    0.0033
B( 1)       6.381E-02    5.542E-02    7.221E-02    2.638E-03    0.0002
WSSQ = 3.872E-01, S = 1.291E-01, A = 1.000E+00

Number   Y-value      Theory       Deviance      Leverages
1        2.500E+01    2.504E+01   -3.866E-02     9.954E-01
2        1.000E+01    9.639E+00    3.613E-01     4.577E-01
3        6.000E+00    5.968E+00    3.198E-02     2.681E-01
4        4.000E+00    4.322E+00   -3.221E-01     1.666E-01
5        3.000E+00    3.388E+00   -3.878E-01     1.121E-01

Table 3.4: GLM example 1: normal errors

scale factor (S = σ²) can be input or estimated using the residual sum of squares SSQ defined as follows

   normal error: ri = yi − μ̂i
   SSQ = Σ_{i=1}^{n} ri².

GLM example 2: G02GBF, binomial errors

Table 3.5 shows the results from fitting a logistic link and mean but no offsets to glm.tf2. The estimates are

No. parameters = 2, Rank = 2, No. points = 3, Deg. freedom = 1
Parameter   Value        ..95% conf. limits..      Std.error    p
Constant   -2.868E+00   -4.415E+00   -1.322E+00    1.217E-01    0.0270
B( 1)      -4.264E-01   -2.457E+00    1.604E+00    1.598E-01    0.2283 ***
Deviance = 7.354E-02

Number   Y-value      Theory       Deviance      Leverages
1        1.900E+01    1.845E+01    1.296E-01     7.687E-01
2        2.900E+01    3.010E+01   -2.070E-01     4.220E-01
3        2.400E+01    2.345E+01    1.178E-01     8.092E-01

Table 3.5: GLM example 2: binomial errors

defined as follows

   binomial error: di = 2[ yi log(yi/μ̂i) + (ti − yi) log((ti − yi)/(ti − μ̂i)) ]
   ri = sign(yi − μ̂i) √di
   DEV = Σ_{i=1}^{n} di.
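A minimal Python sketch of these binomial deviance residuals, with invented successes y, trial totals t and fitted means:

    # Minimal sketch: binomial deviance residuals and overall deviance.
    import numpy as np

    y = np.array([19., 29, 24])           # observed successes
    t = np.array([40., 60, 50])           # trial totals (assumed values)
    mu = np.array([18.45, 30.10, 23.45])  # fitted means

    d = 2 * (y * np.log(y / mu) + (t - y) * np.log((t - y) / (t - mu)))
    r = np.sign(y - mu) * np.sqrt(d)
    DEV = d.sum()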


GLM example 3: G02GCF, Poisson errors

Table 3.6 shows the results from fitting a log link and mean but no offsets to glm.tf3. The definitions are

No. parameters = 9, Rank = 7, No. points = 15, Deg. freedom = 8
Parameter   Value        ..95% conf. limits..      Std.error    p
Constant    2.598E+00    2.538E+00    2.657E+00    2.582E-02    0.0000
B( 1)       1.262E+00    1.161E+00    1.363E+00    4.382E-02    0.0000
B( 2)       1.278E+00    1.177E+00    1.378E+00    4.362E-02    0.0000
B( 3)       5.798E-02   -9.595E-02    2.119E-01    6.675E-02    0.4104 ***
B( 4)       1.031E+00    9.036E-01    1.158E+00    5.509E-02    0.0000
B( 5)       2.910E-01    1.223E-01    4.598E-01    7.317E-02    0.0041
B( 6)       9.876E-01    8.586E-01    1.117E+00    5.593E-02    0.0000
B( 7)       4.880E-01    3.322E-01    6.437E-01    6.754E-02    0.0001
B( 8)      -1.996E-01   -4.080E-01    8.754E-03    9.035E-02    0.0582 *
Deviance = 9.038E+00, A = 1.000E+00

Number   Y-value      Theory       Deviance      Leverages
 1       1.410E+02    1.330E+02    6.875E-01     6.035E-01
 2       6.700E+01    6.347E+01    4.386E-01     5.138E-01
 3       1.140E+02    1.274E+02   -1.207E+00     5.963E-01
 4       7.900E+01    7.729E+01    1.936E-01     5.316E-01
 5       3.900E+01    3.886E+01    2.218E-02     4.820E-01
 6       1.310E+02    1.351E+02   -3.553E-01     6.083E-01
 7       6.600E+01    6.448E+01    1.881E-01     5.196E-01
 8       1.430E+02    1.294E+02    1.175E+00     6.012E-01
 9       7.200E+01    7.852E+01   -7.465E-01     5.373E-01
10       3.500E+01    3.948E+01   -7.271E-01     4.882E-01
11       3.600E+01    3.990E+01   -6.276E-01     3.926E-01
12       1.400E+01    1.904E+01   -1.213E+00     2.551E-01
13       3.800E+01    3.821E+01   -3.464E-02     3.815E-01
14       2.800E+01    2.319E+01    9.675E-01     2.825E-01
15       1.600E+01    1.166E+01    1.203E+00     2.064E-01

Table 3.6: GLM example 3: Poisson errors

   Poisson error: di = 2[ yi log(yi/μ̂i) − (yi − μ̂i) ]
   ri = sign(yi − μ̂i) √di
   DEV = Σ_{i=1}^{n} di.
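These definitions can be checked directly against table 3.6; the following Python sketch reproduces the first few deviance residuals from the rounded table values.

    # Minimal sketch: Poisson deviance residuals from table 3.6 values.
    import numpy as np

    y = np.array([141., 67, 114])
    mu = np.array([132.99, 63.47, 127.38])
    d = 2 * (y * np.log(y / mu) - (y - mu))
    r = np.sign(y - mu) * np.sqrt(d)
    print(r)    # approximately 0.6875, 0.4386, -1.2072 as in the table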

Note, however, that an error message is output to warn you that the solution is overdetermined, i.e., the parameters and standard errors are not unique. To understand this, we point out that the data in glm.tf3 are the representation of a contingency table using dummy indicator variables (page 56) as will be clear from table 3.7. Thus, in order to obtain unique parameter estimates, it is necessary to impose constraints so that the resulting constrained system is of full rank. Let the singular value decomposition (SVD) P be represented, as in G02GKF, by

   P = [ D^(−1) P1^T ]
       [ P0^T        ]

54

S IMFIT reference manual: Part 3

Test file loglin.tf1:

   141   67  114   79   39
   131   66  143   72   35
    36   14   38   28   16

Test file glm.tf3:

   1 0 0 1 0 0 0 0 141 1
   1 0 0 0 1 0 0 0  67 1
   1 0 0 0 0 1 0 0 114 1
   1 0 0 0 0 0 1 0  79 1
   1 0 0 0 0 0 0 1  39 1
   0 1 0 1 0 0 0 0 131 1
   0 1 0 0 1 0 0 0  66 1
   0 1 0 0 0 1 0 0 143 1
   0 1 0 0 0 0 1 0  72 1
   0 1 0 0 0 0 0 1  35 1
   0 0 1 1 0 0 0 0  36 1
   0 0 1 0 1 0 0 0  14 1
   0 0 1 0 0 1 0 0  38 1
   0 0 1 0 0 0 1 0  28 1
   0 0 1 0 0 0 0 1  16 1

Table 3.7: GLM contingency table analysis: 1

and suppose that there are m parameters and the rank is r, so that there need to be nc = m − r constraints, for example, in an m by nc matrix C where

   C^T β̂ = 0.

Then the constrained estimates β̂c are given in terms of the SVD parameters β̂svd by

   β̂c = A β̂svd
   A = I − P0 (C^T P0)^(−1) C^T,

while the variance-covariance matrix V is given by

   V = A P1 D^(−2) P1^T A^T,

provided that (C^T P0)^(−1) exists. This approach is commonly used in log-linear analysis of contingency tables, but it can be tedious to first fit the overdetermined Poisson GLM model then apply a matrix of constraints as just described. For this reason SIMFIT provides an automatic procedure (page 101) to calculate the dummy indicator matrix from the contingency table, then fit a log-linear model and apply the further constraints that the row sum and column sum are zero. Table 3.8 illustrates how this is done with loglin.tf1.
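Because the row-plus-column log-linear model with these constraints is the independence model, the fitted values and deviance in table 3.8 can be reproduced directly from the margins of the contingency table; a minimal Python sketch:

    # Minimal sketch: independence model fitted values and deviance for
    # the 3 by 5 contingency table in loglin.tf1.
    import numpy as np

    counts = np.array([[141., 67, 114, 79, 39],
                       [131., 66, 143, 72, 35],
                       [ 36., 14,  38, 28, 16]])
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    model = row @ col / counts.sum()     # expected values, e.g. 132.99
    DEV = 2 * (counts * np.log(counts / model)).sum()
    print(DEV)                           # approximately 9.038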

GLM example 4: G02GDF, gamma errors

Table 3.9 shows the results from fitting a reciprocal link and mean but no offsets to glm.tf4. Note that with gamma errors, the scale factor (ν^(−1)) can be input or estimated using the degrees of freedom, k, and

   ν̂^(−1) = Σ_{i=1}^{N} [(yi − μ̂i)/μ̂i]² / (N − k)

   gamma error: di = 2[ log(μ̂i/yi) + (yi − μ̂i)/μ̂i ]
   ri = 3(yi^(1/3) − μ̂i^(1/3))/μ̂i^(1/3)
   DEV = Σ_{i=1}^{n} di


no. rows = 3, no. columns = 5
Deviance (D) = 9.038E+00, deg.free. = 8, P(chi-sq>=D) = 0.3391
Parameter   Estimate     Std.Err.    ..95% con. lim....      p
Constant    3.983E+00    3.96E-02     3.89E+00    4.07E+00   0.0000
Row 1       3.961E-01    4.58E-02     2.90E-01    5.02E-01   0.0000
Row 2       4.118E-01    4.57E-02     3.06E-01    5.17E-01   0.0000
Row 3      -8.079E-01    6.22E-02    -9.51E-01   -6.64E-01   0.0000
Col 1       5.112E-01    5.62E-02     3.82E-01    6.41E-01   0.0000
Col 2      -2.285E-01    7.27E-02    -3.96E-01   -6.08E-02   0.0137 *
Col 3       4.680E-01    5.69E-02     3.37E-01    5.99E-01   0.0000
Col 4      -3.155E-02    6.75E-02    -1.87E-01    1.24E-01   0.6527 ***
Col 5      -7.191E-01    8.87E-02    -9.24E-01   -5.15E-01   0.0000

Data    Model     Delta     Residual   Leverage
141     132.99      8.01     0.6875    0.6035
 67      63.47      3.53     0.4386    0.5138
114     127.38    -13.38    -1.2072    0.5963
 79      77.29      1.71     0.1936    0.5316
 39      38.86      0.14     0.0222    0.4820
131     135.11     -4.11    -0.3553    0.6083
 66      64.48      1.52     0.1881    0.5196
143     129.41     13.59     1.1749    0.6012
 72      78.52     -6.52    -0.7465    0.5373
 35      39.48     -4.48    -0.7271    0.4882
 36      39.90     -3.90    -0.6276    0.3926
 14      19.04     -5.04    -1.2131    0.2551
 38      38.21     -0.21    -0.0346    0.3815
 28      23.19      4.81     0.9675    0.2825
 16      11.66      4.34     1.2028    0.2064

Table 3.8: GLM contingency table analysis: 2

No. parameters = 2, Rank = 2, No. points = 10, Deg. freedom = 8
Parameter   Value        ..95% conf. limits..      Std.error    p
Constant    1.441E+00   -8.812E-02    2.970E+00    6.630E-01    0.0615 *
B( 1)      -1.287E+00   -2.824E+00    2.513E-01    6.669E-01    0.0898 *
Adjusted Deviance = 3.503E+01, S = 1.074E+00, A = 1.000E+00

Number   Y-value      Theory       Deviance      Leverages
 1       1.000E+00    6.480E+00   -1.391E+00     2.000E-01
 2       3.000E-01    6.480E+00   -1.923E+00     2.000E-01
 3       1.050E+01    6.480E+00    5.236E-01     2.000E-01
 4       9.700E+00    6.480E+00    4.318E-01     2.000E-01
 5       1.090E+01    6.480E+00    5.678E-01     2.000E-01
 6       6.200E-01    6.940E-01   -1.107E-01     2.000E-01
 7       1.200E-01    6.940E-01   -1.329E+00     2.000E-01
 8       9.000E-02    6.940E-01   -1.482E+00     2.000E-01
 9       5.000E-01    6.940E-01   -3.106E-01     2.000E-01
10       2.140E+00    6.940E-01    1.366E+00     2.000E-01

Table 3.9: GLM example 4: gamma errors


3.6.2 The SIMFIT simplified Generalized Linear Models interface


Although generalized linear models have widespread use, specialized knowledge is sometimes required to prepare the necessary data files, weights, offsets, etc. For this reason, there is a simplified SIMFIT interface to facilitate the use of GLM techniques in such fields as the following.

- Bioassay, assuming a binomial distribution and using logistic, probit, or log-log models to estimate percentiles, such as the LD50 (page 72).
- Logistic regression and binary logistic regression.
- Logistic polynomial regression, generating new variables interactively as powers of an original covariate.
- Contingency table analysis, assuming Poisson errors and using log-linear analysis to quantify row and column effects (page 99).
- Survival analysis, using the exponential, Weibull, extreme value, and Cox (i.e., proportional hazard) models (page 176).

Of course, by choosing the advanced interface, users can always take complete control of the GLM analysis, but for many purposes the simplified interface will prove much easier to use for many routine applications. Some applications of the simplified interface will now be presented.

3.6.3 Logistic regression


Logistic regression is an application of the previously discussed GLM procedure assuming binomial errors and a logistic link. It is widely used in situations where there are binary variables and estimates of odds ratios or log odds ratios are required. A particularly useful application is in binary logistic regression where the yi values are all either 0 or 1 and all the Ni values are equal to 1, so that a probability p̂i is to be estimated as a function of some variables. Frequently the covariates are qualitative variables which can be included in the model by defining appropriate dummy indicator variables. For instance, suppose a factor has m levels, then we can define m dummy indicator variables x1, x2, ..., xm as in Table 3.10. The data file would be set up as if

   Level   x1   x2   x3   ...   xm
   1       1    0    0    ...   0
   2       0    1    0    ...   0
   3       0    0    1    ...   0
   ...     ...  ...  ...  ...   ...
   m       0    0    0    ...   1

Table 3.10: Dummy indicators for categorical variables

to estimate all m parameters for the m factor levels but because only m − 1 of the dummy indicator variables are independent, one of them would have to be suppressed if a constant were to be fitted, to avoid aliasing, i.e., the model would be overdetermined and the parameters could not be estimated uniquely. Suppose, for instance, that the model to be fitted was for a factor with three levels, i.e.,

   log(y/(1 − y)) = a0 + a1 x1 + a2 x2 + a3 x3

but with x1 suppressed. Then the estimated parameters could be interpreted as log odds ratios for the factor levels with respect to level 1, the suppressed reference level. This is because for probability estimates p̂1, p̂2


and p̂3 we would have the odds estimates

   p̂1/(1 − p̂1) = exp(a0)
   p̂2/(1 − p̂2) = exp(a0 + a2)
   p̂3/(1 − p̂3) = exp(a0 + a3)

and estimates for the corresponding log odds ratios involving only the corresponding estimated coefficients

   log[ (p̂2/(1 − p̂2)) / (p̂1/(1 − p̂1)) ] = a2
   log[ (p̂3/(1 − p̂3)) / (p̂1/(1 − p̂1)) ] = a3.

Even with quantitative, i.e., continuous data, the best-fit coefficients can always be interpreted as estimates for the log odds ratios corresponding to unit changes in the related covariates. As an example of simple binary logistic regression, fit the data in test file logistic.tf1 to obtain the results shown in table 3.11. The parameters are well determined and the further step was taken to calculate an

No. parameters = 3, Rank = 3, No. points = 39, Deg. freedom = 36
Parameter   Value        ..95% conf. limits..      Std.error    p
Constant   -9.520E+00   -1.606E+01   -2.981E+00    3.224E+00    0.0055
B( 1)       3.877E+00    9.868E-01    6.768E+00    1.425E+00    0.0100
B( 2)       2.647E+00    7.975E-01    4.496E+00    9.119E-01    0.0063
Deviance = 2.977E+01

x( 0) = 1.000E+00, coefficient = -9.520E+00 (the constant term)
x( 1) = 1.000E+00, coefficient = 3.877E+00
x( 2) = 1.000E+00, coefficient = 2.647E+00
Binomial N = 1
y(x) = 4.761E-02, Binomial probability p = 0.04761

Table 3.11: Binary logistic regression

expected frequency, given the parameter estimates. It frequently happens that, after fitting a data set, users wish to predict the binomial probability using the parameters estimated from the sample. That is, given the model

   log(y/(1 − y)) = β0 + β1 x1 + β2 x2 + ... + βm xm
                  = η,

where y is recorded as either 0 (failure) or 1 (success) in a single trial, then the binomial probability p̂ would be estimated as

   p̂ = exp(η̂)/(1 + exp(η̂)),

where η̂ is evaluated using parameter estimates with user supplied covariates. In this case, with a constant term and x1 = x2 = 1, then p̂ = 0.04761.
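This prediction is easy to reproduce from the coefficients in table 3.11; a minimal Python sketch:

    # Minimal sketch: predicted binomial probability at x1 = x2 = 1.
    import numpy as np

    beta = np.array([-9.520, 3.877, 2.647])   # constant, B(1), B(2)
    x = np.array([1.0, 1.0, 1.0])
    eta = beta @ x
    p = np.exp(eta) / (1.0 + np.exp(eta))
    print(p)                                   # approximately 0.0476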


3.6.4 Conditional binary logistic regression with stratified data


A special case of multivariate conditional binary logistic regression is in matched case control studies, where the data consist of strata with cases and controls, and it is wished to estimate the effect of covariates, after allowing for differing baseline constants in each stratum. Consider, for example, the case of s strata with nk cases and mk controls in the kth stratum. Then, for the jth person in the kth stratum with c-dimensional covariate vector xjk, the probability Pk(xjk) of being a case is

   Pk(xjk) = exp(αk + β^T xjk) / (1 + exp(αk + β^T xjk))

where αk is a stratum specific constant. Estimation of the c parameters βi can be accomplished by maximizing the conditional likelihood, without explicitly estimating the constants αk. As an example, fit the test file strata.tf1 to obtain the results shown in table 3.12. Note that the input file

No. parameters = 2, No. points = 7, Deg. freedom = 5
Parameter   Value        ..95% conf. limits..      Std.error    p
B( 1)      -5.223E-01   -4.096E+00    3.051E+00    1.390E+00    0.7226 ***
B( 2)      -2.674E-01   -2.446E+00    1.911E+00    8.473E-01    0.7651 ***
Deviance = 5.475E+00

Strata   Cases   Controls
1        2       2
2        1       2

Table 3.12: Conditional binary logistic regression

must use the s variable, i.e. the last column in the data file, to indicate the stratum to which the observation corresponds, and since the model fitted is incomplete, goodness of fit analysis is not available.


3.7 Nonlinear regression


All curve fitting problems in SIMFIT have similar features and the usual situation is as follows.

- You have prepared a data file using makfil with x, y, s, including all replicates and all s = 1, or else some sensible choice for the weighting factors s.
- You have some idea of a possible mathematical model, such as a double exponential function, for instance.
- You are prepared to consider simpler models (e.g., 1 exponential), or even more complicated models (3 exponentials), but only if the choice of model is justified by statistical tests.
- You want the program to take as much automatic control as is possible over starting parameters and data scaling, but you do want options for comprehensive goodness of fit criteria, residuals analysis and graph plotting.

The following sections take each of the user friendly programs in turn and suggest ways you can practise with the test files. Finally we briefly turn to specialized models, and comprehensive curve fitting for experts.

3.7.1 Exponentials
We now enlarge upon the preliminary discussion of exponential fitting given on page 27. The test file exfit.tf4 has data for the function
\[
f(t) = \exp(-t) + \exp(-5t)
\]
obtained by using adderr to add error to the exact data in exfit.tf3 prepared by makdat. So you read this data into exfit, select models of type 1, and then request exfit to fit 1 exponential then 2 exponentials, using a short random search. The result is shown in figure 3.1, namely, the fit with two exponentials is sufficiently better than the fit with 1 exponential that we can assume an acceptable model to be
\[
f(t) = A_1 \exp(-k_1 t) + A_2 \exp(-k_2 t).
\]
Figure 3.1: Fitting exponential functions (data and best-fit one- and two-exponential curves, f(t) against t)

Now do it again, but this time pay more attention to the goodness of fit criteria, residuals, parameter estimates and statistical analysis. Get the hang of the way SIMFIT does goodness of fit, residuals display, graph plotting and statistics, because all the curve fitting programs adopt a similar approach. As for the other test files, exfit.tf1 and exfit.tf2 are for 1 exponential, while exfit.tf5 and exfit.tf6 are double exponentials for models 5 and 6. Linked sequential exponential models should be fitted by program qnfit, not program exfit, since the time constants and amplitudes are not independent in such cases. Program exfit can also be used in pharmacokinetics to estimate time to half maximum response and AUC.
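To replicate this exercise outside SIMFIT, something like the following sketch can be used (SciPy, with made-up noise, seed and starting values, so all numbers here are illustrative assumptions rather than the exfit defaults).

```python
import numpy as np
from scipy.optimize import curve_fit

# Simulate data resembling exfit.tf4: f(t) = exp(-t) + exp(-5t) plus noise
rng = np.random.default_rng(0)
t = np.linspace(0.05, 1.75, 30)
y = np.exp(-t) + np.exp(-5.0 * t)
y_obs = y * (1.0 + 0.05 * rng.standard_normal(t.size))  # ~5% relative error

def one_exp(t, A, k):
    return A * np.exp(-k * t)

def two_exp(t, A1, k1, A2, k2):
    return A1 * np.exp(-k1 * t) + A2 * np.exp(-k2 * t)

p1, _ = curve_fit(one_exp, t, y_obs, p0=(2.0, 2.0))
p2, _ = curve_fit(two_exp, t, y_obs, p0=(1.0, 1.0, 1.0, 5.0))
print("1 exponential :", p1)
print("2 exponentials:", p2)
```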
3.7.1.1 How to interpret parameter estimates

The meaning of the results generated by program exfit after fitting two exponentials to exfit.tf4 will now be explained, as a similar type of analysis is generated by all the user-friendly curve fitting programs. Consider, first of all, Table 3.13 listing parameter estimates which result from fitting two exponentials. The first column gives the estimated values for the parameters, i.e., the amplitudes A(i) and decay constants k(i), although it must be appreciated that the pairwise order of these is arbitrary.


Parameter  Value       Std. error  ..95% Conf. Lim...        p
A(1)       8.526E-01   6.77E-02    7.13E-01    9.92E-01      0.000
A(2)       1.176E+00   7.48E-02    1.02E+00    1.33E+00      0.000
k(1)       6.793E+00   8.54E-01    5.04E+00    8.55E+00      0.000
k(2)       1.112E+00   5.11E-02    1.01E+00    1.22E+00      0.000
AUC        1.183E+00   1.47E-02    1.15E+00    1.21E+00      0.000
AUC is the area under the curve from t = 0 to t = infinity
Initial time point (A)   = 3.598E-02
Final time point (B)     = 1.611E+00
Area from t = A to t = B = 9.383E-01
Average over range (A,B) = 5.958E-01

Table 3.13: Fitting two exponentials: 1. parameter estimates

Actually program exfit will always try to rearrange the output so that the amplitudes are in increasing order, and a similar rearrangement will also occur with programs mmfit and hlfit. For situations where A(i) > 0 and k(i) > 0, the area from zero to infinity, i.e. the AUC, can be estimated, as can the area under the data range and the average function value (page 212) calculated from it. The parameter AUC is not estimated directly from the data, but is a secondary parameter estimated algebraically from the primary parameters. The standard errors of the primary parameters are obtained from the inverse of the estimated Hessian matrix at the solution point, but the standard error of the AUC is estimated from the partial derivatives of AUC with respect to the primary parameters, along with the estimated variance-covariance matrix (page 77). The 95% confidence limits are calculated from the parameter estimates and the t distribution (page 288), while the p values are the two-tail probabilities for the estimates, i.e., the probabilities that parameters as extreme or more extreme than the estimated ones could have resulted if the true parameter values were zero. The windows defined by the confidence limits are useful for a quick rule of thumb comparison with windows from fitting the same model to another data set; if the windows are disjoint then the corresponding parameters differ significantly, although there are more meaningful tests (page 40). Clearly, parameters with p < 0.05 are well defined, while parameters with p > 0.05 must be regarded as ill-determined. Expert users may sometimes need the estimated correlation matrix
\[
C_{ij} = \frac{CV_{i,j}}{\sqrt{CV_{i,i}\, CV_{j,j}}},
\]

where -1 <= C_{ij} <= 1 and C_{ii} = 1, which is shown in Table 3.14.

Parameter correlation matrix
 1.000
-0.876   1.000
-0.596   0.900   1.000
-0.848   0.949   0.820   1.000

Table 3.14: Fitting two exponentials: 2. correlation matrix
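To illustrate how the AUC standard error could be derived by the delta method described above, here is a hedged sketch for the double exponential, where AUC = A(1)/k(1) + A(2)/k(2); the parameter ordering and function name are assumptions, not SIMFIT conventions.

```python
import numpy as np

def auc_and_se(A1, A2, k1, k2, C):
    """AUC = A1/k1 + A2/k2 with a delta-method standard error.

    C is the estimated variance-covariance matrix of (A1, A2, k1, k2).
    """
    auc = A1 / k1 + A2 / k2                            # ~1.183 for table 3.13
    g = np.array([1/k1, 1/k2, -A1/k1**2, -A2/k2**2])   # gradient of AUC
    return auc, float(np.sqrt(g @ C @ g))
```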

3.7.1.2 How to interpret goodness of fit

Table 3.15, displaying the results from analyzing the residuals after fitting two exponentials to exfit.tf4, is typical of many SIMFIT programs. Residuals tables should always be consulted when assessing goodness of fit. Several points should be remembered when assessing such residuals tables, where there are N observations y_i, with weighting factors s_i, theoretical values f(x_i), residuals r_i = y_i - f(x_i), weighted residuals r_i/s_i, and where k parameters have been estimated.


Analysis of residuals: WSSQ  = 2.440E+01
P(chi-sq. >= WSSQ)           = 0.553
R-squared, cc(theory,data)^2 = 0.993
Largest Abs.rel.res.         = 11.99 %
Smallest Abs.rel.res.        = 0.52 %
Average Abs.rel.res.         = 3.87 %
Abs.rel.res. in range 10-20 % = 3.33 %
Abs.rel.res. in range 20-40 % = 0.00 %
Abs.rel.res. in range 40-80 % = 0.00 %
Abs.rel.res. > 80 %           = 0.00 %
No. res. < 0 (m)              = 15
No. res. > 0 (n)              = 15
No. runs observed (r)         = 16
P(runs =< r : given m and n)  = 0.576
5% lower tail point           = 11
1% lower tail point           = 9
P(runs =< r : given m plus n) = 0.644
P(signs =< least no. observed) = 1.000
Durbin-Watson test statistic  = 1.806
Shapiro-Wilks W (wtd. res.)   = 0.939
Significance level of W       = 0.084
Akaike AIC (Schwarz SC) stats = 1.798E+00 ( 7.403E+00)
Verdict on goodness of fit: incredible

Table 3.15: Fitting two exponentials: 3. goodness of fit statistics

- The chi-square test (page 98) using
\[
WSSQ = \sum_{i=1}^{N} \left( \frac{y_i - f(x_i)}{s_i} \right)^2
\]
is only meaningful if the weights defined by the s_i supplied for fitting are good estimates of the standard deviations of the observations at that level of the independent variable; say means of at least five replicates. Inappropriate weighting factors will result in a biased chi-square test. Also, if all the s_i are set equal to 1, unweighted regression will be performed and an alternative analysis based on the coefficient of variation will be performed.

- The R^2 value is the square of the correlation coefficient (page 132) between data and best fit points. It only represents a meaningful estimate of that proportion of the fit explained by the regression for simple unweighted linear models, and should be interpreted with restraint when nonlinear models have been fitted.

- The results based on the absolute relative residuals a_i, defined using machine precision ε as
\[
a_i = \frac{2 |r_i|}{\max(\epsilon, |y_i| + |f(x_i)|)},
\]
do not have statistical relevance, but they do have obvious empirical justification, and they must be interpreted with commonsense, especially where the data and/or theoretical values are very small.

- The probability of the number of runs observed given m negative and n positive residuals is a very useful test for randomly distributed runs (page 105), but the probability of runs given N = m + n, and also the overall sign test (page 104), are weak, except for very large data sets.


- The Durbin-Watson test statistic
\[
DW = \frac{\sum_{i=1}^{N-1} (r_{i+1} - r_i)^2}{\sum_{i=1}^{N} r_i^2}
\]
is useful for detecting serially correlated residuals, which could indicate correlated data or an inappropriate model. The expected value is 2.0, and values less than 1.5 suggest positive correlation, while values greater than 2.5 suggest negative serial correlation.

- Where N, the number of data points, significantly exceeds k, the number of parameters estimated, the weighted residuals are approximately normally distributed, and so the Shapiro-Wilks test (page 90) should be taken seriously.

- The Akaike AIC statistic
\[
AIC = N \log(WSSQ/N) + 2k
\]
and Schwarz Bayesian criterion SC
\[
SC = N \log(WSSQ/N) + k \log N
\]
are only meaningful if N log(WSSQ/N) is equivalent to -2[Maximum Likelihood]. Note that only differences between AIC with the same data, i.e. fixed N, are relevant, as in the evidence ratio ER, defined as
\[
ER = \exp[(AIC(1) - AIC(2))/2].
\]

- The final verdict is calculated from an empirical look-up table, where the position in the table is a weighted mean of scores allocated for each of the tests listed above. It is qualitative and rather conservative, and has no precise statistical relevance, but a good result will usually indicate a well-fitting model.

- As an additional measure, plots of residuals against theory, and half-normal residuals plots (figure 3.11), can be displayed after such residuals analysis, and they should always be inspected before concluding that any model fits satisfactorily.

- With linear models, SIMFIT also calculates studentized residuals and leverages, while with generalized linear models (page 50), deviance residuals can be tabulated.
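Several of these statistics are simple to compute directly. The sketch below (plain NumPy, an assumed helper rather than SIMFIT code) evaluates the Durbin-Watson statistic, the sign and run counts, and the AIC and SC criteria from a vector of weighted residuals.

```python
import numpy as np

def residual_stats(r, wssq, k):
    """Durbin-Watson, sign/run counts, AIC and SC from weighted residuals r."""
    r = np.asarray(r, dtype=float)
    N = r.size
    dw = np.sum(np.diff(r)**2) / np.sum(r**2)      # expected value ~2.0
    m = int(np.sum(r < 0))                         # negative residuals
    n = int(np.sum(r > 0))                         # positive residuals
    runs = 1 + int(np.sum(np.sign(r[1:]) != np.sign(r[:-1])))
    aic = N * np.log(wssq / N) + 2 * k
    sc = N * np.log(wssq / N) + k * np.log(N)
    return dw, m, n, runs, aic, sc
```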
3.7.1.3 How to interpret model discrimination results

After a sequence of models has been fitted, tables like Table 3.16 are generated. First of all, note that the above model discrimination analysis is only strictly applicable for nested linear models with known error structure, and should be interpreted with restraint otherwise. Now, if WSSQ1 with m_1 parameters is the previous (possibly deficient) model, while WSSQ2 with m_2 parameters is the current (possibly superior) model, so that WSSQ1 > WSSQ2 and m_1 < m_2, then
\[
F = \frac{(WSSQ_1 - WSSQ_2)/(m_2 - m_1)}{WSSQ_2/(N - m_2)}
\]
should be F distributed (page 289) with m_2 - m_1 and N - m_2 degrees of freedom, and the F test (page 107) for excess variance can be used. Alternatively, if WSSQ2/(N - m_2) is equivalent to the true variance, i.e., model 2 is equivalent to the true model, the Mallows Cp statistic
\[
C_p = \frac{WSSQ_1}{WSSQ_2/(N - m_2)} - (N - 2m_1)
\]
can be considered. This has expectation m_1 if the previous model is sufficient, so values greater than m_1, that is C_p/m_1 > 1, indicate that the current model should be preferred over the previous one. However, graphical deconvolution, as illustrated on page 271, should always be done wherever possible, as with sums of exponentials, Michaelis-Mentens, High-Low affinity sites, sums of Gaussians or trigonometric functions, etc., before concluding that a higher order model is justified on statistical grounds.


WSSQ-previous           = 2.249E+02
WSSQ-current            = 2.440E+01
No. parameters-previous = 2
No. parameters-current  = 4
No. x-values            = 30
Akaike AIC-previous     = 6.444E+01
Akaike AIC-current      = 1.798E+00, ER = 3.998E+13
Schwarz SC-previous     = 6.724E+01
Schwarz SC-current      = 7.403E+00
Mallows Cp              = 2.137E+02, Cp/M1 = 1.069E+02
Num. deg. of freedom    = 2
Den. deg. of freedom    = 26
F test statistic (FS)   = 1.069E+02
P(F >= FS)              = 0.0000
P(F =< FS)              = 1.0000
5% upper tail point     = 3.369E+00
1% upper tail point     = 5.526E+00

Conclusion based on F test
Reject previous model at 1% significance level
There is strong support for the extra parameters
Tentatively accept the current best fit model

Table 3.16: Fitting two exponentials: 4. model discrimination statistics
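The F and Cp entries of table 3.16 follow directly from the definitions above, as this sketch confirms (SciPy supplies the F tail probability; the function name is illustrative).

```python
from scipy.stats import f as f_dist

def excess_variance_F(wssq1, wssq2, m1, m2, N):
    """F test for excess variance and Mallows Cp, as defined above."""
    FS = ((wssq1 - wssq2) / (m2 - m1)) / (wssq2 / (N - m2))
    p = f_dist.sf(FS, m2 - m1, N - m2)              # P(F >= FS)
    cp = wssq1 / (wssq2 / (N - m2)) - (N - 2 * m1)  # Mallows Cp
    return FS, p, cp

# Values from table 3.16 give FS ~ 1.069E+02 and Cp ~ 2.137E+02
print(excess_variance_F(2.249e2, 2.440e1, 2, 4, 30))
```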

3.7.2 High/low affinity sites


The general model for a mixture of n high/low affinity sites is
\[
f(x) = \frac{a_1 K_1 x}{1 + K_1 x} + \frac{a_2 K_2 x}{1 + K_2 x} + \cdots + \frac{a_n K_n x}{1 + K_n x} + C,
\]
but usually it is only possible to differentiate between the cases n = 1 and n = 2. The test files hlfit.tf1 and hlfit.tf2 are for 1 site, while hlfit.tf3 has data for 2 sites and hlfit.tf4 has the same data with added error. When fitting this model you should normalize the data if possible to zero baseline, that is f(0) = 0, or alternatively C = 0, and explore whether two independent sites give a better fit than one site. So, read hlfit.tf4 into program hlfit, ask for lowest order 1, highest order 2 and the case where C is not varied but is fixed at C = 0. The outcome is illustrated in figure 3.2 and, from the statistics, you will learn that independent low and high affinity sites are justified in this case. To interpret the parameter estimates, you take the values for K1 and K2 as estimates for the respective association constants, and a1 and a2 as the relative number of sites in the respective categories. The total number of sites is proportional to a1 + a2, and has got nothing to do with n, which is the number of distinct binding types that can be deduced from the binding data. Concentration at half saturation is also estimated by hlfit, but cooperative binding should be fitted by program sffit.

Figure 3.2: Fitting high/low affinity sites
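A hedged sketch of this model for the n = 2 case with C fixed at zero, in a form suitable for any general least squares fitter (the function name is an assumption):

```python
def high_low_sites(x, a1, K1, a2, K2):
    """Two independent binding sites with baseline C fixed at zero."""
    return a1 * K1 * x / (1 + K1 * x) + a2 * K2 * x / (1 + K2 * x)
```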


3.7.3 Cooperative ligand binding


To use sffit you should really have some idea about the total number of binding sites on the macromolecule or receptor, i.e., n, and suspect cooperative interaction between the sites, i.e., if hlfit cannot fit the data. The appropriate model for cooperative ligand binding to macromolecules is
\[
f(x) = \frac{Z(\phi_1 x + 2\phi_2 x^2 + \cdots + n\phi_n x^n)}{n(1 + \phi_1 x + \phi_2 x^2 + \cdots + \phi_n x^n)} + C,
\]
where Z is a scaling factor and C is a baseline correction. In this formulation, the φ_i are overall binding constants, but the alternative definitions for binding constants, and the convention for measuring deviations from noncooperative binding in terms of the Hessian of the binding polynomial, are in the tutorial. Test files for program sffit are sffit.tf1 and sffit.tf2 for 1 site and sffit.tf3 and sffit.tf4 for 2 sites, and note that concentration at half saturation is also estimated. Always try to normalize your data so that Z = 1 and C = 0. Such normalizing is done by a combination of you finding the normalizing parameters independently, and/or using a preliminary fit to estimate them, followed by scaling your data using editfl.

3.7.4 Michaelis-Menten kinetics


A mixture of independent isoenzymes, each separately obeying Michaelis-Menten kinetics, is modelled by
\[
v(S) = \frac{V_{max_1} S}{K_{m_1} + S} + \frac{V_{max_2} S}{K_{m_2} + S} + \cdots + \frac{V_{max_n} S}{K_{m_n} + S},
\]
and again only the cases n = 1 and n = 2 need to be considered. The appropriate test files are mmfit.tf1 and mmfit.tf2 for one enzyme, but mmfit.tf3 and mmfit.tf4 for 2 isoenzymes. Read mmfit.tf4 into mmfit and decide if 1 or 2 enzymes are needed to explain the data. There is a handy rule of thumb that can be used to decide if any two parameters are really different (page 40): if the 95% confidence limits given by SIMFIT do not overlap at all, it is very likely that the two parameters are different. Unfortunately, this test is very approximate and nothing can be said with any certainty if the confidence limits overlap. When you fit mmfit.tf4 try to decide if Vmax1 and Vmax2 are different. What about Km1 and Km2? Program mmfit also estimates concentration at half maximum response, i.e., EC50 and IC50 (page 74), but it should not be used if the data show any signs of substrate activation or substrate inhibition. For this situation you use rffit.
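A hedged sketch of the n = 1 and n = 2 cases, again for use with any general curve fitting routine (names are illustrative):

```python
def mm1(S, Vmax, Km):
    """Single enzyme obeying Michaelis-Menten kinetics."""
    return Vmax * S / (Km + S)

def mm2(S, Vmax1, Km1, Vmax2, Km2):
    """Two independent isoenzymes: a sum of two Michaelis-Menten terms."""
    return mm1(S, Vmax1, Km1) + mm1(S, Vmax2, Km2)
```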

3.7.5 Positive rational functions


Deviations from Michaelis-Menten kinetics can be fitted by the positive rational function
\[
f(x) = \frac{\alpha_0 + \alpha_1 x + \alpha_2 x^2 + \cdots + \alpha_n x^n}{1 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n},
\]
where α_i >= 0 and β_i >= 0.


In enzyme kinetics a number of special cases arise in which some of these coefficients should be set to zero. For instance, with dead-end substrate inhibition we would have α_0 = 0 and α_n = 0. The test files for rffit are all exact data, and the idea is that you would add random error to simulate experiments. For the time being we will just fit one test file, rffit.tf2, with substrate inhibition data. Input rffit.tf2 into program rffit, then request lowest degree n = 2 and highest degree n = 2 with α_0 = 0 and α_2 = 0. Note that these are called A(0) and A(N) by the program. You will get the fit shown in figure 3.3.

Figure 3.3: Fitting positive rational functions

Now you could try what happens if you fit all the test files with unrestricted



rational functions of orders 1:1, 2:2 and 3:3. Also, you could pick any of these and see what happens if random error is added. Observe that program rffit does all manner of complicated operations to find starting estimates and scale your data, but be warned: fitting positive rational functions is extremely difficult and demands specialized knowledge. Don't be surprised if program rffit finds a good fit with coefficients that bear no resemblance to the actual ones.
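As a concrete instance, the 2:2 case with α_0 = α_2 = 0 requested above reduces to the familiar dead-end substrate inhibition rate law, sketched here with illustrative names:

```python
def substrate_inhibition(x, a1, b1, b2):
    """2:2 rational function with alpha0 = alpha2 = 0:
    f(x) = a1*x / (1 + b1*x + b2*x**2)."""
    return a1 * x / (1.0 + b1 * x + b2 * x * x)
```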

3.7.6 Isotope displacement kinetics


The rational function models just discussed for binding and kinetics represent the saturation of binding sites, or flux through active sites, and special circumstances apply when there is no appreciable kinetic isotope effect. That is, the binding or kinetic transformation process is the same whether the substrate is labelled or not. This allows experiments in which labelled ligand is displaced by unlabelled ligand, or where the flux of labelled substrate is inhibited by unlabelled substrate. Since the ratios of labelled ligand to unlabelled ligand in the bound state, free state, and in the total flux are equal, a modified form of the previous equations can be used to model the binding or kinetic processes. For instance, suppose that total substrate, S say, consists of labelled substrate, [Hot] say, and unlabelled substrate, [Cold] say. Then the flux of labelled substrate will be given by
\[
\frac{d[Hot]}{dt} = \frac{V_{max_1}[Hot]}{K_{m_1} + [Hot] + [Cold]} + \frac{V_{max_2}[Hot]}{K_{m_2} + [Hot] + [Cold]} + \cdots + \frac{V_{max_n}[Hot]}{K_{m_n} + [Hot] + [Cold]}.
\]
So, if [Hot] is kept fixed and [Cold] is regarded as the independent variable, then program mmfit can be used to fit the resulting data, as shown for the test file hotcold.tf1 in figure 3.4. Actually this figure was obtained by fitting the test file using program qnfit, which allows users to specify the concentration of fixed [Hot]. It also allows users to appreciate the contribution of the individual component species to the overall sum, by plotting the deconvolution, as illustrated. Graphical deconvolution (page 271) should always be done if it is necessary to decide on the activities of kinetically distinct isoenzymes or proportions of independent High/Low affinity binding sites. Note that an important difference between using mmfit in this mode rather than in straightforward kinetic mode is that the kinetic constants are modified in the following sense: the apparent Vmax values estimated are actually the true values multiplied by the concentration of labelled substrate, while the apparent Km values estimated are the true ones plus the concentration of labelled substrate. A similar analysis is possible for program hlfit as well as for programs sffit and rffit, except that here some further algebra is required, since the models are not linear summations of 1:1 rational functions. Note that, in isotope displacement mode, concentration at half maximum response can be used as an estimate for IC50, allowing for the ratio of labelled to unlabelled ligand, if required (page 74).

Figure 3.4: Isotope displacement kinetics (data, best fit and component deconvolution, d[Hot]/dt against log10[Cold])
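A hedged sketch of the single-enzyme version of this flux, with [Hot] passed as a fixed constant in the spirit of the qnfit option just described (the Python signature is an illustration, not the qnfit interface):

```python
def hot_flux(cold, Vmax, Km, hot):
    """Flux of labelled substrate with [Hot] fixed and [Cold] varied."""
    return Vmax * hot / (Km + hot + cold)
```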


3.7.7 Nonlinear growth curves


We now continue the discussion of growth curves from page 28. Most growth data are monotonically increasing observations of size S(t) as a function of time t, from a small value of size S_0 at time t = 0 to a final asymptote S_∞ at large time. The usual reason for fitting models is to compare growth rates between different populations, or to estimate parameters, e.g., the maximum growth rate, maximum size, time to reach half maximum size, etc. The models used are mostly variants of the Von Bertalanffy allometric differential equation
\[
dS/dt = A S^\alpha - B S^\beta,
\]
which supposes that growth rate is the difference between anabolism and catabolism expressed as power functions in size. This equation defines monotonically increasing S(t) profiles and can be fitted by deqsol or qnfit, but a number of special cases leading to explicit integrals are frequently encountered. These have the benefit that parameters estimated from the data have a physical meaning, unlike fitting polynomials where the parameters have no meaning and cannot be used to estimate final size, maximum growth rate and so on. Clearly, the following models should only be fitted when data cover a sufficient time range to allow meaningful estimates for S_0 and S_∞.

1. Exponential model: dS/dt = kS, so that S(t) = A exp(kt), where A = S_0.

2. Monomolecular model: dS/dt = k(A - S), so that S(t) = A[1 - B exp(-kt)], where B = 1 - S_0/A.

3. Logistic model: dS/dt = kS(A - S)/A, so that S(t) = A/[1 + B exp(-kt)], where B = A/S_0 - 1.

4. Gompertz model: dS/dt = kS[log(A) - log(S)], so that S(t) = A exp[-B exp(-kt)], where B = log(A/S_0).

5. Von Bertalanffy 2/3 model: dS/dt = ηS^{2/3} - κS, so that S(t) = [A^{1/3} - B exp(-kt)]^3, where A^{1/3} = η/κ, B = η/κ - S_0^{1/3}, and k = κ/3.

6. Model 3 with constant: f(t) = S(t) - C, with df/dt = dS/dt = k f(t)(A - f(t))/A, so that S(t) = A/[1 + B exp(-kt)] + C.

7. Model 4 with constant: f(t) = S(t) - C, with df/dt = dS/dt = k f(t)[log(A) - log(f(t))], so that S(t) = A exp[-B exp(-kt)] + C.

8. Model 5 with constant: f(t) = S(t) - C, with df/dt = dS/dt = η f(t)^{2/3} - κ f(t), so that S(t) = [A^{1/3} - B exp(-kt)]^3 + C.

9. Richards model: dS/dt = ηS^m - κS, so that S(t) = [A^{1-m} - B exp(-kt)]^{1/(1-m)}, where A^{1-m} = η/κ, B = η/κ - S_0^{1-m}, and k = κ(1 - m); if m < 1 then η, κ, A and B are > 0, while if m > 1 then A > 0 but η, κ and B are < 0.

10. Preece and Baines model: f(t) = exp[k_0(t - θ)] + exp[k_1(t - θ)], so that S(t) = h_1 - 2(h_1 - h_θ)/f(t).

In mode 1, gcfit fits a selection of these classical growth models, estimates the maximum size, maximum and minimum growth rates, and times to half maximum response, then compares the fits.


As an example, consider table 3.17, which is an abbreviated form of the results file from fitting gcfit.tf2, as described on page 28.

Results for model 1
Parameter  Value       Std. err.  ..95% conf. lim...       p
A          1.963E-01   2.75E-02   1.40E-01    2.52E-01     0.000
k          1.840E-01   1.84E-02   1.47E-01    2.22E-01     0.000

Results for model 2
Parameter  Value       Std. err.  ..95% conf. lim...       p
A          1.328E+00   1.16E-01   1.09E+00    1.56E+00     0.000
B          9.490E-01   9.52E-03   9.30E-01    9.68E-01     0.000
k          1.700E-01   2.90E-02   1.11E-01    2.29E-01     0.000
t-half     3.768E+00   6.42E-01   2.46E+00    5.08E+00     0.000

Results for model 3
Parameter  Value       Std. err.  ..95% conf. lim...       p
A          9.989E-01   7.86E-03   9.83E-01    1.01E+00     0.000
B          9.890E+00   3.33E-01   9.21E+00    1.06E+01     0.000
k          9.881E-01   2.68E-02   9.33E-01    1.04E+00     0.000
t-half     2.319E+00   4.51E-02   2.23E+00    2.41E+00     0.000

Largest observed data size    = 1.086E+00   Theoretical asymptote  = 9.989E-01
Largest observed/Th.asymptote = 1.087E+00
Maximum observed growth rate  = 2.407E-01   Theory max. (in range) = 2.467E-01
Time when max. rate observed  = 2.000E+00   Theoretical (in range) = 2.353E+00
Minimum observed growth rate  = 4.985E-04   Theory min. (in range) = 4.985E-04
Time when min. rate observed  = 1.000E+01   Theoretical (in range) = 1.000E+01

Summary
Model  WSSQ      NDOF  WSSQ/NDOF  P(C>=W)  P(R=<r)  N>10%  N>40%  Av.r%  Verdict
1      4.72E+03  31    1.52E+02   0.000    0.000    29     17     40.03  Very bad
2      5.42E+02  30    1.81E+01   0.000    0.075    20     0      12.05  Very poor
3      3.96E+01  30    1.32E+00   0.113    0.500    0      0      3.83   Incredible

Table 3.17: Fitting nonlinear growth models

This establishes the satisfactory fit with the logistic model when compared to the exponential and monomolecular models. Figure 3.5 shows typical plots, i.e. data with best-fit curve S-hat and asymptote S-hat_∞, derivative of best-fit curve dS-hat/dt, and relative rate (1/S-hat)dS-hat/dt.

Figure 3.5: Estimating growth curve parameters (panels: data and best-fit curve; growth rate dS/dt; relative rate (1/S)dS/dt, each against time)
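To make model 3 concrete, here is a hedged sketch of the logistic growth curve together with the derived quantities reported by gcfit (the function names are illustrative):

```python
import numpy as np

def logistic_S(t, A, B, k):
    """Model 3: S(t) = A / (1 + B*exp(-k*t))."""
    return A / (1.0 + B * np.exp(-k * t))

def logistic_rate(t, A, B, k):
    """Growth rate dS/dt = k*S*(A - S)/A."""
    S = logistic_S(t, A, B, k)
    return k * S * (A - S) / A

def logistic_t_half(B, k):
    """Time to half maximum size, t = log(B)/k."""
    return np.log(B) / k

# Best-fit model 3 values from table 3.17 reproduce the t-half entry
print(logistic_t_half(9.890, 0.9881))  # ~2.319
```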

3.7.8 Nonlinear survival curves


In mode 2, gcfit fits a sequence of survival curves, and it is assumed that the data are uncorrelated estimates of fractions surviving 0 <= S(t) <= 1 as a function of time t >= 0, e.g. such as would result from using independent samples for each time point. It is important to realize that, if any censoring has taken place, the estimated fraction should be corrected for this. In other words, you start with a population of known size and, as time elapses, you estimate the fraction surviving by any sampling technique that gives estimates corrected to the original population at time zero. The test files weibull.tf1 and gompertz.tf1 contain some exact data, which you can fit to see how mode 2 works. Then you can add error to simulate reality using program adderr. Note that you prepare your own data files for mode 2 using the same format as for program makfil, making sure that the fractions are between zero and one, and that only nonnegative times are allowed. It is probably best to do unweighted regression with this sort of data (i.e. all s = 1) unless the variance of the sampling technique has been investigated independently. In survival mode the time to half maximum response is estimated with 95% confidence limits and this can be used to estimate LD50 (page 74). The survivor function is S(t) = 1 - F(t), the pdf is f(t), i.e. f(t) = -dS/dt, the hazard function is h(t) = f(t)/S(t),



and the cumulative hazard is H(t) = -log(S(t)). Plots are provided for S(t), f(t), h(t), log[h(t)] and, as in mode 1, a summary is given to help choose the best fit model from the following list, all of which decrease monotonically from S(0) = 1 to S(∞) = 0 with increasing time.

1. Exponential model: S(t) = exp(-At), f(t) = A S(t), h(t) = A.

2. Weibull model: S(t) = exp[-(At)^B], f(t) = AB[(At)^{B-1}] S(t), h(t) = AB(At)^{B-1}.

3. Gompertz model: S(t) = exp[-(B/A){exp(At) - 1}], f(t) = B exp(At) S(t), h(t) = B exp(At).

4. Log-logistic model: S(t) = 1/[1 + (At)^B], f(t) = AB(At)^{B-1}/[1 + (At)^B]^2, h(t) = AB(At)^{B-1}/[1 + (At)^B].

Note that, in modes 3 and 4, program gcfit provides options for using such survival models to analyze survival times, as described on page 176.
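For instance, the Weibull model can be coded directly, together with the half-time obtained by solving S(t) = 1/2, which gives t = log(2)^{1/B}/A (a hedged sketch with illustrative names):

```python
import numpy as np

def weibull_S(t, A, B):
    """Weibull survivor function S(t) = exp(-(A*t)**B)."""
    return np.exp(-(A * t)**B)

def weibull_h(t, A, B):
    """Weibull hazard h(t) = A*B*(A*t)**(B - 1)."""
    return A * B * (A * t)**(B - 1)

def weibull_t_half(A, B):
    """Time at which S(t) = 1/2."""
    return np.log(2.0)**(1.0 / B) / A
```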


3.7.9 Advanced curve fitting


Eventually there always comes a time when users want extra features, like the following.

a) Interactive choice of model, data sub-set or weighting scheme.
b) Choice of optimization technique.
c) Fixing some parameters while others vary in windows of restricted parameter space.
d) Supplying user-defined models with features such as special functions, functions of several variables, root finding, numerical integration, Chebyshev expansions, etc.
e) Simultaneously fitting a set of equations, possibly linked by common parameters.
f) Fitting models defined parametrically, or as functions of functions.
g) Estimating the eigenvalues and condition number of the Hessian matrix at solution points.
h) Visualizing the weighted sum of squares and its contours at solution points.
i) Inverse prediction, i.e., nonlinear calibration.
j) Estimating first and second derivatives or areas under best fit curves.
k) Supplying starting estimates added to the end of data files.
l) Selecting sets of starting estimates from parameter limits files.
m) Performing random searches of parameter space before commencing fitting.
n) Estimating weights from the data and best fit model.
o) Saving parameters for excess variance F tests in model discrimination.

Program qnfit is provided for such advanced curve fitting. The basic version of program qnfit only supports quasi-Newton optimization, but some versions allow the user to select modified Gauss-Newton or sequential quadratic programming techniques. Users must be warned that this is a very advanced piece of software and it demands a lot from users. In particular, it scales parameters but doesn't scale data. To ensure optimum operation, users should appreciate how to scale data correctly, especially with models where parameters occur exponentially. They should also understand the mathematics of the model being fitted and have good starting estimates. In expert mode, starting estimates and limits are appended to data files to facilitate exploring parameter space. Test files qnfit.tf1 (1 variable), qnfit.tf2 (2 variables) and qnfit.tf3 (3 variables) have such parameter windows. Alternatively, parameter limits files can be supplied, preferably as library files like qnfit.tfl. Fitting several equations simultaneously will now be described as an example of advanced curve fitting.
3.7.9.1 Fitting multi-function models using qnfit

Open program qnfit and select to fit n functions of one variable. Specify that three equations are required, then read in the library file line3.tfl containing three data sets and select the model file line3.mod which defines three independent straight lines. Choose unconstrained fitting and, after fitting without a random search, figure 3.6 will be obtained. In this instance three distinct lines have been fitted to three independent data sets and, since the three component submodels are uncoupled, the off-diagonal covariance matrix elements will be seen to be zero. This is because the example has been specially selected to be just about the simplest conceivable example to illustrate how to prepare a model and data sets for multi-function fitting.

3.7.10 Differential equations


Figure 3.7 illustrates how to use deqsol to fit systems of differential equations for the simple three component epidemic model
\[
\begin{aligned}
\frac{dy(1)}{dt} &= -p_1\, y(1)\, y(2) \\
\frac{dy(2)}{dt} &= p_1\, y(1)\, y(2) - p_2\, y(2) \\
\frac{dy(3)}{dt} &= p_2\, y(2)
\end{aligned}
\]


Figure 3.6: Fitting three equations simultaneously

Figure 3.7: Fitting the epidemic differential equations (left: overlay of starting estimates; right: best fit curves for the susceptible, infected and resistant populations y(1), y(2), y(3))

where y(1) are susceptible individuals, y(2) are infected, and y(3) are resistant members of the population. Fitting differential equations is a very specialized procedure and should only be undertaken by those who understand the issues involved. For example, there is a very important point to remember when using deqsol: if a system of n equations involves m parameters p_i and n initial conditions y_0(j) for purposes of simulation, there will actually be m + n parameters as far as curve fitting is concerned, as the last n parameters p_i, for i = m+1, m+2, ..., m+n, will be used for the initial conditions, which can be varied or (preferably) fixed. To show you how to practise simulating and fitting differential equations, the steps followed to create figure 3.7, and also some hints, are now given.

- Program deqsol was opened, then the epidemic model was selected from the library of three component models and simulated for the default parameters p_1, p_2 and initial conditions y_0(1), y_0(2), y_0(3) (i.e. parameters p_3, p_4, and p_5).


- The data referenced in the library file epidemic.tfl was generated using parameter values 0.004, 0.3, 980, 10 and 10, by first writing the simulated (i.e. exact) data to files y1.dat, y2.dat, and y3.dat, then adding 10% relative error using adderr to save perturbed data as y1.err, y2.err, and y3.err. Program maklib was then used to create the library file epidemic.tfl, which just has a title followed by the three file names. (A simulation sketch follows this list.)

- Curve fitting was then selected in deqsol, the default equations were integrated, the library file was input and the current default differential equations were overlayed on the data to create the left hand plot. You should always overlay the starting solution over the data before fitting to make sure that good starting estimates are being used.

- By choosing direct curve fitting, the best fit parameters and initial conditions were estimated. It is also possible to request random starts, when random starting estimates are selected in sequence and the results are logged for comparison, but this facility is only provided for expert users, or for systems where small alterations in starting values can lead to large changes in the solutions.

- After curve fitting, the best fit curves shown in the right hand plot were obtained. In this extremely simple example the initial conditions were also estimated along with the two kinetic parameters. However, if at all possible, the initial conditions should be input as fixed parameters (by setting the lower limit, starting value and upper limit to the same value), as solutions always advance from the starting estimates, so generating a type of autocorrelation error. Note that, in this example, the differential equations add to zero, indicating the conservation equation y(1) + y(2) + y(3) = k for some constant k. This could have been used to eliminate one of the y(i), leading to a reduced set of equations. However, the fact that the differentials add to zero guarantees conservation when the full set is used, so the system is properly specified and not overdetermined, and it is immaterial whether the full set or reduced set is used.

- Note that, when the default graphs are transferred to simplot for advanced editing, such as changing line and symbol types, the data and best fit curves are transferred alternately as data/best-fit pairs, not all data then all best-fit, or vice versa.

- For situations where there is no experimental data set for one or more of the components, a percentage sign % can be used in the library file to indicate a missing component. The curve fitting procedure will then just ignore this particular component when calculating the objective function.

- Where components measured experimentally are linear combinations of components of the system of differential equations, a transformation matrix can be supplied as described in the readme files.

- As the covariance matrix has to be estimated iteratively, the default setting is to calculate parameters without parameter error estimates. The extra, time-consuming, step of calculating the variance-covariance matrix can be selected as an extra feature where this is required.

- When requesting residuals and goodness of fit analysis for any given component y(i) you must provide the number of parameters estimated for that particular component, to correct the degrees of freedom for an individual fit, as opposed to the overall fit.
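For readers without deqsol, the simulation step can be sketched with SciPy as a stand-in, using the parameter values quoted above (p1 = 0.004, p2 = 0.3, initial conditions 980, 10, 10):

```python
import numpy as np
from scipy.integrate import solve_ivp

def epidemic(t, y, p1, p2):
    """dy1/dt = -p1*y1*y2, dy2/dt = p1*y1*y2 - p2*y2, dy3/dt = p2*y2."""
    y1, y2, y3 = y
    return [-p1 * y1 * y2, p1 * y1 * y2 - p2 * y2, p2 * y2]

sol = solve_ivp(epidemic, (0.0, 10.0), [980.0, 10.0, 10.0],
                args=(0.004, 0.3), max_step=0.1)
print(sol.y[:, -1])  # susceptible, infected, resistant at t = 10
```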


3.8 Calibration and Bioassay


Calibration and bioassay are defined in SIMFIT as follows.

Calibration
This requires fitting a curve y = f(x) to a (x, y) training data set with x known exactly and y measured with limited error, so that the best fit model f-hat(x) can then be used to predict x_i given arbitrary y_i. Usually the model is of no significance and steps are taken to use a data range over which the model is approximately linear, or at worst a shallow smooth curve. It is assumed that experimental errors arising when constructing the best fit curve are uncorrelated and normally distributed with zero mean, so that the standard curve is a good approximation to the maximum likelihood estimate.

Bioassay
This is a special type of calibration, where the data are obtained over as wide a range as possible, nonlinearity is accepted (e.g. a sigmoid curve), and specific parameters of the underlying response, such as the time to half-maximum response, final size, maximum rate, area AUC, EC50, LD50, or IC50, are to be estimated. With bioassay, a known deterministic model may be required, and assuming normally distributed errors may sometimes be a reasonable assumption, but alternatively the data may consist of proportions in one of two categories (e.g. alive or dead) as a function of some treatment, so that binomial error is more appropriate and probit analysis, or similar, is called for.

3.8.1 Calibration curves


Creating and using a standard calibration curve involves:

1. Measuring responses y_i at fixed values of x_i, and using replicates to estimate s_i, the sample standard deviation of y_i, if possible.

2. Preparing a curve fitting type file with x, y, and s using program makfil, and using makmat to prepare a vector type data file with x_i values to predict y_i.

3. Finding a best fit curve y = f(x) to minimize WSSQ, the sum of weighted squared residuals.

4. Supplying y_i values and predicting x_i together with 95% confidence limits, i.e. inverse-prediction of x_i = f-hat^{-1}(y_i). Sometimes you may also need to evaluate y_i = f-hat(x_i).

It may be that the s_i are known independently but, often, they are supposed constant and unweighted regression, i.e. all s_i = 1, is unjustifiably used. Any deterministic model can be used for f(x), e.g., a sum of logistics or Michaelis-Menten functions using program qnfit, but this could be unwise. Calibration curves arise from the operation of numerous effects and cannot usually be described by one simple equation. Use of such equations can lead to biased predictions and is not always recommended. Polynomials are useful for gentle curves as long as the degree is reasonably low (<= 3 ?) but, for many purposes, a weighted least squares data smoothing cubic spline is the best choice. Unfortunately polynomials and splines are too flexible and follow outliers, leading to oscillating curves, rather than the data smoothing that is really required. Also they cannot fit horizontal asymptotes. You can help in several ways.

a) Get good data with more distinct x-values rather than extra replicates.

b) If the data approach horizontal asymptotes, either leave some data out as they are no use for prediction anyway, or try using log(x) rather than x, which can be done automatically by program calcurve.

c) Experiment with the weighting schemes, polynomial degrees, spline knots or constraints to find the optimum combinations for your problem.

d) Remember that predicted confidence limits also depend on the s values you supply, so either get the weighting scheme right, or set all s_i = 1.


3.8.1.1 Turning points in calibration curves

You will be warned if f(x) has a turning point, since this can make inverse prediction ambiguous. You can then re-fit to get a new curve, eliminate bad data points, get new data, etc., or carry on if the feature seems to be harmless. You will be given the option of searching upwards or downwards for prediction in such ambiguous cases. It should be obvious from the graph, the nature of the mathematical function fitted, or the position of the turning point in which direction the search should proceed.
3.8.1.2 Calibration using linfit and polnom

For linear or almost linear data use program linfit, which just fits straight lines of the form f(x) = p_0 + p_1 x, but for smooth gentle curves program polnom can fit a polynomial
\[
f(x) = p_0 + p_1 x + p_2 x^2 + \cdots + p_n x^n,
\]
where the degree n is chosen according to statistical principles. polnom fits all polynomials up to degree 6 and gives the statistics necessary to choose n but, in the case of calibration curves, it is not advisable to use a value of n greater than 2 or at most 3. To practise, read test file line.tf1 into linfit or polnom and create the calibration curve shown in figure 3.8. Now predict x from y values, say for instance using polnom.tf3, or the values 2, 4, 6, 8, 10.

Figure 3.8: A linear calibration curve (best fit line and 95% limits)
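A hedged sketch of the same workflow with NumPy (the data here are invented for illustration and are not a SIMFIT test file): fit a low-degree polynomial and invert it numerically to predict x from a measured y.

```python
import numpy as np

# Invented calibration data for illustration only
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8, 11.1])

coeffs = np.polyfit(x, y, 1)  # degree 1; try 2 or 3 for gentle curves

def predict_x(y_new, coeffs, lo, hi):
    """Inverse prediction: solve f(x) = y_new for x within [lo, hi]."""
    p = np.array(coeffs, dtype=float)
    p[-1] -= y_new                       # roots of f(x) - y_new = 0
    roots = np.roots(p)
    real = roots[np.isreal(roots)].real
    return real[(real >= lo) & (real <= hi)]

print(predict_x(6.0, coeffs, x.min(), x.max()))
```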
3.8.1.3 Calibration using calcurve

If a polynomial of degree 2 or at most 3 is not adequate, a cubic spline calibration curve could be considered. It does not matter how nonlinear your data are, calcurve can fit them with splines with user-defined fixed knots as described on page 214. The program has such a vast number of options that a special mode of operation is allowed, called expert mode, where all decisions as to weighting, spline knots, transformations, etc. are added to the data file. The advantage of this is that, once a standard curve has been created, it can be reproduced exactly by reading in the standard curve data file. To practise, read in calcurve.tf1 and use expert mode to get figure 3.9. Now do inverse prediction using calcurve.tf3 and browse calcurve.tf1 to understand expert mode.
Figure 3.9: A cubic spline calibration curve (log(y) against x)

3.8.1.4 Calibration using qnfit


Sometimes you would want to use a specific mathematical model for calibration. For instance, a mixture of two High/Low affinity binding sites or a cooperative binding model might be required for a saturation curve, or a mixture of two logistics might adequately fit growth data. If you know an appropriate model for the standard curve, use qnfit for inverse prediction because, after fitting, the best-fit curve can be used for calibration, or for estimating derivatives or areas under curves AUC if appropriate.


3.8.2 Dose response curves, EC50, IC50, ED50, and LD50


A special type of inverse prediction is required when equations are fitted to dose response data in order to estimate some characteristic parameter, such as the half time t_{1/2}, the area under the curve AUC, or the median effective dose in bioassay (e.g. ED50, EC50, IC50, LD50, etc.), along with standard errors and 95% confidence limits. The model equations used in this sort of analysis are not supposed to be exact models constructed according to scientific laws, rather they are empirical equations, selected to have a shape that is close to the shape expected of such data sets. So, while it is pedantic to insist on using a model based on scientific model building, it is important to select a model that fits closely over a wide variety of conditions. Older techniques, such as using data subjected to a logarithmic transform in order to fit a linear model, are no longer called for as they are very unreliable, leading to biased parameter estimates. Hence, in what follows, it is assumed that data are to be analyzed in standard, not logarithmically transformed, coordinates, but there is nothing to prevent data being plotted in transformed space after analysis, as is frequently done when the independent variable is a concentration, i.e., it is desired to have the independent variable proportional to chemical potential. The type of analysis called for depends very much on the nature of the data, the error distribution involved, and the goodness of fit of the assumed model. It is essential that data are obtained over a wide range, and that the best fit curves are plotted and seen to be free from bias which could seriously degrade routine estimates of percentiles, say. The only way to decide which of the following procedures should be selected for your data is to analyze the data using those candidate models that are possibilities, and then to adopt the model that seems to perform best, i.e., gives the closest best fit curves and most sensible inverse predictions.

- Exponential models. If the data are in the form of a simple or multiphasic exponential decline from a finite value at t = 0 to zero as t → ∞, and half times t_{1/2}, or areas AUC, are required, use exfit (page 59) to fit one or a sum of two exponentials with no constant term. Practise with exfit and test file exfit.tf4. With the simple model
\[
f(t) = A \exp(-kt)
\]
of order 1, then the AUC = A/k and t_{1/2} = log(2)/k are given explicitly but, if this model does not fit and a higher model has to be used, then the corresponding parameters will be estimated numerically.

- Trapezoidal estimation. If no deterministic model can be used for the AUC it is usual to prefer the trapezoidal method with no data smoothing, where replicates are simply replaced by mean values that are then joined up sequentially by sectional straight lines. The program average (page 212) is well suited to this sort of analysis.

- The Hill equation. This empirical equation is
\[
f(x) = \frac{A x^n}{B^n + x^n},
\]
which can be fitted using program inrate (page 208), with either n estimated or n fixed, and it is often used in sigmoidal form (i.e. n > 1) to estimate the maximum value A and half saturation point B, with sigmoidal data (not data that are only sigmoidal when x-semilog transformed, as all binding isotherms are sigmoidal in x-semilog space).

- Ligand binding and enzyme kinetic models. There are three cases: a) data are increasing as a function of an effector, i.e., ligand or substrate, and the median effective ligand concentration ED50 or apparent Km = EC50 = ED50 is required; b) data are a decreasing function of an inhibitor [I] at fixed substrate concentration [S] and IC50, the concentration of inhibitor giving half maximal inhibition, is required; or c) the flux of labelled substrate [Hot], say, is measured as a decreasing function of unlabelled isotope [Cold], say, with [Hot] held fixed.


If the data are for an increasing saturation curve and ligand binding models are required, then hlfit (page 63) or, if cooperative effects are present, sffit (page 64) can be used to fit one or two binding site models. Practise with sffit and sffit.tf4. More often, however, an enzyme kinetic model, such as the Michaelis-Menten equation, will be used as now described. To estimate the maximum rate and apparent Km, i.e., EC50, the equation fitted by mmfit in substrate mode would be
\[
v([S]) = \frac{V_{max}[S]}{K_m + [S]},
\]
while the interpretation of IC50 for a reversible inhibitor at concentration [I] with substrate fixed at concentration [S] would depend on the model assumed, as follows.
\[
\begin{aligned}
\text{Competitive inhibition:}\quad & v([I]) = \frac{V_{max}[S]}{K_m(1 + [I]/K_i) + [S]}, & IC50 = \frac{K_i(K_m + [S])}{K_m}\\
\text{Uncompetitive inhibition:}\quad & v([I]) = \frac{V_{max}[S]}{K_m + [S](1 + [I]/K_i)}, & IC50 = \frac{K_i(K_m + [S])}{[S]}\\
\text{Noncompetitive inhibition:}\quad & v([I]) = \frac{V_{max}[S]}{(1 + [I]/K_i)(K_m + [S])}, & IC50 = K_i\\
\text{Mixed inhibition:}\quad & v([I]) = \frac{V_{max}[S]}{K_m(1 + [I]/K_{i1}) + [S](1 + [I]/K_{i2})}, & IC50 = \frac{K_{i1}K_{i2}(K_m + [S])}{K_m K_{i2} + [S]K_{i1}}\\
\text{Isotope displacement:}\quad & v([Cold]) = \frac{V_{max}[Hot]}{K_m + [Hot] + [Cold]}, & IC50 = K_m + [Hot]
\end{aligned}
\]
Of course, only two independent parameters can be estimated with these models and, if higher order models are required and justified by statistics and graphical deconvolution, the apparent Vmax and apparent Km are then estimated numerically.

- Growth curves. If the data are in the form of sigmoidal increase, and maximum size, maximum growth rate, minimum growth rate, t_{1/2} time to half maximum size, etc. are required, then use gcfit in growth curve mode 1 (page 66). Practise with test file gcfit.tf2 to see how a best-fit model is selected. For instance, with the logistic model
\[
f(t) = \frac{A}{1 + B\exp(-kt)}, \qquad t_{1/2} = \frac{\log(B)}{k},
\]
the maximum size A and time to reach half maximal size t_{1/2} are estimated.

- Survival curves. If the data are independent estimates of fractions remaining as a function of time or some effector, i.e. sigmoidally decreasing profiles fitted by gcfit in mode 2, and t_{1/2} is required, then normalize the data to proportions of time zero values and use gcfit in survival curve mode 2 (page 67). Practise with weibull.tf1, which has the model equation
\[
S(t) = \exp(-(At)^B), \qquad t_{1/2} = \frac{\log(2)^{1/B}}{A}.
\]

- Survival time models. If the data are in the form of times to failure, possibly censored, then gcfit should be used in survival time mode 3 (page 176). Practise with test file survive.tf2. With the previous survival curve and with survival time models the median survival time t_{1/2} is estimated, where
\[
\int_0^{t_{1/2}} f_T(t)\, dt = \frac{1}{2},
\]
and f_T(t) is the survival probability density function.

- Models for proportions. If the data are in the form of numbers of successes (or failures) in groups of known size as a function of some control variable and you wish to estimate percentiles, e.g., EC50, IC50, or maybe LD50 (the median dose for survival in toxicity tests), use gcfit in GLM dose response mode. This is because the error distribution is binomial, so generalized linear models, as discussed on page 50, should be used. You should practise fitting the test file ld50.tf1 with the logistic, probit and log-log models, observing the goodness of fit options and the ability to change the percentile level interactively. An example of how to use this technique follows.

Figure 3.10: Plotting LD50 data with error bars (left: proportion failing against concentration; right: proportion surviving against concentration)

Figure 3.10 illustrates the determination of LD50 using GLM. The left hand figure shows the results from using the probit model to determine LD50 using test file ld50.tf2. The right hand figure shows exactly the same analysis but carried out using the proportion surviving, i.e., the complement of the numbers in test file ld50.tf2, replacing y, the number failing (dying) in a sample of size N, by N - y, the number succeeding (surviving) in a sample of size N. Of course the value of the LD50 estimate and the associated standard error are identical for both data sets. Note that, in GLM analysis, the percentile can be changed interactively, e.g., if you need to estimate LD25 or LD75, etc.

The left hand figure was created as follows.

a) After fitting, the data and best fit curve were transferred into simplot using the [Advanced] option.

b) The horizontal line was added interactively by using the [Data] option to add data for y = 0.5.


The right hand figure was created as follows.

1) After fitting, the best fit curve was added to the project archive using the [Advanced] option to save an ASCII text coordinate file.

2) The data were input into the analysis of proportions procedure described on page 125, and the error bar plot was created.

3) The error bar data were transferred to simplot using the [Advanced] option, then the saved ASCII text coordinate data for the best fit curve and the line at y = 0.5 were added interactively using the [Data] option.

The point about using the analysis of proportions routines in this way for the error bars in the right hand figure is that exact, unsymmetrical 95% confidence limits can be generated from the sample sizes and numbers of successes in this way.
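The same LD50 calculation can be sketched outside SIMFIT with statsmodels (the data below are invented for illustration and are not the contents of ld50.tf2): fit a probit GLM to grouped proportions, then invert the linear predictor at the 50% point.

```python
import numpy as np
import statsmodels.api as sm

# Invented grouped dose-response data: dose, number dead, group size
dose = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
dead = np.array([2, 8, 15, 23, 27, 29])
size = np.full(dose.size, 30)

X = sm.add_constant(dose)
probit = sm.families.Binomial(link=sm.families.links.Probit())
fit = sm.GLM(np.column_stack([dead, size - dead]), X, family=probit).fit()

b0, b1 = fit.params
print("LD50 =", -b0 / b1)  # probit(0.5) = 0, so LD50 solves b0 + b1*x = 0
```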

3.8.3 95% confidence regions in inverse prediction


polnom estimates non-symmetrical confidence limits assuming that the N values of y for inverse prediction and the weights supplied for weighting are exact, and that the model fitted has n parameters that are justified statistically. calcurve uses the weights supplied, or the estimated coefficient of variation, to fit confidence envelope splines either side of the best fit spline, by employing an empirical technique developed by simulation studies. Root finding is employed to locate the intersection of the y_i supplied with the envelopes. The AUC, LD50, half-saturation, asymptote and other inverse predictions in SIMFIT use a t distribution with N - n degrees of freedom, and the variance-covariance matrix estimated from the regression. That is, assuming a prediction parameter defined by p = f(θ_1, θ_2, ..., θ_n), a central 95% confidence region is constructed using the prediction parameter variance estimated by
\[
\hat{V}(p) = \sum_{i=1}^{n} \left(\frac{\partial f}{\partial \theta_i}\right)^2 \hat{V}(\theta_i)
+ 2 \sum_{i=2}^{n} \sum_{j=1}^{i-1} \frac{\partial f}{\partial \theta_i}\frac{\partial f}{\partial \theta_j}\, \widehat{CV}(\theta_i, \theta_j).
\]
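The double sum above is just the quadratic form g'Cg, with g the gradient of f and C the covariance matrix; this generalizes the earlier AUC example. A hedged sketch using a central difference gradient (all names are illustrative):

```python
import numpy as np

def prediction_variance(f, theta, cov, h=1e-6):
    """Estimate V(p) for p = f(theta) given the covariance matrix cov."""
    theta = np.asarray(theta, dtype=float)
    g = np.empty_like(theta)
    for i in range(theta.size):          # numerical df/dtheta_i
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2.0 * h)
    return float(g @ cov @ g)            # equals the double-sum formula
```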


3.9 Statistics
The main part of the SIMFIT statistics functions is to be found in the program simstat, which is in many ways like a small scale statistics package. This provides options for data exploration, statistical tests, analysis of variance, multivariate analysis, regression, time series, power calculations, etc., as well as a number of calculations, like finding zeros of polynomials, or values of determinants, inverses, eigenvalues or eigenvectors of matrices. In addition to simstat there are also several specialized programs that can be used for more detailed work and to obtain information about dedicated statistical distributions and related tests but, before describing the simstat procedures with worked examples, a few comments about tests may be helpful.

3.9.1 Tests
A test statistic is a function evaluated on a data set, and the significance level of a test is the probability of obtaining a test statistic as extreme, or more extreme, from a random sample, given a null hypothesis H_0, which usually specifies a distribution for that test statistic. If the error rate, i.e. significance level p, is less than some critical level, say α = 0.05 or α = 0.01, it is reasonable to consider whether the null hypothesis should be rejected. The correct procedure is to choose a test, decide whether to use the upper, lower, or two-tail test statistic as appropriate, select the critical significance level, do the test, then accept the outcome. What is not valid is to try several tests until you find one that gives you the result you want. That is because the probability of a Type 1 error increases monotonically as the number of tests increases, particularly if the tests are on the same data set, or some subsets of a larger data set. This multiple testing should never be done, but everybody seems to do it. Of course, all bets are off anyway if the sample does not conform to the assumptions implied by H_0, for instance, doing a t test with two samples that are known not to be normally distributed with the same variance.

3.9.2 Multiple tests


Statistical packages are designed to be used in the rather pedantic but correct manner just described, which makes them rather inconvenient for data exploration. SIMFIT, on the other hand, is biased towards data exploration, so that various types of multiple testing can be done. However, once the phase of data exploration is completed, there is nothing to stop you making the necessary decisions and only using the subset of results calculated by the SIMFIT statistical programs, as in the classical (correct) manner. Take, for example, the t test. SIMFIT does a test for normality and variance equality on the two samples supplied, it reports lower, upper and two tail test statistics and p values simultaneously, it performs a corrected test for the case of unequal variances at the same time, it allows you to follow the t test by a paired t test if the sample sizes are equal and, after doing the t test, it saves the data for a Mann-Whitney U or Kolmogorov-Smirnov 2-sample test on request. An even more extreme example is the all possible pairwise comparisons option, which does all possible t, Mann-Whitney U and Kolmogorov-Smirnov 2-sample tests on a library file of column vectors. In fact there are two ways to view this type of multiple testing. If you are just doing data exploration to identify possible differences between samples, you can just regard the p values as a measure of the differences between pairs of samples, in that small p values indicate samples which seem to have different distributions. In this case you would attach no importance as to whether the p values are less than any supposed critical values. On the other hand, if you are trying to identify samples that differ significantly, then some technique is required to structure the multiple testing procedure and/or alter the significance level, as in the Tukey Q test. If the experimentwise error rate is α_e while the comparisonwise error rate is α_c and there are k comparisons then, from equating the probability of k tests with no Type 1 errors, it follows that
\[
1 - \alpha_e = (1 - \alpha_c)^k.
\]
This is known as the Dunn-Sidak correction but, alternatively, the Bonferroni correction is based on the recommendation that, for k tests, the error rate should be decreased from α to α/k, which gives a similar value to use for α_c in the multiple test, given α_e.
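Both corrections are one-liners; this hedged sketch computes the comparisonwise rate α_c for a given experimentwise rate α_e and k comparisons:

```python
def dunn_sidak(alpha_e, k):
    """Comparisonwise rate from 1 - alpha_e = (1 - alpha_c)**k."""
    return 1.0 - (1.0 - alpha_e) ** (1.0 / k)

def bonferroni(alpha, k):
    """Bonferroni recommendation: decrease the rate from alpha to alpha/k."""
    return alpha / k

print(dunn_sidak(0.05, 10), bonferroni(0.05, 10))  # ~0.00512 and 0.005
```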


3.9.3 Data exploration


SIMFIT has a number of techniques that are appropriate for exploration of data and data mining. Such techniques do not always lead to meaningful hypothesis tests, but are best used for preliminary investigation of data sets prior to more specific model building.
3.9.3.1 Exhaustive analysis: arbitrary vector

This procedure is used when you have a single sample (column vector) and wish to explore the overall statistical properties of the data. For example, read in the vector test file normal.tf1 and you will see that all the usual summary statistics are calculated as in Table 3.18, including the range, hinges (i.e. quartiles), mean \bar{x}, standard deviation s, and the normalized sample moments s_3 (coefficient of skewness) and s_4 (coefficient of kurtosis), defined in a sample of size n by

\bar{x} = \sum_{i=1}^{n} x_i / n

s = \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) }

s_3 = \sum_{i=1}^{n} (x_i - \bar{x})^3 / [(n - 1) s^3]

s_4 = \sum_{i=1}^{n} (x_i - \bar{x})^4 / [(n - 1) s^4] - 3.

Data: 50 numbers from a normal distribution mu = 0 and sigma = 1
Sample size                 50
Minimum, Maximum values     -2.208E+00, 1.617E+00
Lower and Upper Hinges      -8.550E-01, 7.860E-01
Coefficient of skewness     -1.602E-02
Coefficient of kurtosis     -8.551E-01
Median value                -9.736E-02
Sample mean                 -2.579E-02
Sample standard deviation   1.006E+00: CV% = 3.899E+03%
Standard error of the mean  1.422E-01
Upper 2.5% t-value          2.010E+00
Lower 95% con lim for mean  -3.116E-01
Upper 95% con lim for mean  2.600E-01
Variance of the sample      1.011E+00
Lower 95% con lim for var.  7.055E-01
Upper 95% con lim for var.  1.570E+00
Shapiro-Wilks W statistic   9.627E-01
Significance level for W    0.1153   Tentatively accept normality

Table 3.18: Exhaustive analysis of an arbitrary vector
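The moment definitions above translate directly into code. A minimal NumPy sketch (not SIMFIT's own implementation) that reproduces the skewness and kurtosis coefficients of Table 3.18:

    import numpy as np

    def vector_summary(x):
        """Mean, standard deviation, skewness s3 and kurtosis s4 as defined above."""
        x = np.asarray(x, dtype=float)
        n = x.size
        xbar = x.sum() / n
        s = np.sqrt(np.sum((x - xbar) ** 2) / (n - 1))
        s3 = np.sum((x - xbar) ** 3) / ((n - 1) * s ** 3)
        s4 = np.sum((x - xbar) ** 4) / ((n - 1) * s ** 4) - 3.0
        return xbar, s, s3, s4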

You can then do a Shapiro-Wilks test for normality (which will, of course, not always be appropriate) or create a histogram, pie chart, cumulative distribution plot or appropriate curve-fitting files. This option is a very valuable way to explore any single sample before considering other tests. If you created files vector.1st and vector.2nd as recommended earlier you can now examine these. Note that once a sample has been read into program simstat it is saved as the current sample for editing, transforming or re-testing. Since vectors have only one coordinate, graphical display requires a further coordinate. In the case of histograms the extra coordinate is provided by the choice of bins, which dictates the shape, but in the case of cumulative distributions it is automatically created as steps and therefore of unique shape. Pie chart segments are calculated in


proportion to the sample values, which means that this is only appropriate for positive samples, e.g., counts. The other techniques illustrated in figure 3.11 require further explanation.

[Figure 3.11: Plotting vectors. Four panels: Vector Plotted as a Time Series and Vector Plotted as Zero Centred Rods (Values against Position), Vector Plotted in Half Normal Format (Ordered Absolute Values against Expected Half-Normal Order Statistic), and Vector Plotted in Normal Format (Ordered Values against Expected Normal Order Statistic).]
If the sample values have been measured in some sequence of time or space, then the y values could be the sample values while the x values would be successive integers, as in the time series plot. Sometimes it is useful to see the variation in the sample with respect to some fixed reference value, as in the zero centred rods plot. The data can be centred automatically about zero by subtracting the sample mean if this is required. The half normal and normal plots are particularly useful when testing for a normal distribution with residuals, which should be approximately normally distributed if the correct model is fitted. In the half normal plot, the absolute values of a sample of size n are first ordered then plotted as y_i, i = 1, \ldots, n, while the half normal order statistics are approximated by

x_i = \Phi^{-1}\left( \frac{n + i + \frac{1}{2}}{2n + \frac{9}{8}} \right), \quad i = 1, \ldots, n,

which is valuable for detecting outliers in regression. The normal scores plot simply uses the ordered sample as y and the normal order statistics are approximated by

x_i = \Phi^{-1}\left( \frac{i - \frac{3}{8}}{n + \frac{1}{4}} \right), \quad i = 1, \ldots, n,

which makes it easy to visualize departures from normality. Best fit lines, correlation coefficients, and significance values are also calculated for half normal and normal plots. Note that a more accurate calculation for expected values of normal order statistics is employed when the Shapiro-Wilks test for normality (page 90) is used and a normal scores plot is required.
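The two approximations are easy to evaluate with the standard normal quantile function. A sketch using SciPy's norm.ppf for \Phi^{-1} (an illustration of the formulas above, not SIMFIT's more accurate calculation):

    import numpy as np
    from scipy.stats import norm

    def half_normal_scores(n):
        i = np.arange(1, n + 1)
        return norm.ppf((n + i + 0.5) / (2 * n + 9.0 / 8.0))

    def normal_scores(n):
        i = np.arange(1, n + 1)
        return norm.ppf((i - 3.0 / 8.0) / (n + 0.25))

Plotting the ordered (absolute) sample values against these scores gives the half normal and normal format plots of figure 3.11.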


3.9.3.2 Exhaustive analysis: arbitrary matrix

This procedure is provided for when you have recorded several variables (columns) with multiple cases (rows) and therefore have data in the form of a rectangular matrix, as with Table 3.19 resulting from analyzing matrix.tf2. The option is used when you want summary statistics for a numerical matrix with no missing values.

Data: Matrix of order 7 by 5
Row  Mean        Variance    St.Dev.     Coeff.Var.
1    3.6800E+00  8.1970E+00  2.8630E+00  77.80%
2    6.8040E+00  9.5905E+00  3.0969E+00  45.52%
3    6.2460E+00  3.5253E+00  1.8776E+00  30.06%
4    4.5460E+00  7.1105E+00  2.6666E+00  58.66%
5    5.7840E+00  5.7305E+00  2.3939E+00  41.39%
6    4.8220E+00  6.7613E+00  2.6003E+00  53.92%
7    5.9400E+00  1.7436E+00  1.3205E+00  22.23%

Table 3.19: Exhaustive analysis of an arbitrary matrix

It analyzes every row and column in the matrix then, on request, exhaustive analysis of any chosen row or column can be performed, as in exhaustive analysis of a vector. Often the rows or columns of a data matrix have pairwise meaning. For instance, two columns may be measurements from two populations where it is of interest if the populations have the same means. If the populations are normally distributed with the same variance, then an unpaired t test might be appropriate (page 91), otherwise the corresponding nonparametric test (Mann-Whitney U, page 96), or possibly a Kolmogorov-Smirnov 2-sample test (page 94) might be better. Again, two columns might be paired, as with measurements before and after treatment on the same subjects. Here, if normality of differences is reasonable, a paired t test (page 93) might be called for, otherwise the corresponding nonparametric procedure (Wilcoxon signed rank test, page 97), or possibly a run test (page 105), or a sign test (page 104) might be useful for testing for absence of treatment effect. Table 3.20 illustrates the option to do statistics on paired rows or columns, in this case columns 1 and 2 of matrix.tf2. You identify two rows or columns from the matrix then simple plots, linear regression, correlation, and chosen statistical tests can be done. Note that all the p values calculated for this procedure are for two-tail tests, while the run, Wilcoxon sign rank, and sign tests ignore values which are identical in the two columns. More detailed tests can be done on the selected column vectors by the comprehensive statistical test options to be discussed subsequently (page 88). The comprehensive analysis of a matrix procedure also allows for the data matrix to be plotted as a 2-dimensional bar chart, assuming that the rows are cases and the columns are numbers in distinct categories, or as a 3-dimensional bar chart assuming that all cell entries are as in a contingency table, or similar. Alternatively, plots displaying the columns as scattergrams, box and whisker plots, or bar charts with error bars can be constructed.
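A battery of paired-column tests like that of Table 3.20 can be reproduced outside SIMFIT with SciPy; a sketch follows (the file name "matrix.txt", a plain-text copy of the data matrix without SIMFIT's header lines, is an assumption):

    import numpy as np
    from scipy import stats

    A = np.loadtxt("matrix.txt")   # hypothetical plain-text copy of matrix.tf2
    x, y = A[:, 0], A[:, 1]        # columns 1 and 2
    print(stats.ttest_ind(x, y))   # unpaired t test
    print(stats.ttest_rel(x, y))   # paired t test
    print(stats.ks_2samp(x, y))    # Kolmogorov-Smirnov 2-sample test
    print(stats.mannwhitneyu(x, y, alternative="two-sided"))
    print(stats.wilcoxon(x, y))    # Wilcoxon signed rank test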
3.9.3.3 Exhaustive analysis: multivariate normal matrix

This provides options that are useful before proceeding to more specific techniques that depend on multivariate normality (page 287), e.g., MANOVA and some types of ANOVA. A graphical technique is provided for investigating if a data matrix with n rows and m columns, where n >> m > 1, is consistent with a multivariate normal distribution. For example, figure 3.12 shows plots for two random samples from a multivariate normal distribution. The plot uses the fact that, for a multivariate normal distribution with sample mean \bar{x} and sample covariance matrix S,

(x - \bar{x})^T S^{-1} (x - \bar{x}) \sim \frac{m(n^2 - 1)}{n(n - m)} F_{m, n-m},

where x is a further independent observation from this population, so that the transforms plotted against the quantiles of an F distribution with m and n - m degrees of freedom, i.e. according to the cumulative probabilities (i - 0.5)/n for i = 1, 2, \ldots, n, should be a straight line.


Unpaired t test:                   t = -3.094E+00, p = 0.0093  *p =< 0.01
Paired t test:                     t = -3.978E+00, p = 0.0073  *p =< 0.01
Kolmogorov-Smirnov 2-sample test:  d = 7.143E-01, z = 3.818E-01, p = 0.0082  *p =< 0.01
Mann-Whitney U test:               u = 7.000E+00, z = -2.172E+00, p = 0.0262  *p =< 0.05
Wilcoxon signed rank test:         w = 1.000E+00, z = -2.113E+00, p = 0.0313  *p =< 0.05
Run test:   + = 1 (number of x > y), - = 6 (number of x < y), p = 0.2857
Sign test:  N = 7 (non-tied pairs),  - = 6 (number of x < y), p = 0.1250

Table 3.20: Statistics on paired columns of a matrix

[Figure 3.12: Plot to diagnose multivariate normality. Two panels of Ranked Transforms against F-quantiles: n = 8, m = 4 (r = 0.754) and n = 20, m = 4 (r = 0.980).]

It can be seen from figure 3.12 that this plot is of little value for small values of n, say n \approx 2m, but becomes progressively more useful as the sample size increases, say n > 5m.

Again, there are procedures to calculate the column means \bar{x}_j, and the m by m sample covariance matrix S,


defined for an n by m data matrix x_{ij} with n \ge 2, m \ge 2 as

\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}

s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)

and then exploit several techniques which use these estimates for the population mean vector and covariance matrix. The eigenvalues and determinants of the sample covariance matrix and its inverse are required for several MANOVA techniques, so these can also be estimated. It is possible to perform two variants of the Hotelling T^2 test, namely testing for equality of the mean vector with a specified reference vector of means, or testing for equality of all means without specifying a reference mean.

Dealing first with testing that a vector of sample means is consistent with a reference vector, table 3.21 resulted when the test file hotel.tf1 was analyzed using the Hotelling one sample test procedure.

Hotelling one sample T-square test
H0: Delta = (Mean - Expected) are all zero

No. rows = 10, No. columns = 4
Hotelling T-square = 7.439E+00
F Statistic (FTS)  = 1.240E+00
Deg. Free. (d1,d2) = 4, 6
P(F(d1,d2) >= FTS) = 0.3869

Column  Mean        Std.Err.  Expected  Delta      t          p
1       -5.300E-01  4.63E-01  0.00E+00  -5.30E-01  -1.15E+00  0.2815
2       -3.000E-02  3.86E-01  0.00E+00  -3.00E-02  -7.78E-02  0.9397
3       -5.900E-01  4.91E-01  0.00E+00  -5.90E-01  -1.20E+00  0.2601
4        3.100E+00  1.95E+00  0.00E+00   3.10E+00   1.59E+00  0.1457

Table 3.21: Hotelling T^2 test for H0: means = reference

This tests the null hypothesis H0: \mu = \mu_0 against the alternative H1: \mu \ne \mu_0, where \mu_0 is a known mean vector and no assumptions are made about the covariance matrix \Sigma. Hotelling's T^2 is

T^2 = n (\bar{x} - \mu_0)^T S^{-1} (\bar{x} - \mu_0)

and, if H0 is true, then an F test can be used since (n - m) T^2 / (m(n - 1)) is distributed asymptotically as F_{m, n-m}. Users can input any reference mean vector \mu_0 to test for equality of means but, when the data columns are all differences between two observations for the same subjects and the aim is to test for no significant differences, so that \mu_0 is the zero vector, as with hotel.tf1, the test is a sort of higher dimensional analogue of the paired t test. Table 3.21 also shows the results when t tests are applied to the individual columns of differences between the sample means \bar{x} and the reference means \mu_0, which is suspect because of multiple testing but, in this case, the conclusion is the same as the Hotelling T^2 test: none of the column means are significantly different from zero.

Now, turning to a test that all means are equal, table 3.22 shows the results when the data in anova6.tf1 are analyzed, and the theoretical background to this test will be presented subsequently (page 121).
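For reference, the one-sample T^2 calculation is compact in NumPy/SciPy; the following sketch implements the formulas above (it is not SIMFIT's own code):

    import numpy as np
    from scipy.stats import f as fdist

    def hotelling_one_sample(X, mu0):
        """Test H0: population mean vector equals mu0, for an n by m data matrix X."""
        X = np.asarray(X, dtype=float)
        n, m = X.shape
        d = X.mean(axis=0) - np.asarray(mu0, dtype=float)
        S = np.cov(X, rowvar=False)              # sample covariance, n-1 denominator
        T2 = n * d @ np.linalg.solve(S, d)
        F = (n - m) * T2 / (m * (n - 1))         # F statistic with (m, n-m) deg. free.
        return T2, F, fdist.sf(F, m, n - m)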


Hotelling one sample T-square test
H0: Column means are all equal

No. rows = 5, No. columns = 4
Hotelling T-square = 1.705E+02
F Statistic (FTS)  = 2.841E+01
Deg. Free. (d1,d2) = 3, 2
P(F(d1,d2) >= FTS) = 0.0342   Reject H0 at 5% sig.level

Table 3.22: Hotelling T^2 test for H0: means are equal

Options are provided for investigating the structure of the covariance matrix. The sample covariance matrix and its inverse can be displayed along with eigenvalues and determinants, and there are also options to check if the covariance matrix has a special form, namely testing for compound symmetry, testing for spherical symmetry, and testing for spherical symmetry of the covariance matrix of orthonormal contrasts. For instance, using the test file hotel.tf1 produces the results of table 3.23, showing an application of a test for compound symmetry and a test for sphericity. Compound symmetry is when a covariance matrix has a special form with constant nonnegative diagonals and equal nonnegative off-diagonal elements as follows:

\Sigma = \sigma^2 \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}

This can be tested using estimates for the diagonal and off-diagonal elements \sigma^2 and \sigma^2 \rho as follows

s^2 = \frac{1}{m} \sum_{i=1}^{m} s_{ii}

s^2 r = \frac{2}{m(m-1)} \sum_{i=2}^{m} \sum_{j=1}^{i-1} s_{ij}.

The Wilks generalized likelihood-ratio statistic is

L = \frac{|S|}{(s^2 - s^2 r)^{m-1} [s^2 + (m-1) s^2 r]},

where the numerator is the determinant of the covariance matrix estimated with \nu degrees of freedom, while the denominator is the determinant of the matrix with average variance on the diagonals and average covariance as off-diagonal elements, and this is used to construct the test statistic

\chi^2 = -\left[ \nu - \frac{m(m+1)^2 (2m-3)}{6(m-1)(m^2+m-4)} \right] \log L

which, for large \nu, has an approximate chi-squared distribution with m(m+1)/2 - 2 degrees of freedom.

The sphericity test is designed to test the null hypothesis H0: \Sigma = kI against H1: \Sigma \ne kI. In other words, the null hypothesis is that the population covariance matrix is a simple multiple of the identity matrix, which is a central requirement for some analytical procedures. If the sample covariance matrix S has eigenvalues \lambda_i for i = 1, 2, \ldots, m then, defining the arithmetic mean A and geometric mean G of these eigenvalues as

A = (1/m) \sum_{i=1}^{m} \lambda_i

G = \left( \prod_{i=1}^{m} \lambda_i \right)^{1/m},


Variance-Covariance matrix
 2.1401E+00  -1.1878E-01  -8.9411E-01   3.5922E+00
-1.1878E-01   1.4868E+00   7.9144E-01   1.8811E+00
-8.9411E-01   7.9144E-01   2.4099E+00  -4.6011E+00
 3.5922E+00   1.8811E+00  -4.6011E+00   3.7878E+01

Pearson product-moment correlations
 1.0000  -0.0666  -0.3937   0.3990
-0.0666   1.0000   0.4181   0.2507
-0.3937   0.4181   1.0000  -0.4816
 0.3990   0.2507  -0.4816   1.0000

Compound symmetry test
H0: Covariance matrix has compound symmetry
No. of groups         = 1
No. of variables (k)  = 4
Sample size (n)       = 10
Determinant of CV     = 9.814E+01
Determinant of S_0    = 1.452E+04
LRTS (-2*log(lambda)) = 3.630E+01
Degrees of Freedom    = 8
P(chi-square >= LRTS) = 0.0000   Reject H0 at 1% sig.level

Likelihood ratio sphericity test
H0: Covariance matrix = k*Identity (for some k > 0)
No. small eigenvalues = 0 (i.e. < 1.00E-07)
No. of variables (k)  = 4
Sample size (n)       = 10
Determinant of CV     = 9.814E+01
Trace of CV           = 4.391E+01
Mauchly W statistic   = 6.756E-03
LRTS (-2*log(lambda)) = 4.997E+01
Degrees of Freedom    = 9
P(chi-square >= LRTS) = 0.0000   Reject H0 at 1% sig.level

Table 3.23: Covariance matrix symmetry and sphericity tests

the likelihood ratio test statistic

-2 \log \lambda = nm \log(A/G)

is distributed asymptotically as \chi^2 with (m-1)(m+2)/2 degrees of freedom. Using the fact that the determinant of a covariance matrix is the product of the eigenvalues while the trace is the sum, the Mauchly test statistic W can also be calculated from A and G since

W = \frac{|S|}{\{ \mathrm{Tr}(S)/m \}^m} = \frac{\prod_{i=1}^{m} \lambda_i}{\{ (\sum_{i=1}^{m} \lambda_i)/m \}^m}

so that -2 \log \lambda = -n \log W.

Clearly, the test rejects the assumption that the covariance matrix is a multiple of the identity matrix in this


case, a conclusion which is obvious from inspecting the sample covariance and correlation matrices. Since the calculation of small eigenvalues is very inaccurate when the condition number of the covariance matrix is appreciable, any eigenvalues less than the minimal threshold indicated are treated as equal to that threshold when calculating the test statistic.
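As an illustration of the eigenvalue route to the sphericity statistic, here is a minimal sketch (it omits SIMFIT's thresholding of small eigenvalues described above, so it should only be trusted for well-conditioned covariance matrices):

    import numpy as np
    from scipy.stats import chi2

    def sphericity_test(X):
        """Likelihood ratio test of H0: covariance = k*I for an n by m data matrix."""
        X = np.asarray(X, dtype=float)
        n, m = X.shape
        lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
        A = lam.mean()                      # arithmetic mean of eigenvalues
        G = np.exp(np.mean(np.log(lam)))    # geometric mean of eigenvalues
        LRTS = n * m * np.log(A / G)        # -2 log lambda = -n log W
        return LRTS, chi2.sf(LRTS, (m - 1) * (m + 2) // 2)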

3.9.3.4 t tests on groups across rows of a matrix

Sometimes a matrix with n rows and m columns holds data where groups are defined across rows by membership according to columns, and it is wished to do tests based on groups down through all rows. For instance, test file ttest.tf6 has 5 rows, but in each row the first 6 columns constitute one group (say X), while the next 7 columns constitute a second group (say Y). At the end of the file there is an extra text section, as follows

begin{limits}
1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 -1
end{limits}

where a 1 denotes membership of the X group, and a -1 indicates membership of the Y group. A 0 can also be used to indicate groups other than X or Y, i.e. to effectively suppress columns. Table 3.24 shows the results from analyzing ttest.tf6, which calculates the sample means and standard deviations required for a t test.

X_bar        X_std        Y_bar        Y_std        SE_diff       t            p
8.7500E+00   5.8224E-01   9.7429E+00   8.1824E-01   4.00913E-01   -2.4765E+00  0.0308
8.2167E+00   1.0420E+00   9.1143E+00   1.3006E+00   6.62051E-01   -1.3558E+00  0.2023
1.7933E+01   1.8886E+00   9.4714E+00   9.8778E-01   8.16416E-01    1.0365E+01  0.0000
-1.0000E+00  -1.0000E+00  -1.0000E+00  -1.0000E+00  -1.00000E+00  -1.0000E+00  -1.0000
8.8333E+00   8.2381E-01   1.8700E+01   1.6258E+00   7.36044E-01   -1.3405E+01  0.0000

Table 3.24: t tests on groups across rows of a matrix

The two-tail p values are displayed, and a -1 is used to signify that a row contains groups with zero variance, so that the test cannot be performed. The minimum sample size for each group is 2, although a much larger sample size should be used if possible.
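The same row-by-row scheme is straightforward to sketch with SciPy; the -1 flag mimicking SIMFIT's zero-variance convention is an assumption about presentation, not library behaviour:

    import numpy as np
    from scipy import stats

    def row_t_tests(A, mask):
        """Unpaired t test on each row; mask: 1 = group X, -1 = group Y, 0 = skip."""
        mask = np.asarray(mask)
        out = []
        for row in np.asarray(A, dtype=float):
            x, y = row[mask == 1], row[mask == -1]
            if x.std(ddof=1) == 0.0 or y.std(ddof=1) == 0.0:
                out.append((-1.0, -1.0))     # zero-variance rows cannot be tested
            else:
                t, p = stats.ttest_ind(x, y)
                out.append((float(t), float(p)))
        return out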

3.9.3.5 Nonparametric tests across rows of a matrix

If the assumptions of normality and constant variance required by the previous t test are not justified, the same technique can be applied using nonparametric tests. Table 3.25 shows the results from analyzing ttest.tf6 in this way using the Mann-Whitney U test. This requires larger samples than the previous t test and, if the group size is fairly large, a Kolmogorov-Smirnov 2-sample test can also be done.

MW_U         MW_Z         MW_2-tail_p
7.0000E+00   -1.9339E+00  0.9814
1.1000E+01   -1.3609E+00  0.9277
4.2000E+01    2.9286E+00  0.0006
-1.0000E+00  -1.0000E+00  -1.0000
0.0000E+00   -2.9326E+00  1.0000

Table 3.25: Nonparametric tests across rows


3.9.3.6 All possible pairwise tests (n vectors or a library file)

This option is used when you have several samples (column vectors) and wish to explore which samples differ significantly. The procedure takes in a library file referencing sets of vector files and then performs any combination of two-tailed t, Kolmogorov-Smirnov 2-sample, and/or Mann-Whitney U tests on all possible pairs. It is usual to select either just t tests for data that you know to be normally distributed, or just Mann-Whitney U tests otherwise. Because the number of tests is large, e.g., 3n(n-1)/2 for all tests with n samples, be careful not to use it with too many samples. For example, try it by reading in the library files anova1.tfl (or the smaller data set npcorr.tfl with three vectors of length 9, where the results are shown in table 3.26) and observing that significant differences are highlighted.

Mann-Whitney-U/Kolmogorov-Smirnov-D/unpaired-t tests
No. tests = 9, p(1%) = 0.001111, p(5%) = 0.005556 [Bonferroni]

column2.tf1 (data set 1), column2.tf2 (data set 2), N1 = 9, N2 = 9
MWU = 8.000E+00,   p = 0.00226 *
KSD = 7.778E-01,   p = 0.00109 **
T   = -3.716E+00,  p = 0.00188 *

column2.tf1 (data set 1), column2.tf3 (data set 3), N1 = 9, N2 = 9
MWU = 2.100E+01,   p = 0.08889
KSD = 5.556E-01,   p = 0.05545
T   = -2.042E+00,  p = 0.05796

column2.tf2 (data set 2), column2.tf3 (data set 3), N1 = 9, N2 = 9
MWU = 5.550E+01,   p = 0.19589
KSD = 4.444E-01,   p = 0.20511
T   = 1.461E+00,   p = 0.16350

Table 3.26: All possible comparisons

This technique can be very useful in preliminary data analysis, for instance to identify potentially rogue columns in analysis of variance, i.e., pairs of columns associated with small p values. However, it is up to you to appreciate when the results are meaningful and to make the necessary adjustment to critical significance levels where the Bonferroni principle is required (due to multiple tests on the same data).
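A sketch of the same all-pairwise scan with a Bonferroni-corrected threshold (illustrative only; the flagging convention is an assumption):

    from itertools import combinations
    from scipy import stats

    def all_pairwise(samples, alpha=0.05):
        """t, KS and Mann-Whitney U tests on all pairs of 1-D samples."""
        k = 3 * len(samples) * (len(samples) - 1) // 2   # total number of tests
        for (i, x), (j, y) in combinations(enumerate(samples), 2):
            tests = {"T": stats.ttest_ind(x, y),
                     "KSD": stats.ks_2samp(x, y),
                     "MWU": stats.mannwhitneyu(x, y, alternative="two-sided")}
            for name, res in tests.items():
                flag = "*" if res.pvalue < alpha / k else ""
                print(i, j, name, res.statistic, res.pvalue, flag)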


3.9.4 Statistical tests


3.9.4.1 1-sample t test

This procedure is used when you have a sample that is known to be normally distributed and wish to test H0: the mean is \mu_0, where \mu_0 is a known quantity. Table 3.27 shows the results for such a 1-sample t test on the data in test file normal.tf1.

No. of x-values             = 50
No. of degrees of freedom   = 49
Theoretical mean (mu_0)     = 0.000E+00
Sample mean (x_bar)         = -2.579E-02
Std. err. of mean (SE)      = 1.422E-01
TS = (x_bar - mu_0)/SE      = -1.814E-01
P(t >= TS) (upper tail p)   = 0.5716
P(t =< TS) (lower tail p)   = 0.4284
p for two tailed t test     = 0.8568
Diffn. D = x_bar - x_mu     = -2.579E-02
Lower 95% con. lim. for D   = -3.116E-01
Upper 95% con. lim. for D   = 2.600E-01
Conclusion: Consider accepting equality of means

Table 3.27: One sample t test

The procedure can first do a Shapiro-Wilks test for normality (page 90) if requested and then, for n values of x_i, it calculates the sample mean \bar{x}, sample variance s^2, standard error of the mean s_{\bar{x}} and test statistic TS according to

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

s_{\bar{x}} = \sqrt{s^2 / n}

TS = \frac{\bar{x} - \mu_0}{s_{\bar{x}}}

where \mu_0 is the supposed theoretical, user-supplied population mean. The significance levels for upper, lower, and two-tailed tests are calculated for a t distribution with n - 1 degrees of freedom. You can then change \mu_0 or select a new data set.
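A minimal SciPy equivalent follows (reading normal.tf1 with np.loadtxt assumes a headerless plain-column copy, since SIMFIT test files carry extra header and trailer lines):

    import numpy as np
    from scipy import stats

    x = np.loadtxt("normal.txt")             # assumed plain column of 50 values
    res = stats.ttest_1samp(x, popmean=0.0)  # two-tail p value
    print(res.statistic, res.pvalue)         # halve the p value for a one-tail test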

3.9.4.2 1-sample Kolmogorov-Smirnov test

This nonparametric procedure is used when you have a single sample (column vector) of reasonable size (say greater than 20) and wish to explore if it is consistent with some known distribution, e.g., normal, binomial, Poisson, gamma, etc. The test only works optimally with large samples where the null distribution is a continuous distribution that can be specified exactly and not defined using parameters estimated from the sample. It calculates the maximum positive difference D_n^+, negative difference D_n^-, and overall difference D_n = maximum of D_n^+ and D_n^- between the sample cumulative distribution function S(x_i) and the theoretical cdf F(x_i) under H0, i.e., if frequencies f(x_i) are observed for the variable x_i in a sample of size n, then

S(x_i) = \sum_{j=1}^{i} f(x_j)/n

F(x_i) = P(x \le x_i)

D_n^+ = \max(S(x_i) - F(x_i), 0), \quad i = 1, 2, \ldots, n

D_n^- = \max(F(x_i) - S(x_{i-1}), 0), \quad i = 2, 3, \ldots, n

D_n = \max(D_n^+, D_n^-).

The standardized statistics Z = \sqrt{n} D_n are calculated for the D values appropriate for upper-tail, lower-tail, or two-tailed tests, then the exact significance levels are calculated by SIMFIT for small samples by solving P(D_n \le a/n) using the difference equation

\sum_{j=0}^{[2a]} (-1)^j ((2a - j)^j / j!) \, q_{r-j}(a) = 0, \quad r = 2[a] + 1, 2[a] + 2, \ldots

with initial conditions

q_r(a) = 1, \quad r = 0
       = r^r / r!, \quad r = 1, \ldots, [a]
       = r^r / r! - 2a \sum_{j=0}^{[r-a]} ((a + j)^{j-1} / j!) (r - a - j)^{r-j} / (r - j)!, \quad r = [a + 1], \ldots, 2[a],

where [a] is the largest integer \le a regardless of the sign of a, while the series

\lim_{n \to \infty} P\left( \sqrt{n} D_n \le z \right) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} \exp(-2 i^2 z^2)

is used for large samples. For example, input the file normal.tf1 and test to see if these numbers do come from a normal distribution. See if your own files vector.1st and vector.2nd come from a uniform or a beta distribution. Note that there are two ways to perform this test; you can state the parameters, or they can be estimated by the program from the sample, using the method of moments, or else maximum likelihood. However, calculating parameters from samples compromises this test, leading to a significant reduction in power. If you want to see if a sample comes from a binomial, Poisson, uniform, beta, gamma, lognormal, normal, or Weibull distribution, etc., the data supplied must be of a type that is consistent with the supposed distribution, otherwise you will get error messages. Before you do any parametric test with a sample, you can always use this option to see if the sample is in fact consistent with the supposed distribution. An extremely valuable option provided is to view the best-fit cdf superimposed upon the sample cumulative distribution, which is a very convincing way to assess goodness of fit. Superposition of the best fit pdf on the sample histogram can also be requested, which is useful for discrete distributions but less useful for continuous distributions, since it requires large samples (say greater than 50) and the histogram shape depends on the number of bins selected. Table 3.28 illustrates the results when the test file normal.tf1 is analyzed to see if the data are consistent with a normal distribution using the Kolmogorov-Smirnov test with parameters estimated from the sample, and the Shapiro-Wilks test to be described shortly (page 90). Note that typical plots of the best fit normal distribution with the sample cumulative distribution, and best-fit density function overlayed on the sample histogram obtained using this procedure, can be seen on page 22, while normal scores plots were discussed and illustrated on page 80. Note that significance levels are given in the table for upper-tail, lower-tail, and two-tail tests. In general you should only use the two-tail probability levels, reserving the one-tail tests for situations where the only possibility is either that the sample mean may be shifted to the right of the null distribution, requiring an upper-tail test, or to the left, requiring a lower-tail test.
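For a fixed-parameter version of this test (remember the caveat above about estimating parameters from the sample), scipy.stats.kstest can be used; a sketch with a synthetic stand-in sample:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 50)                     # stand-in for normal.tf1
    D, p = stats.kstest(x, "norm", args=(0.0, 1.0))  # H0 parameters fixed in advance
    print(D, p)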


Data: 50 numbers from a normal distribution mu = 0 and sigma = 1
Parameters estimated from sample are:
mu    = -2.579E-02, se = 1.422E-01, 95%cl = (-3.116E-01, 2.600E-01)
sigma = 1.006E+00, sigma2 = 1.011E+00, 95%cl = (7.055E-01, 1.570E+00)
Sample size = 50, i.e. no. of x-values
H0: F(x) equals G(y) (x & theory are comparable) against
H1: F(x) not equal to G(y) (x & theory not comparable)
D = 9.206E-02, z = 6.510E-01, p = 0.7559
H2: F(x) > G(y) (x tend to be smaller than theoretical)
D = 9.206E-02, z = 6.510E-01, p = 0.3780
H3: F(x) < G(y) (x tend to be larger than theoretical)
D = 6.220E-02, z = 4.398E-01, p = 0.4919
Shapiro-Wilks normality test:
W statistic = 9.627E-01, Sign. level = 0.1153   Tentatively accept normality

Table 3.28: Kolmogorov-Smirnov 1-sample and Shapiro-Wilks tests

3.9.4.3 1-sample Shapiro-Wilks test for normality

This procedure is used when you have data and wish to test H0: the sample is normally distributed. It is a very useful general test which may perform better than the Kolmogorov-Smirnov 1-sample test just described, but the power is low, so reasonably large sample sizes (say > 20) are required. The test statistic W, where 0 \le W \le 1, is constructed by considering the regression of ordered sample values on the corresponding expected normal order statistics, so a normal scores plot should always be examined when testing for a normal distribution, and it should be approximately linear if a sample is from a normal distribution. For a sample of size n, this plot and the theory for the Shapiro-Wilks test require the normal scores, i.e., the expected values of the rth largest order statistics, given by

E(r, n) = \frac{n!}{(r-1)!(n-r)!} \int_{-\infty}^{\infty} x [1 - \Phi(x)]^{r-1} [\Phi(x)]^{n-r} \phi(x) \, dx,

where \phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} x^2 \right),

and \Phi(x) = \int_{-\infty}^{x} \phi(u) \, du.

Then the test statistic W uses the vector of expected values of a standard normal sample x_1, x_2, \ldots, x_n and the corresponding covariance matrix, that is

m_i = E(x_i) \ (i = 1, 2, \ldots, n), and v_{ij} = \mathrm{cov}(x_i, x_j) \ (i, j = 1, 2, \ldots, n),


so that, for an ordered random sample y_1, y_2, \ldots, y_n,

W = \frac{\left( \sum_{i=1}^{n} a_i y_i \right)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},

where a = m^T V^{-1} [(m^T V^{-1})(V^{-1} m)]^{-\frac{1}{2}}. Finally, the significance level for the statistic W calculated from a sample is obtained by transformation to an approximately standard normal deviate using

z = \frac{(1 - W)^\lambda - \mu}{\sigma},

where \lambda is estimated from the sample and \mu and \sigma are the corresponding mean and standard deviation. Values of W close to 1 support normality, while values close to 0 suggest deviation from normality.
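A readily available implementation of this test is scipy.stats.shapiro; a self-contained sketch:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)     # stand-in sample
    W, p = stats.shapiro(x)     # W close to 1 supports normality
    print(W, p)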
3.9.4.4 1-sample Dispersion and Fisher exact Poisson tests

This procedure is used when you have data in the form of non-negative integers (e.g. counts) and wish to test H0: the sample is from a Poisson distribution. Given a sample of n observations x_i with sample mean \bar{x} = \sum_{i=1}^{n} x_i / n from a Poisson distribution (page 285), the dispersion D given by

D = \sum_{i=1}^{n} (x_i - \bar{x})^2 / \bar{x}

is approximately chi-square distributed with n - 1 degrees of freedom. A test for consistency with a Poisson distribution can be based on this D statistic but, with small samples, the more accurate Fisher exact test can be performed. This estimates the probability of the sample observed based on all partitions consistent with the sample mean, size and total. After performing these tests on a sample of nonnegative integers, this option then plots a histogram of the observed and expected frequencies (page 256). Table 3.29 shows the results from analyzing data in the test file poisson.tf1 and also the results from using the previously discussed Kolmogorov 1-sample test with the same data. Clearly the data are consistent with a Poisson distribution. The mean and variance of a Poisson distribution are identical and three cases can arise.

1. The sample variance exceeds the upper confidence limit for the sample mean, indicating overdispersion, i.e. too much clustering/clumping.

2. The sample variance is within the confidence limits for the sample mean, indicating consistency with a Poisson distribution.

3. The sample variance is less than the lower confidence limit for the sample mean, indicating underdispersion, i.e. too much uniformity.

Output from the Kolmogorov-Smirnov 1-sample test for a Poisson distribution indicates if the variance is suspiciously small or large.
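The dispersion test is easy to reproduce directly from the definition (a sketch, not SIMFIT code; the Fisher exact Poisson test is omitted):

    import numpy as np
    from scipy.stats import chi2

    def poisson_dispersion(x):
        """D = sum((x - xbar)^2)/xbar, approx. chi-square with n-1 deg. free."""
        x = np.asarray(x, dtype=float)
        n, xbar = x.size, x.mean()
        D = np.sum((x - xbar) ** 2) / xbar
        return D, chi2.sf(D, n - 1)   # small p suggests overdispersion

For the data of Table 3.29 this returns D = 28.73 with P(chi-sq >= D) = 0.886, matching the tabulated values.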
3.9.4.5 2-sample unpaired t and variance ratio tests

This procedure is used when you have two samples x = (x_1, x_2, \ldots, x_m) and y = (y_1, y_2, \ldots, y_n) (i.e., two column vectors of measurements, not counts) which are assumed to come from two normal distributions with the same variance, and you wish to test H0: the means of the two samples are equal. It is equivalent to 1-way


analysis of variance (page 112) with just two columns. The test statistic U that is calculated is

U = \frac{\bar{x} - \bar{y}}{\sqrt{ s_p^2 \left( \frac{1}{m} + \frac{1}{n} \right) }}

where

\bar{x} = \sum_{i=1}^{m} x_i / m, \quad \bar{y} = \sum_{i=1}^{n} y_i / n,

s_x^2 = \sum_{i=1}^{m} (x_i - \bar{x})^2 / (m - 1), \quad s_y^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1),

and

s_p^2 = \frac{(m-1) s_x^2 + (n-1) s_y^2}{m + n - 2},

so that s_p^2 is the pooled variance estimate and U has a t distribution with m + n - 2 degrees of freedom under H0: the means are identical. The two sample means and sample variances are calculated, then the test statistic is calculated and the significance levels for a lower, upper and two-tail test are calculated.

Dispersion and Fisher-exact Poisson tests
Sample size        = 40
Sample total       = 44
Sample ssq         = 80
Sample mean        = 1.100E+00
Lower 95% con.lim. = 7.993E-01
Upper 95% con.lim. = 1.477E+00
Sample variance    = 8.103E-01
Dispersion (D)     = 2.873E+01
P(Chi-sq >= D)     = 0.88632
No. deg. freedom   = 39
Fisher exact Prob. = 0.91999

Kolmogorov-Smirnov one sample test
H0: F(x) equals G(y) (x & theory are comparable) against
H1: F(x) not equal to G(y) (x & theory not comparable)
D = 1.079E-01, z = 6.822E-01, p = 0.7003
H2: F(x) > G(y) (x tend to be smaller than theoretical)
D = 7.597E-02, z = 4.805E-01, p = 0.4808
H3: F(x) < G(y) (x tend to be larger than theoretical)
D = 1.079E-01, z = 6.822E-01, p = 0.3501

Table 3.29: Poisson distribution tests

For example, read in the test file pairs ttest.tf2 and ttest.tf3 and then, after analyzing them, read in the test file pairs ttest.tf4 and ttest.tf5. Note that before doing a t or paired t test, the program checks,


using a Shapiro-Wilks test, to see if the samples are consistent with normal distributions, and it also does a variance ratio test to see if the samples have common variances. However, note that the Shapiro-Wilks test, which examines the correlation between the sample cumulative distribution and the expected order statistics for normal distributions, and also the F test for common variances, which calculates

F = \max\left( \frac{s_x^2}{s_y^2}, \frac{s_y^2}{s_x^2} \right)

and compares this to critical values for the appropriate F distribution, are very weak with small samples, say less than 25. So you should not do these tests with small samples but, if you have large samples which do not pass these tests, you should ask yourself if doing a t test makes sense (since a t test depends upon the assumption that both samples are normal and with the same variance). Note that the Satterthwaite procedure, using a t_c statistic with \nu degrees of freedom calculated with the Welch correction for unequal variances, is performed at the same time, using

t_c = \frac{\bar{x} - \bar{y}}{se(\bar{x} - \bar{y})}

se(\bar{x} - \bar{y}) = \sqrt{ \frac{s_x^2}{m} + \frac{s_y^2}{n} }

\nu = \frac{se(\bar{x} - \bar{y})^4}{ (s_x^2/m)^2/(m-1) + (s_y^2/n)^2/(n-1) }

and the results are displayed within square brackets adjacent to the uncorrected results. However, this should only be trusted if the data sets seem approximately normally distributed with fairly similar variances. Note that, every time SIMFIT estimates parameters by regression, it estimates the parameter standard error and does a t test for parameter redundancy. However, at any time subsequently, you can choose the option to compare two parameters and estimated standard errors from the curve fitting menus, which does the above test corrected for unequal variances. Table 3.30 shows the results from analyzing data in ttest.tf4 and ttest.tf5 which are not paired. Clearly the correction for unequal variance is unimportant in this case and the unpaired t test supports equality of the means. Note that, if data have been input from files, simstat saves the last set of files for re-analysis, for instance to do the next test.
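A sketch of the equal-variance and Welch-corrected tests in SciPy (reading the data as plain columns is an assumption about file format):

    import numpy as np
    from scipy import stats

    x = np.loadtxt("xdata.txt")   # hypothetical plain-text copies of the two samples
    y = np.loadtxt("ydata.txt")
    F = max(x.var(ddof=1) / y.var(ddof=1), y.var(ddof=1) / x.var(ddof=1))  # variance ratio
    print(stats.ttest_ind(x, y))                   # pooled-variance t test
    print(stats.ttest_ind(x, y, equal_var=False))  # Welch/Satterthwaite correction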
3.9.4.6 2-sample paired t test

This procedure is used when you have paired measurements, e.g., two successive measurements on the same subjects before and after treatments, and wish to test H0: the mean of the differences between paired measurements is zero. Just as the unpaired t test is equivalent to analysis of variance with just two columns, the paired t test is equivalent to repeated measurements analysis of variance. For convenience, data for all paired tests can also be input as an n by 2 matrix rather than two vectors of length n. The paired t test is based on the assumption that the differences between corresponding pairs x_i and y_i are normally distributed, not necessarily the original data, although this would normally have to be the case, and it requires the calculation of \bar{d}, s_d^2, and t_d given by

d_i = x_i - y_i

\bar{d} = \sum_{i=1}^{n} d_i / n

s_d^2 = \sum_{i=1}^{n} (d_i - \bar{d})^2 / (n - 1)

t_d = \frac{\bar{d}}{\sqrt{s_d^2 / n}}


Normal distribution test 1, Data: X-data for t test
Shapiro-Wilks statistic W = 9.924E-01
Significance level for W  = 1.0000   Tentatively accept normality

Normal distribution test 2, Data: Y-data for t test
Shapiro-Wilks statistic W = 9.980E-01
Significance level for W  = 0.9999   Tentatively accept normality

F test for equality of variances
No. of x-values         = 12
Mean x                  = 1.200E+02
Sample variance of x    = 4.575E+02
Sample std. dev. of x   = 2.139E+01
No. of y-values         = 7
Mean y                  = 1.010E+02
Sample variance of y    = 4.253E+02
Sample std. dev. of y   = 2.062E+01
Variance ratio          = 1.076E+00
Deg. of freedom (num)   = 11
Deg. of freedom (denom) = 6
P(F >= Variance ratio)  = 0.4894
Conclusion: Consider accepting equality of variances

Unpaired t test ([ ] = corrected for unequal variances)
No. of x-values             = 12
No. of y-values             = 7
No. of degrees of freedom   = 17         [ 13]
Unpaired t test statistic U = 1.891E+00  [ 1.911E+00]
P(t >= U) (upper tail p)    = 0.0379     [ 0.0391]
P(t =< U) (lower tail p)    = 0.9621     [ 0.9609]
p for two tailed t test     = 0.0757     [ 0.0782]
Difference between means DM = 1.900E+01
Lower 95% con. limit for DM = -2.194E+00 [ -1.980E+00]
Upper 95% con. limit for DM = 4.019E+01  [ 3.998E+01]
Conclusion: Consider accepting equality of means

Table 3.30: Unpaired t test

The test statistic t_d is again assumed to follow a t distribution with n - 1 degrees of freedom. For more details of the t distribution see page 288. Table 3.31 shows the results from a paired t test with paired data from test files ttest.tf2 and ttest.tf3, where the test supports equality of means.
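The paired t statistic above can be coded directly; a sketch (scipy.stats.ttest_rel gives the same two-tail result):

    import numpy as np
    from scipy.stats import t as tdist

    def paired_t(x, y):
        d = np.asarray(x, float) - np.asarray(y, float)
        n = d.size
        td = d.mean() / np.sqrt(d.var(ddof=1) / n)
        return td, 2.0 * tdist.sf(abs(td), n - 1)   # two-tail p value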
3.9.4.7 2-sample Kolmogorov-Smirnov test

This nonparametric procedure is used when you have two samples (column vectors) X of length m and Y of length n and you wish to test H0: the samples are from the same, unspecified, distribution. It is a poor test unless both samples are fairly large (say > 20) and both come from a continuous and not a discrete distribution. The D_{m,n}^+, D_{m,n}^- and D_{m,n} values are obtained from the differences between the two sample cumulative distribution functions, then the test statistic

z = \sqrt{ \frac{mn}{m+n} } D_{m,n}


is calculated. For small samples SIMFIT calculates significance levels using the formula

P(D_{m,n} \ge d) = \frac{A(m, n)}{\binom{m+n}{n}}

where A(m, n) is the number of paths joining integer nodes from (0, 0) to (m, n) which lie entirely within the boundary lines defined by d in a plot with axes 0 \le X \le m and 0 \le Y \le n, and where A(u, v) at any intersection satisfies the recursion

A(u, v) = A(u - 1, v) + A(u, v - 1)

with boundary conditions A(0, v) = A(u, 0) = 1.

Paired t test
No. of degrees of freedom   = 9
Paired t test statistic S   = -9.040E-01
P(t >= S)                   = 0.8052
P(t =< S)                   = 0.1948
p for two tailed t test     = 0.3895
Mean of differences MD      = -1.300E+00
Lower 95% con. limit for MD = -4.553E+00
Upper 95% con. limit for MD = 1.953E+00
Conclusion: Consider accepting equality of means

Table 3.31: Paired t test

However, for large samples, the asymptotic formula

\lim_{m,n \to \infty} P\left( \sqrt{\frac{mn}{m+n}} D_{m,n} \le z \right) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} \exp(-2 i^2 z^2)

is employed. For example, use the test files ttest.tf4 and ttest.tf5 to obtain the results shown in table 3.32, and note that the test again supports equality of means.

Size of X-data = 12
Size of Y-data = 7
H0: F(x) is equal to G(y) (x and y are comparable) against
H1: F(x) not equal to G(y) (x and y not comparable)
D = 4.405E-01, z = 2.095E-01, p = 0.2653
H2: F(x) > G(y) (x tend to be smaller than y)
D = 0.000E+00, z = 0.000E+00, p = 0.5000
H3: F(x) < G(y) (x tend to be larger than y)
D = 4.405E-01, z = 2.095E-01, p = 0.1327

Table 3.32: Kolmogorov-Smirnov 2-sample test

You could also try your own files vector.1st and vector.2nd (prepared previously) to illustrate a very important set of principles. For instance, it is obvious to you what the values in the two samples suggest about the possibility of a common distribution. What do the upper, lower and two tail tests indicate? Do you agree? What happens if you put your vector files in the other way round? Once you have understood what happens to these data sets you will be a long way towards being able to analyze your own pairs of data sets. Note that, if data have been input from files, simstat saves the last set of files for re-analysis, for instance to do the next test.
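A self-contained SciPy sketch of the 2-sample test (synthetic stand-in data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x, y = rng.normal(0.0, 1.0, 12), rng.normal(0.5, 1.0, 7)
    res = stats.ks_2samp(x, y)          # two-sided by default
    print(res.statistic, res.pvalue)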


3.9.4.8 2-sample Wilcoxon-Mann-Whitney U test

The Mann-Whitney U nonparametric procedure (which is equivalent to the Wilcoxon rank-sum test) is used when you have two samples (column vectors) and wish to test H0: the samples have the same medians, against H_A: the distributions are not equivalent, e.g., one sample dominates the other in distribution. Although the test only works optimally for continuous data, it can be useful for scored data, where the order is meaningful but not the numerical magnitude of differences. The two samples, x of size m and y of size n, are combined, then the sums of the ranks of the two samples in the combined sample are used to calculate exact significance levels for small samples, or asymptotic values for large samples. The test statistic U is calculated from the ranks r_{xi} in the pooled sample, using average ranks for ties, as follows

R_x = \sum_{i=1}^{m} r_{xi}

U = R_x - \frac{m(m+1)}{2}.

The statistic U is also the number of times a score in sample y precedes a score in sample x, counting a half for tied scores, so large values suggest that x values tend to be larger than y values. For example, do exactly as for the t test using ttest.tf4 and ttest.tf5 and compare the results as displayed in table 3.33 with table 3.32 and table 3.30.

Size of X-data = 12
Size of Y-data = 7
U = 6.250E+01, z = 1.691E+00
H0: F(x) is equal to G(y) (x and y are comparable)
as null hypothesis against the alternatives:
H1: F(x) not equal to G(y) (x and y not comparable), p = 0.0873
H2: F(x) > G(y) (x tend to be smaller than y), p = 0.9605
H3: F(x) < G(y) (x tend to be larger than y), p = 0.0436   Reject H0 at 5% s-level

Table 3.33: Wilcoxon-Mann-Whitney U test

The null hypothesis H0: F(x) = G(y) is that the two samples are identically distributed, and the appropriate rejection regions are

U_x \le u_\alpha for H1: F(x) \ge G(y)

U_y \le u_\alpha for H1: F(x) \le G(y)

U_x \le u_{\alpha/2} or U_y \le u_{\alpha/2} for H1: F(x) \ne G(y)

where the critical points u_\alpha can be calculated from the distribution of U. Defining r_{m,n}(u) as the number of distinguishable arrangements of the m X and n Y variables such that in each sequence Y precedes X exactly u times, the recursions

r_{m,n}(u) = r_{m,n-1}(u) + r_{m-1,n}(u - n)

P(U = u) = p_{m,n}(u) = \frac{r_{m,n}(u)}{\binom{m+n}{m}}

= \frac{n}{m+n} p_{m,n-1}(u) + \frac{m}{m+n} p_{m-1,n}(u - n)


are used by SIMFIT to calculate exact tail probabilities for n, m \le 40 or m + n \le 50, but for larger samples a normal approximation is used. The parameter z in table 3.33 is the approximate normal test statistic given by

z = \frac{U - mn/2 \pm 0.5}{\sqrt{V(U)}}

where V(U) = \frac{mn(m+n+1)}{12} - \frac{mnT}{(m+n)(m+n-1)}

and T = \sum_{j=1}^{g} \frac{t_j (t_j - 1)(t_j + 1)}{12}

with g groups of ties containing t_j ties per group. The equivalence of this test using test statistic U = U_x and the Wilcoxon rank-sum test using test statistic R = R_x will be clear from the identities

U_x = R_x - m(m+1)/2

U_y = R_y - n(n+1)/2

U_x + U_y = mn

R_x + R_y = (m+n)(m+n+1)/2.

Many people recommend the consistent use of this test instead of the t or Kolmogorov-Smirnov tests, so you should try to find out why we need two nonparametric tests. For instance: do they both give the same results? Should you always use both tests? Are there circumstances when the two tests would give different results? Is rejection of H0 in the one-tail test of table 3.33 to be taken seriously with such small sample sizes, and so on. Note that the Kruskal-Wallis test (page 113) is the extension of the Mann-Whitney U test to more than two independent samples.
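A self-contained sketch using SciPy's implementation (synthetic stand-in data; SciPy chooses exact or asymptotic p values automatically):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x, y = rng.normal(0.0, 1.0, 12), rng.normal(0.5, 1.0, 7)
    res = stats.mannwhitneyu(x, y, alternative="two-sided")
    print(res.statistic, res.pvalue)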
3.9.4.9 2-sample Wilcoxon signed-ranks test

This procedure is used when you have two paired samples, e.g., two successive observations on the same subjects, and wish to test H0: the median of the differences is zero. Just as the Mann-Whitney U test is the nonparametric equivalent of the unpaired t test, the Wilcoxon paired-sample signed-ranks test is the nonparametric equivalent of the paired t test. If the data are counts, scores, proportions, percentages, or any other type of non-normal data, then these tests should be used instead of the t tests. Table 3.34 shows the results from analyzing the data in ttest.tf2 and ttest.tf3, which was previously done using the paired t test (page 95).

Size of data = 10
No. values suppressed = 0
W = 1.700E+01, z = -1.027E+00
H0: F(x) is equal to G(y) (x and y are comparable)
as null hypothesis against the alternatives:
H1: F(x) not equal to G(y) (x and y not comparable), p = 0.2480
H2: F(x) > G(y) (x tend to be smaller than y), p = 0.1240
H3: F(x) < G(y) (x tend to be larger than y), p = 0.8047

Table 3.34: Wilcoxon signed-ranks test

The test examines the pairwise differences between two samples of size n to see if there is any evidence to support a difference in location between the two populations, i.e. a nonzero median for the vector of differences between the two samples. It is usual to first suppress any values with zero differences and to use a zero test median value. The vector of differences is replaced by a vector of absolute differences which


is then ranked, followed by restoring the signs and calculating the sum of the positive ranks T^+, and the sum of negative ranks T^-, where clearly

T^+ + T^- = n(n+1)/2.

The null hypothesis H0: M = M_0 is that the median difference M equals a chosen median M_0, which is usually input as zero, and the appropriate rejection regions are

T^+ \le t_\alpha for H1: M < M_0

T^- \le t_\alpha for H1: M > M_0

T^+ \le t_{\alpha/2} or T^- \le t_{\alpha/2} for H1: M \ne M_0

where the critical points t_\alpha can be calculated from the distribution of T, which is either T^+ or T^-, such that P(T \le t_\alpha) = \alpha. If u_n(k) is the number of ways to assign plus and minus signs to the first n integers, then

P(T_n^+ = k) = \frac{u_n(k)}{2^n} = \frac{u_{n-1}(k - n) + u_{n-1}(k)}{2^n}

which is used by SIMFIT to calculate exact tail probabilities for n \le 80. The normal approximation z in table 3.34 is defined as

z = \frac{|A| - 0.5}{\sqrt{V}}

where A = [T - [n(n+1) - m(m+1)]/4]

and V = [n(n+1)(2n+1) - m(m+1)(2m+1) - R/2]/24.
3.9.4.10 Chi-square test on observed and expected frequencies

This test is used when you have a sample of n observed frequencies O_i and a corresponding set of expected frequencies E_i and wish to test that the observed frequencies are consistent with the distribution generating the expected values, by calculating the statistic C given by

C = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}.

Table 3.35 illustrates the results with test files chisqd.tf2 and chisqd.tf3. The test requires rather large samples, so that the expected values are all positive integers, if possible say \ge 5, and the number of observed values, i.e. bins n, is sufficient for a reasonable number of degrees of freedom in the chi-square test. If the total number of observations is k, then the number of bins n used to partition the data is often recommended to be of the order n \approx k^{0.4}, but this, and the intervals used to partition the data, depend on the shape of the assumed distribution. Of course, the significance level for the test will depend on the number of bins used, and the intervals selected to partition the sample of observations into bins.

If the expected frequencies are exact and the observed values are representative of the supposed distribution, then C is asymptotically distributed as chi-square with \nu degrees of freedom. In the usual case, the expected frequencies are not exact but are calculated using m parameters estimated from the sample, so that the degrees of freedom are given by \nu = n - 1 - m.

Figure 3.13 illustrates a bar chart for these data that can be inspected to compare the observed and expected values visually.
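scipy.stats.chisquare reproduces this statistic; its ddof argument supplies the extra m estimated parameters so that the degrees of freedom become n - 1 - m. A sketch with illustrative frequencies (not the chisqd.tf2/tf3 data):

    import numpy as np
    from scipy import stats

    obs = np.array([10, 12, 15, 13, 11, 9])              # illustrative counts
    exp = np.full(6, obs.sum() / 6.0)                    # e.g. uniform expectation
    C, p = stats.chisquare(f_obs=obs, f_exp=exp, ddof=0) # df = 6 - 1 - ddof
    print(C, p)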


No. of partitions (bins)  = 6
No. of deg. of freedom    = 5
Chi-square test stat. C   = 1.531E+00
P(chi-square >= C)        = 0.9095   Consider accepting H0
Upper tail 5% crit. point = 1.107E+01
Upper tail 1% crit. point = 1.509E+01

Table 3.35: Chi-square test on observed and expected frequencies

[Figure 3.13: Observed and Expected frequencies. Bar chart comparing the observed and expected frequencies over bins 1 to 6.]


3.9.4.11 Chi-square and Fisher-exact contingency table tests

These procedures, which are based on the hypergeometric distribution (page 284) and the chi-square distribution (page 289), are used when you have an n rows by m columns contingency table, that is, a table of non-negative integer frequencies f_{ij}, where i = 1, 2, \ldots, n and j = 1, 2, \ldots, m, and wish to test for homogeneity, i.e., independence or no association between the variables, using a chi-square test with (n-1)(m-1) degrees of freedom. The null hypothesis of no association assumes that the cell frequencies f_{ij} are consistent with cell probabilities p_{ij} defined in terms of the marginal probabilities p_{i.} and p_{.j} by H0: p_{ij} = p_{i.} p_{.j}. For example, try the test file chisqd.tf4 and observe that a Fisher exact test is done routinely on small contingency tables, as in table 3.36. Note that probabilities are calculated for all possible tables with the same marginals as the sample, and these are tabulated with cell (1,1) in increasing order, but such that the row 1 marginal is not greater than the row 2 marginal, while the column 1 marginal is not greater than the column 2 marginal. With the data in chisqd.tf4, the probability of the frequencies actually observed is above the critical level, so there is no evidence to support rejection of the null hypothesis of homogeneity. However, in cases where the probability of the observed frequencies is small, it is usual to add up the probabilities for tables with even more extreme frequencies in both directions, i.e. of increasing and decreasing frequencies from the observed configuration, in order to estimate the significance level for this test. Table 3.37 shows the results for a chi-square test on the same data.


Observed frequencies
3 (0.50000)   3 (0.50000)
7 (0.77778)   2 (0.22222)

p(r) = p(r in 1,1) (rearranged so R1 = smallest marginal and C2 >= C1)
p(0) = 0.04196
p(1) = 0.25175
p(2) = 0.41958
p(3) = 0.23976   p(*), observed frequencies
p(4) = 0.04496
p(5) = 0.00200

P Sums, 1-tail and 2-tail test statistics
Psum1 = 0.04196   sum of p(r) =< p(*) for r < 3
Psum2 = 0.95305   sum of all p(r) for r =< 3
Psum3 = 0.28671   sum of all p(r) for r >= 3
Psum4 = 0.04695   sum of p(r) =< p(*) for r > 3
Psum5 = 1.00000   Psum2 + Psum4
Psum6 = 0.32867   Psum1 + Psum3

Table 3.36: Fisher exact contingency table test

No. of rows          = 2
No. of columns       = 2
Chi-sq. test stat. C = 3.125E-01
No. deg. of freedom  = 1
P(chi-sq. >= C)      = 0.5762
Upper tail 5% point  = 3.841E+00
Upper tail 1% point  = 6.635E+00
L = -2*log(lambda)   = 1.243E+00
P(chi-sq. >= L)      = 0.2649
Yates correction used in chi-square

Table 3.37: Chi-square and likelihood ratio contingency table tests: 2 by 2

Note that Yates's continuity correction is used with 2 by 2 tables, which replaces the expression

\chi^2 = \frac{N (f_{11} f_{22} - f_{12} f_{21})^2}{r_1 r_2 c_1 c_2}

for frequencies f_{11}, f_{12}, f_{21}, f_{22}, marginals r_1, r_2, c_1, c_2, and sum of frequencies N by

\chi^2 = \frac{N (|f_{11} f_{22} - f_{12} f_{21}| - N/2)^2}{r_1 r_2 c_1 c_2},

although the value of this correction is disputed, and the Fisher exact test or likelihood ratio test should be used when analyzing 2 by 2 tables, especially where there are small expected frequencies. Also, contraction will be used automatically for sparse contingency tables by adding together near-empty rows or columns, with concomitant reduction in degrees of freedom. Also, SIMFIT calculates the likelihood ratio test statistic L, i.e., L = -2 \log \lambda, given by

L = 2 \sum_{i=1}^{n} \sum_{j=1}^{m} f_{ij} \log(f_{ij} / e_{ij})

where the expected frequencies e_{ij} are defined in terms of the observed frequencies f_{ij} and the marginals f_{i.}, f_{.j} by e_{ij} = f_{i.} f_{.j} / N,


but this will generally be very similar in value to the test statistic C. Analysis of the data in chisqd.tf5 shown in table 3.38 and table 3.39 illustrates another feature of contingency table analysis.

Observed chi-square frequencies
 6  15  10  38  62  26
16  12   9  22  36   5

No. of rows          = 2
No. of columns       = 6
Chi-sq. test stat. C = 1.859E+01
No. deg. of freedom  = 5
P(chi-sq. >= C)      = 0.0023   Reject H0 at 1% sig.level
Upper tail 5% point  = 1.107E+01
Upper tail 1% point  = 1.509E+01
L = -2*log(lambda)   = 1.924E+01
P(chi-sq. >= L)      = 0.0017   Reject H0 at 1% sig.level

Table 3.38: Chi-square and likelihood ratio contingency table tests: 2 by 6

Deviance (D) = 1.924E+01, deg.free. = 5
P(chi-sq>=D) = 0.0017   Reject H0 at 1% sig.level

Parameter  Estimate    Std.Err.  ..95% con. lim....      p
Constant    2.856E+00  7.48E-02   2.66E+00   3.05E+00   0.0000
Row 1       2.255E-01  6.40E-02   6.11E-02   3.90E-01   0.0168 *
Row 2      -2.255E-01  6.40E-02  -3.90E-01  -6.11E-02   0.0168 *
Col 1      -4.831E-01  1.89E-01  -9.69E-01   2.57E-03   0.0508 **
Col 2      -2.783E-01  1.73E-01  -7.24E-01   1.68E-01   0.1696 ***
Col 3      -6.297E-01  2.01E-01  -1.15E+00  -1.12E-01   0.0260 *
Col 4       5.202E-01  1.28E-01   1.90E-01   8.51E-01   0.0098
Col 5       1.011E+00  1.10E-01   7.27E-01   1.29E+00   0.0003
Col 6      -1.401E-01  1.64E-01  -5.62E-01   2.81E-01   0.4320 ***

Data  Model   Delta   Residual  Leverage
 6    13.44   -7.44   -2.2808   0.6442
15    16.49   -1.49   -0.3737   0.6518
10    11.61   -1.61   -0.4833   0.6397
38    36.65    1.35    0.2210   0.7017
62    59.87    2.13    0.2740   0.7593
26    18.94    7.06    1.5350   0.6578
16     8.56    7.44    2.2661   0.4414
12    10.51    1.49    0.4507   0.4533
 9     7.39    1.61    0.5713   0.4343
22    23.35   -1.35   -0.2814   0.5317
36    38.13   -2.13   -0.3486   0.6220
 5    12.06   -7.06   -2.3061   0.4628

Table 3.39: Loglinear contingency table analysis

For 2 by 2 tables the Fisher exact, chi-square, and likelihood ratio tests are usually adequate but, for larger contingency tables with no very small cell frequencies it may be useful to fit a log-linear model. To do this, SIMFIT defines dummy indicator variables for the rows and columns (page 56), then fits a generalized linear model (page 50) assuming a Poisson error distribution and log link, but imposing the constraints that the sum of row coefficients is zero and the sum of column coefficients is zero, to avoid


fitting an overdetermined model (page 53). The advantage of this approach is that the deviance, predicted frequencies, deviance residuals, and leverages can be calculated for the model

\log(\mu_{ij}) = \theta + \alpha_i + \beta_j,

where \mu_{ij} are the expected cell frequencies expressed as functions of an overall mean \theta, row coefficients \alpha_i, and column coefficients \beta_j. The row and column coefficients reflect the main effects of the categories, according to the above model, where

\sum_{i=1}^{n} \alpha_i = \sum_{j=1}^{m} \beta_j = 0,

and the deviance, which is a likelihood ratio test statistic, can be used to test the justification for a mixed term \gamma_{ij} in the saturated model

\log(\mu_{ij}) = \theta + \alpha_i + \beta_j + \gamma_{ij},

which fits exactly, i.e., with zero deviance. SIMFIT performs a chi-square test on the deviance to test the null hypothesis of homogeneity, which is the same as testing that all \gamma_{ij} are zero; the effect of individual cells can be assessed from the leverages, and various deviance residuals plots can be done to estimate goodness of fit of the assumed log-linear model.

Yet another type of chi-square test situation arises when observed and expected frequencies are available, as in the analysis of chisqd.tf2 and chisqd.tf3 shown in table 3.40.

Sum of obs. = 1.000E+02, Sum of exp. = 1.000E+02
No. of partitions (bins)  = 6
No. of deg. of freedom    = 4
Chi-square test stat. C   = 1.531E+00
P(chi-square >= C)        = 0.8212   Consider accepting H0
Upper tail 5% crit. point = 9.488E+00
Upper tail 1% crit. point = 1.328E+01

Table 3.40: Observed and expected frequencies

Where there are K observed frequencies O_i together with the corresponding expected frequencies E_i, a test statistic C can always be defined as

C = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i},

which has an approximate chi-square distribution with K - L degrees of freedom, where L is the number of parameters estimated from the data in order to define the expected values. Program chisqd can accept arbitrary vectors of observed and expected frequencies in order to perform such a chi-square test as that shown in table 3.40, and this test is also available at several other appropriate places in the SIMFIT package as a nonparametric test for goodness of fit, i.e. consistency between data and an assumed distribution. However, it should be noted that the chi-square test is an asymptotic test which should only be used when all expected frequencies exceed 1, and preferably 5.
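For 2 by 2 tables the exact and chi-square approaches are both available in SciPy; a sketch using the observed frequencies of Table 3.36:

    import numpy as np
    from scipy import stats

    table = np.array([[3, 3],
                      [7, 2]])               # observed frequencies, Table 3.36
    odds, p_exact = stats.fisher_exact(table)            # two-sided exact test
    chi2_, p_chi, dof, expected = stats.chi2_contingency(table)  # Yates correction for 2x2
    print(p_exact, p_chi)

The two-sided exact p value should agree with the Psum6 statistic (0.32867) reported in Table 3.36, since SciPy also sums the probabilities of all tables no more probable than the observed one.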
3.9.4.12 McNemar test

This procedure is used with paired samples of dichotomous data in the form of a 2 by 2 table of nonnegative frequencies f_{ij} which can be analyzed by calculating the \chi^2 test statistic given by

\chi^2 = \frac{(|f_{12} - f_{21}| - 1)^2}{f_{12} + f_{21}}.

This has an approximate chi-square distribution with 1 degree of freedom. More generally, for larger r by r tables with identical paired row and column categorical variables, the continuity correction is not used, and


the appropriate test statistic is

\chi^2 = \sum_{i=1}^{r} \sum_{j>i} \frac{(f_{ij} - f_{ji})^2}{f_{ij} + f_{ji}}

with r(r-1)/2 degrees of freedom. Table 3.41 illustrates this test by showing that the analysis of data in mcnemar.tf1 is consistent with association between the variables.

Data for McNemar test
173   20    7
 15   51    2
  5    3   24

H0: association between row and column data.
Data: Data for McNemar test (details at end of file)
No. of rows/columns  = 3
Chi-sq. test stat. C = 1.248E+00
No. deg. of freedom  = 3
P(chi-sq. >= C)      = 0.7416   Consider accepting H0
Upper tail 5% point  = 7.815E+00
Upper tail 1% point  = 1.134E+01

Table 3.41: McNemar test

Unlike the normal contingency table analysis where the null hypothesis is independence of rows and columns, with this test there is intentional association between rows and columns. The test statistic does not use the diagonal frequencies f_{ii} and is testing whether the upper right corner of the table is symmetrical with the lower left corner.
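The generalized statistic is short to implement directly; the following sketch reproduces the Table 3.41 results:

    import numpy as np
    from scipy.stats import chi2

    def mcnemar(f):
        """Generalized McNemar statistic for an r by r table of paired frequencies."""
        f = np.asarray(f, dtype=float)
        r = f.shape[0]
        terms = [(f[i, j] - f[j, i]) ** 2 / (f[i, j] + f[j, i])
                 for i in range(r) for j in range(i + 1, r) if f[i, j] + f[j, i] > 0]
        stat = sum(terms)
        return stat, chi2.sf(stat, r * (r - 1) // 2)

    print(mcnemar([[173, 20, 7], [15, 51, 2], [5, 3, 24]]))  # approx. (1.248, 0.742)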
3.9.4.13 Cochran Q repeated measures test on a matrix of 0,1 values

This procedure is used for a randomized block or repeated-measures design with a dichotomous variable. The blocks (e.g., subjects) are in rows from 1 to n of a matrix while the attributes, which can be either 0 or 1, are in groups, that is, columns 1 to m. So, with n blocks, m groups, G_i as the number of attributes equal to 1 in group i, and B_j as the number of attributes equal to 1 in block j, the statistic Q is calculated, where

Q = \frac{(m-1) \left[ \sum_{i=1}^{m} G_i^2 - \frac{1}{m} \left( \sum_{i=1}^{m} G_i \right)^2 \right]}{\sum_{j=1}^{n} B_j - \frac{1}{m} \sum_{j=1}^{n} B_j^2}

and Q is distributed as approximately chi-square with m - 1 degrees of freedom. It is recommended that m should be at least 4 and mn should be at least 24 for the approximation to be satisfactory. For example, try the test file cochranq.tf1 to obtain the results shown in table 3.42, noting that rows with all 0 or all 1 are not counted, while you can optionally have an extra column of successive integers in order from 1 to n in the first column to help you identify the subjects in the results file. Clearly, the test provides no reason to reject the null hypothesis that the binary response of the subjects is the same for the variables called A, B, C, D, E in table 3.42.

Data for Cochran Q test
           A  B  C  D  E
subject-1  0  0  0  1  0
subject-2  1  1  1  1  1
subject-3  0  0  0  1  1
subject-4  1  1  0  1  0
subject-5  0  1  1  1  1
subject-6  0  1  0  0  1
subject-7  0  0  1  1  1
subject-8  0  0  1  1  0

No. blocks (rows) = 7
No. groups (cols) = 5
Cochran Q value   = 6.947E+00
P(chi-sqd. >= Q)  = 0.1387
95% chi-sq. point = 9.488E+00
99% chi-sq. point = 1.328E+01

Table 3.42: Cochran Q repeated measures test
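The Q statistic is easy to compute directly; this sketch drops all-0 and all-1 rows first (as SIMFIT does) and reproduces Q = 6.947 for the Table 3.42 data:

    import numpy as np
    from scipy.stats import chi2

    def cochran_q(X):
        """Cochran Q for an n by m matrix of 0/1 values."""
        X = np.asarray(X, dtype=float)
        s = X.sum(axis=1)
        X = X[(s > 0) & (s < X.shape[1])]   # drop uninformative rows
        n, m = X.shape
        G, B = X.sum(axis=0), X.sum(axis=1)
        Q = (m - 1) * (np.sum(G**2) - np.sum(G)**2 / m) / (np.sum(B) - np.sum(B**2) / m)
        return Q, chi2.sf(Q, m - 1)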
3.9.4.14 The binomial test

This procedure, which is based on the binomial distribution (page 283), is used with dichotomous data, i.e., where an experiment has only two possible outcomes and it is wished to test H0: binomial p = p0 for some 0 \le p0 \le 1, for instance, to test if success and failure are subject to pre-determined probabilities, e.g., equally likely. You input the number of successes, k, the number of Bernoulli trials, N, and the supposed probability of success, p, then the program calculates the probabilities associated with k, N, p, and l = N - k, including the estimated probability parameter \hat{p} with 95% confidence limits, and the two-tail binomial test statistic. The probabilities, which can be used for upper-tail, lower-tail, or two-tail testing, are

\hat{p} = k/N
P(X = k) = \binom{N}{k} p^k (1 - p)^{N-k}
P(X > k) = \sum_{i=k+1}^{N} \binom{N}{i} p^i (1 - p)^{N-i}
P(X < k) = \sum_{i=0}^{k-1} \binom{N}{i} p^i (1 - p)^{N-i}
P(X = l) = \binom{N}{l} p^l (1 - p)^{N-l}
P(X > l) = \sum_{i=l+1}^{N} \binom{N}{i} p^i (1 - p)^{N-i}
P(X < l) = \sum_{i=0}^{l-1} \binom{N}{i} p^i (1 - p)^{N-i}
P(\text{two tail}) = \min(P(X \ge k), P(X \le k)) + \min(P(X \ge l), P(X \le l)).

Table 3.43 shows, for example, that the probability of obtaining five successes (or alternatively five failures) in an experiment with equiprobable outcome would not lead to rejection of H0: p = 0.5 in a two-tail test.

Successes K = 5
Trials N    = 5
L = (N - K) = 0
p-theory    = 0.50000
p-estimate  = 1.00000 (95% c.l. = 0.47818, 1.00000)
P( X > K )  = 0.00000
P( X < K )  = 0.96875
P( X = K )  = 0.03125
P( X >= K ) = 0.03125
P( X =< K ) = 1.00000
P( X > L )  = 0.96875
P( X < L )  = 0.00000
P( X = L )  = 0.03125
P( X >= L ) = 1.00000
P( X =< L ) = 0.03125
Two tail binomial test statistic = 0.06250

Table 3.43: Binomial test

Note, for instance, that the exact confidence limit for the estimated probability includes 0.5. Many life scientists, when asked what is the minimal sample size to be used in an experiment, e.g. the number of experimental animals in a trial, would use a minimum of six, since the null hypothesis of no effect would never be rejected with a sample size of five.
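For comparison, the exact binomial test of table 3.43 can be reproduced with scipy (version 1.7 or later is assumed for binomtest); the default confidence interval method is the exact Clopper-Pearson one, matching the unsymmetrical limits above.

# Exact binomial test for k = 5 successes in N = 5 trials with p0 = 0.5.
from scipy.stats import binomtest

result = binomtest(k=5, n=5, p=0.5)
print(result.pvalue)                                # 0.0625, as in table 3.43
print(result.proportion_ci(confidence_level=0.95))  # ~ (0.47818, 1.0)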
3.9.4.15 The sign test

This procedure, which is also based on the binomial distribution (page 283) but assuming the special case p = 0.5, is used with dichotomous data, i.e., where an experiment has only two possible outcomes and it is wished to test if success and failure are equally likely. The test is rather weak, and large samples, say greater than 20, are usually recommended. For example, just enter the number of positives and negatives and observe the probabilities calculated. Table 3.44 could be used, for instance, to find out how many consecutive successes you would have to observe before the likelihood of an equiprobable outcome would be questioned. Obviously five successes and five failures is perfectly consistent with the null hypothesis H0: p = 0.5, but see next what happens when the pattern of successes and failures is considered. Note that the Friedman test (page 116) is the extension of the sign test to more than two matched samples.

Sign test analysis with m + n = 10
P( +ve = m )  = 0.24609, m = 5
P( +ve > m )  = 0.37695
P( +ve < m )  = 0.37695
P( +ve >= m ) = 0.62305
P( +ve =< m ) = 0.62305
P( -ve = n )  = 0.24609, n = 5
P( -ve < n )  = 0.37695
P( -ve > n )  = 0.37695
P( -ve =< n ) = 0.62305
P( -ve >= n ) = 0.62305
Two tail sign test statistic = 1.00000

Table 3.44: Sign test
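The sign test probabilities in table 3.44 follow directly from the binomial distribution with p = 0.5, as in this small sketch:

# Sign test probabilities for m = 5 positives out of m + n = 10 signs.
from scipy.stats import binom, binomtest

m, n = 5, 5
N = m + n
print(binom.pmf(m, N, 0.5))         # P( +ve = m )  = 0.24609
print(binom.sf(m, N, 0.5))          # P( +ve > m )  = 0.37695
print(binom.cdf(m - 1, N, 0.5))     # P( +ve < m )  = 0.37695
print(binomtest(m, N, 0.5).pvalue)  # two tail sign test statistic = 1.0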
3.9.4.16 The run test

This is also based on an application of the binomial distribution (page 283) and is used when the sequence of successes and failures (presumed in the null hypothesis to be equally likely) is of interest, not just the overall proportions. For instance, the sequence + + + - - + + - - - + -, or alternatively 111001100010,


has twelve items with six runs, as will be clear by adding brackets like this: (aaa)(bb)(aa)(bbb)(a)(b). You perform this test by providing the number of items, signs, then runs, and you will be warned if the number of runs is inconsistent with the number of positive and negative signs; otherwise the probability of the number of runs given the number of positives and negatives will be calculated. Again, rather large samples are recommended. For instance, what is the probability of a sample of ten new born babies consisting of five boys and five girls? What if all the boys were born first, then all the girls, that is, two runs? We have seen in table 3.44 that the sign test alone does not help, but table 3.45 would confirm what most would believe intuitively: the event may not represent random sampling but could suggest the operation of other factors.

No. of -ve numbers = 5
No. of +ve numbers = 5
No. of runs        = 2
Probability(runs =< observed;
  given no. of +ve, -ve numbers) = 0.00794   Reject H0 at 1% s-level
Critical no. for 1% sig. level   = 2
Critical no. for 5% sig. level   = 3
Probability(runs =< observed;
  given no. of non zero numbers) = 0.01953   Reject H0 at 5% s-level
Probability(signs =< observed)   = 1.00000
  (Two tail sign test statistic)

Table 3.45: Run test

In this way the run test, particularly when conditional upon the number of successes and failures, is using information from the sequence of outcomes and is therefore more powerful than the sign test alone. The run test can be valuable in the analysis of residuals if there is a natural ordering, for instance, when the residuals are arranged to correspond to the order of a single independent variable. This is not possible if there are replicates, or several independent variables, so to use the run test in such circumstances the residuals must be arranged in some meaningful sequence, such as the order in time of the observation, otherwise arbitrary results can be obtained by rearranging the order of residuals. Given the numbers of positive and negative residuals, the probability of any possible number of runs can be calculated by enumerating all possible arrangements. For instance, the random number of runs R given m positive and n negative residuals (redefining if necessary so that m \le n) depends on whether the number of runs is even or odd as follows:

P(R = 2k) = \frac{2\binom{m-1}{k-1}\binom{n-1}{k-1}}{\binom{m+n}{m}},

or

P(R = 2k+1) = \frac{\binom{m-1}{k-1}\binom{n-1}{k} + \binom{m-1}{k}\binom{n-1}{k-1}}{\binom{m+n}{m}}.

Here the maximum number of runs is 2m + 1 if m < n, or 2m if m = n, where k = 1, 2, \dots, m \le n. However, in the special case that m > 20 and n > 20, the probabilities of r runs can be estimated by using a normal


distribution with

\mu = \frac{2mn}{m+n} + 1,
\sigma^2 = \frac{2mn(2mn - m - n)}{(m+n)^2(m+n-1)},
z = \frac{r - \mu + 0.5}{\sigma},

where the usual continuity correction is employed. The previous conditional probabilities depend on the values of m and n, but it is sometimes useful to know the absolute probability of R runs given N = n + m nonzero residuals. There will always be at least one run, so the probability of r runs occurring depends on the number of ways of choosing break points where a sequence of residuals changes sign. This will be the same as the number of ways of choosing r - 1 items from N - 1 without respect to order, divided by the total number of possible configurations, i.e. the probability of r - 1 successes in N - 1 independent Bernoulli trials, given by

P(R = r) = \binom{N-1}{r-1}\left(\frac{1}{2}\right)^{N-1}.

This is the value referred to as the probability of runs given the number of nonzero residuals in table 3.45.
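Both the conditional and the absolute run probabilities are easy to evaluate exactly; the sketch below reproduces the two probabilities of table 3.45 for two runs from five positives and five negatives.

# Exact run test probabilities from the formulas above.
from math import comb

def c(a, b):                       # binomial coefficient, zero outside range
    return comb(a, b) if 0 <= b <= a else 0

def p_runs_given_mn(r, m, n):      # P(R = r | m positives, n negatives)
    denom = comb(m + n, m)
    if r % 2 == 0:
        k = r // 2
        return 2 * c(m - 1, k - 1) * c(n - 1, k - 1) / denom
    k = (r - 1) // 2
    return (c(m - 1, k - 1) * c(n - 1, k) + c(m - 1, k) * c(n - 1, k - 1)) / denom

m = n = 5
N = m + n
print(sum(p_runs_given_mn(r, m, n) for r in (1, 2)))          # 0.00794
print(sum(c(N - 1, r - 1) * 0.5 ** (N - 1) for r in (1, 2)))  # 0.01953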
3.9.4.17 The F test for excess variance

This procedure, which is based on the F distribution (page 289), is used when you have fitted two nested models, e.g., polynomials of different degrees, to the same data and wish to use parsimony to see if the extra parameters are justified. You input the weighted sums of squares WSSQ1 for model 1 with m1 parameters and WSSQ2 for model 2 with m2 parameters, and the sample size n, when the following test statistic is calculated

F(m_2 - m_1, n - m_2) = \frac{(WSSQ_1 - WSSQ_2)/(m_2 - m_1)}{WSSQ_2/(n - m_2)}.

Table 3.46 illustrates how the test is performed.

Q1 ((W)SSQ for model 1)  = 1.200E+01
Q2 ((W)SSQ for model 2)  = 1.000E+01
M1 (no. params. model 1) = 2
M2 (no. params. model 2) = 3
NPTS (no. exper. points) = 12
Numerator deg. freedom   = 1
Denominator deg. freedom = 9
F test statistic TS      = 1.800E+00
P(F >= TS)               = 0.2126
P(F =< TS)               = 0.7874
5% upper tail crit. pnt. = 5.117E+00
1% upper tail crit. pnt. = 1.056E+01
Conclusion: Model 2 is not justified ... Tentatively accept model 1

Table 3.46: F test for excess variance

This test for parameter redundancy is also widely used with models that are not linear or nested and, in such circumstances, it must be interpreted with caution as no more than a useful guide.
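The quantities in table 3.46 can be checked directly, as in this minimal sketch:

# F test for excess variance using the values of table 3.46.
from scipy.stats import f

wssq1, m1 = 12.0, 2     # model 1 (W)SSQ and parameter count
wssq2, m2 = 10.0, 3     # model 2
n = 12                  # number of experimental points
F = ((wssq1 - wssq2) / (m2 - m1)) / (wssq2 / (n - m2))
print(F, f.sf(F, m2 - m1, n - m2))   # 1.8 and ~0.2126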


However, as the use of this test is so common, SIMFIT provides a way to store the necessary parameters in an archive file w_ftests.cfg after any fitting, so that the results can be recalled retrospectively to assess model validity. Note that, when you select to recover stored values for this test, you must ensure that the data are retrieved in the correct order. The best way to ensure this is by using a systematic technique for assigning meaningful titles as the data are stored.

The justification for the F test can be illustrated by successive fitting of polynomials

H_0: f(x) = \alpha_0
H_1: f(x) = \alpha_0 + \alpha_1 x
H_2: f(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2
\dots
H_k: f(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \dots + \alpha_k x^k

in a situation where experimental error is normal with zero mean and constant variance, and the true model is a polynomial, a situation that will never be encountered in real life. The important distributional results, illustrated for the case of two models i and j, with j > i \ge 0, so that the number of points and parameters satisfy n > m_j > m_i while the sums of squares are Q_i > Q_j, are then

1. (Q_i - Q_j)/\sigma^2 is \chi^2(m_j - m_i) under model i
2. Q_j/\sigma^2 is \chi^2(n - m_j) under model j
3. Q_j and Q_i - Q_j are independent under model j.

So the likelihood ratio test statistic

F = \frac{(Q_i - Q_j)/(m_j - m_i)}{Q_j/(n - m_j)}

is distributed as F(m_j - m_i, n - m_j) if the true model is model i, which is a special case of model j in the nested hierarchy of the polynomial class of linear models.

3.9.5 Nonparametric tests using rstest


When it is not certain that your data are consistent with a known distribution for which special tests have been devised, it is advisable to use nonparametric tests. Many of these are available at appropriate points from simstat as follows: Kolmogorov-Smirnov 1-sample (page 88) and 2-sample (page 94), Mann-Whitney U (page 96), Wilcoxon signed ranks (page 97), chi-square (page 98), Cochran Q (page 103), sign (page 104), run (page 105), Kruskal-Wallis (page 113), Friedman (page 116), and nonparametric correlation (page 135) procedures. However, for convenience, program rstest should be used, and this also provides further tests as now described.
3.9.5.1 Runs up and down test for randomness

The runs up test can be conducted on a vector of observations x_1, x_2, \dots, x_n provided n is large and there are no ties in the data. The runs down test is done by multiplying the sample by -1 and then repeating the runs up test. Table 3.47 illustrates the results from analyzing normal.tf1, showing no evidence against randomness. The numbers of runs up c_i of length i are calculated for increasing values of i up to a limit r - 1, all runs of length greater than r - 1 being counted as runs of length r. Then the chi-square statistic

\chi^2 = (c - e)^T \Sigma_c^{-1} (c - e)

with r degrees of freedom is calculated, where

c = (c_1, c_2, \dots, c_r)^T, the vector of counts,
e = (e_1, e_2, \dots, e_r)^T, the vector of expected values, and
\Sigma_c = the covariance matrix.

Note that the default maximum value allowed for r is set by SIMFIT at six, which should be sufficient for most purposes.


Title of data: 50 numbers from a normal distribution mu = 0 and sigma = 1
Size of sample                   = 50
CU (chi-sq. stat. for runs up)   = 1.210E+00
Degrees of freedom               = 6
P(chi-sq. >= CU) (upper tail p)  = 0.9764
CD (chi-sq. stat. for runs down) = 6.011E-01
Degrees of freedom               = 6
P(chi-sq. >= CD) (upper tail p)  = 0.9964

Table 3.47: Runs up and down test for randomness

3.9.5.2 Median test

The median test examines the difference between the medians of two samples in order to test the null hypothesis H0: the medians are the same, against the alternative hypothesis that they are different. Table 3.48 presents the results from analyzing g08acf.tf1 and g08acf.tf2.

Current data sets X and Y are:
Data for G08ACF: the median test
No. X-values = 16
Data for G08ACF: the median test
No. Y-values = 23
Results for median test:
H0: medians are the same
No. X-scores below pooled median = 13
No. Y-scores below pooled median = 6
Probability under H0 = 0.0009    Reject H0 at 1% sig.level

Table 3.48: Median test

The test procedure is first to calculate the median for the pooled sample, then form a two by two contingency table for scores above and below this pooled median. For small samples a Fisher exact test is used, otherwise a chi-square approximation is used.
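A first-principles sketch of this procedure on hypothetical samples follows; the pooled median is computed, a 2 by 2 table of counts below and not below it is formed, and the Fisher exact test is applied as for small samples.

# Median test built from first principles on hypothetical samples.
import numpy as np
from scipy.stats import fisher_exact

x = np.array([13.0, 6.2, 11.8, 9.1, 10.3, 8.7, 12.5, 9.8])
y = np.array([7.0, 5.1, 6.6, 8.0, 5.9, 7.4, 6.1, 8.3])
med = np.median(np.concatenate([x, y]))        # pooled median
table = [[int(np.sum(x < med)), int(np.sum(x >= med))],
         [int(np.sum(y < med)), int(np.sum(y >= med))]]
odds, p = fisher_exact(table)
print(med, table, p)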
3.9.5.3 Mood's test and David's test for equal dispersion


These are used to test the null hypothesis of equal dispersions, i.e. equal variances. Table 3.49 presents the results from analyzing g08baf.tf1 and g08baf.tf2. If the two samples are of size n_1 and n_2, so that n = n_1 + n_2, then the ranks r_i in the pooled sample are calculated. The two test statistics W and V are defined as follows. Mood's test assumes that the two samples have the same mean, so that

W = \sum_{i=1}^{n_1} \left(r_i - \frac{n+1}{2}\right)^2,

which is the sum of squares of deviations from the average rank in the pooled sample, is approximately normal for large n. David's test uses the mean rank

\bar{r} = \sum_{i=1}^{n_1} r_i / n_1


Current data sets X and Y are:
Data for G08BAF: Mood-David tests for equal dispersions
No. X-values = 6
Data for G08BAF: Mood-David tests for equal dispersions
No. Y-values = 6
Results for the Mood test
H0: dispersions are equal
H1: X-dispersion > Y-dispersion
H2: X-dispersion < Y-dispersion
The Mood test statistic = 7.550E+01
Probability under H0    = 0.8339
Probability under H1    = 0.4170
Probability under H2    = 0.5830
Results for the David test
H0: dispersions are equal
H1: X-dispersion > Y-dispersion
H2: X-dispersion < Y-dispersion
The David test statistic = 9.467E+00
Probability under H0     = 0.3972
Probability under H1     = 0.8014
Probability under H2     = 0.1986

Table 3.49: Mood-David equal dispersion tests

to reduce the effect of the assumption of equal means in the calculation

V = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (r_i - \bar{r})^2,

which is also approximately normally distributed for large n.
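scipy provides Mood's test for equal dispersions directly, reporting a standardized normal statistic rather than W itself; the two samples below are hypothetical.

# Mood's test for equal dispersions (normal approximation).
from scipy.stats import mood

x = [5.1, 7.9, 4.2, 6.6, 8.8, 5.5]
y = [6.0, 6.3, 5.8, 6.1, 6.4, 5.9]
z, p = mood(x, y)           # two-sided by default
print(z, p)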

3.9.5.4 Kendall coefficient of concordance

This test is used to measure the degree of agreement between k comparisons of n objects. Table 3.50 presents the results from analyzing g08daf.tf1, i.e. the data file shown in table 3.51, which illustrates the format for supplying data for analysis.

H0: no agreement between comparisons
Data title: Data for G08DAF
No. of columns (objects)  = 10
No. of rows (comparisons) = 3
Kendall coefficient W     = 0.8277
P(chi-sq >= W)            = 0.0078   Reject H0 at 1% sig.level

Table 3.50: Kendall coefficient of concordance: results

Data for G08DAF
 3 10
 1.0  4.5  2.0  4.5  3.0  7.5  6.0  9.0  7.5 10.0
 2.5  1.0  2.5  4.5  4.5  8.0  9.0  6.5 10.0  6.5
 2.0  1.0  4.5  4.5  4.5  4.5  8.0  8.0  8.0 10.0
 5
Rows are comparisons (i = 1,2,...,k)
Columns are objects (j = 1,2,...,n)
The A(i,j) are ranks of object j in comparison i
The A(i,j) must be > 0 and ties must be averages so that
the sum of ranks A(i,j) for j = 1,2,...,n must be n(n + 1)/2

Table 3.51: Kendall coefficient of concordance: data

Ranks r_{ij} for the rank of object j in comparison i (with tied values being given averages) are used to calculate the n column rank sums R_j, which would be approximately equal to the average rank sum k(n+1)/2 under H0: there is no agreement. For total agreement the R_j would have values from some permutation of k, 2k, \dots, nk, and the total squared deviation of these is k^2(n^3 - n)/12. Then the coefficient W is calculated according to



W = \frac{\sum_{j=1}^{n} \left(R_j - k(n+1)/2\right)^2}{k^2(n^3 - n)/12},

which lies between 0 for complete disagreement and 1 for complete agreement. For large samples (n > 7), k(n - 1)W is approximately \chi^2_{n-1} distributed, otherwise tables should be used for accurate significance levels.
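The following sketch computes W for the ranks of table 3.51. To reproduce the value 0.8277 of table 3.50, the denominator must include the standard correction for ties, which the displayed formula omits; that correction, subtracting k\sum(t^3 - t)/12 over the tie groups of each comparison, is an assumption here about what the library routine does.

# Kendall coefficient of concordance with tie correction (table 3.51 data).
import numpy as np
from collections import Counter
from scipy.stats import chi2

A = np.array([[1.0, 4.5, 2.0, 4.5, 3.0, 7.5, 6.0, 9.0, 7.5, 10.0],
              [2.5, 1.0, 2.5, 4.5, 4.5, 8.0, 9.0, 6.5, 10.0, 6.5],
              [2.0, 1.0, 4.5, 4.5, 4.5, 4.5, 8.0, 8.0, 8.0, 10.0]])
k, n = A.shape
R = A.sum(axis=0)                               # column rank sums
S = np.sum((R - k * (n + 1) / 2.0) ** 2)
T = sum(sum(t**3 - t for t in Counter(row).values()) / 12.0 for row in A)
W = S / (k * k * (n**3 - n) / 12.0 - k * T)
print(W, chi2.sf(k * (n - 1) * W, n - 1))       # ~0.8277 and ~0.0078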


3.9.6 Analysis of variance


In studying the distribution of the variance estimate from a sample of size n from a normal distribution with mean \mu and variance \sigma^2, you will have encountered the following decomposition of a sum of squares

\sum_{i=1}^{n}\left(\frac{y_i - \mu}{\sigma}\right)^2 = \sum_{i=1}^{n}\left(\frac{y_i - \bar{y}}{\sigma}\right)^2 + \left(\frac{\bar{y} - \mu}{\sigma/\sqrt{n}}\right)^2

into independent chi-square variables with n - 1 and 1 degree of freedom respectively. Analysis of variance is an extension of this procedure based on linear models, assuming normality and constant variance, then partitioning of chi-square variables (page 289) into two or more independent components, invoking Cochran's theorem (page 289) and comparing the ratios to F variables (page 289) with the appropriate degrees of freedom for variance ratio tests. It can be used, for instance, when you have a set of samples (column vectors) that come from normal distributions with the same variance and wish to test if all the samples have the same mean. Due to the widespread use of this technique, many people use it even though the original data are not normally distributed with the same variance, by applying variance stabilizing transformations (page 292), like the square root with counts, which can sometimes transform non-normal data into transformed data that are approximately normally distributed. An outline of the theory necessary for several widely used designs follows, but you should never make the common mistake of supposing that ANOVA is model free: ANOVA is always based upon data collected as replicates and organized into cells, where it is assumed that all the data are normally distributed with the same variance, but with mean values that differ from cell to cell according to an assumed general linear model.

3.9.6.1 ANOVA (1): 1-way and Kruskal-Wallis (n samples or library file)

This procedure is used when you have columns (i.e. samples) of normally distributed measurements with the same variance and wish to test if all the means are equal. With two columns it is equivalent to the two-sample unpaired t test (page 91), so it can be regarded as an extension of this test to cases with more than two columns. Suppose a random variable Y is measured for groups i = 1, 2, \dots, k and subjects j = 1, 2, \dots, n_i, and it is assumed that the appropriate general linear model for the n = \sum_{i=1}^{k} n_i observations is

y_{ij} = \mu + \alpha_i + e_{ij}, where \sum_{i=1}^{k} \alpha_i = 0,

where the errors e_{ij} are independently normally distributed with zero mean and common variance \sigma^2. Then the 1-way ANOVA null hypothesis is

H_0: \alpha_i = 0, for i = 1, 2, \dots, k,


that is, the means for all k groups are equal, and the basic equations are as follows:

\bar{y}_i = \sum_{j=1}^{n_i} y_{ij}/n_i

\bar{y} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}/n

\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{k} n_i(\bar{y}_i - \bar{y})^2

Total SSQ = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2, with DF = n - 1

Residual SSQ = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2, with DF = n - k

Group SSQ = \sum_{i=1}^{k} n_i(\bar{y}_i - \bar{y})^2, with DF = k - 1.
Here Total SSQ is the overall sum of squares, Group SSQ is the between-groups (i.e. among-groups) sum of squares, and Residual SSQ is the residual (i.e. within-groups, or error) sum of squares. The mean sums of squares and F value can be calculated from these using

Total SSQ = Residual SSQ + Group SSQ
Total DF = Residual DF + Group DF
Group MS = Group SSQ / Group DF
Residual MS = Residual SSQ / Residual DF
F = Group MS / Residual MS,

so that the degrees of freedom for the F variance ratio to test if the between-groups MS is significantly larger than the residual MS are k - 1 and n - k. The SIMFIT 1-way ANOVA procedure allows you to include or exclude selected groups, i.e., data columns, and to employ variance stabilizing transformations if required, but it also provides a nonparametric test, and it allows you to explore which column or columns differ significantly in the event of the F value leading to a rejection of H0.

As the assumptions of the linear model will not often be justified, the nonparametric Kruskal-Wallis test can be done at the same time, or as an alternative to the parametric 1-way ANOVA just described. This is in reality an extension of the Mann-Whitney U test (page 96) to k independent samples, which is designed to test H0: the medians are all equal. The test statistic H is calculated as

H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1),

where R_i is the sum of the ranks of the n_i observations in group i, and n = \sum_{i=1}^{k} n_i. This test is actually a 1-way ANOVA carried out on the ranks of the data. The p value is calculated exactly for small samples, but the fact that H approximately follows a \chi^2_{k-1} distribution is used for large samples. If there are ties, then H is corrected by dividing by \lambda, where

\lambda = 1 - \frac{\sum_{i=1}^{m} (t_i^3 - t_i)}{n^3 - n},

where t_i is the number of tied scores in the ith group of ties, and m is the number of groups of tied ranks. The test is 3/\pi times as powerful as the 1-way ANOVA test when the parametric test is justified, but it is more


powerful, and should always be used, if the assumptions of the linear normal model are not appropriate. As it is unusual for the sample sizes to be large enough to verify that all the samples are normally distributed and with the same variance, rejection of H0 in the Kruskal-Wallis test (which is the higher order analogue of the Mann-Whitney U test, just as 1-way ANOVA is the higher order analogue of the t test) should always be taken seriously.

To see how these tests work in practice, read in the matrix test file tukey.tf1, which refers to a data set where the column vectors are the groups, and you will get the results shown in table 3.52.

One Way Analysis of Variance: (Grand Mean 4.316E+01)
Transformation: x (untransformed data)
Source          SSQ       NDOF  MSQ       F         p
Between Groups  2.193E+03  4    5.484E+02 5.615E+01 0.0000
Residual        2.441E+02 25    9.765E+00
Total           2.438E+03 29

Kruskal-Wallis Nonparametric One Way Analysis of Variance
Test statistic  NDOF  p
2.330E+01       4     0.0001

Table 3.52: ANOVA example 1(a): 1-way and the Kruskal-Wallis test

The null hypothesis, that all the columns are normally distributed with the same mean and variance, would be rejected at the 5% significance level if p < 0.05, or at the 1% significance level if p < 0.01, which suggests, in this case, that at least one pair of columns differ significantly. Note that each time you do an analysis, the Kruskal-Wallis nonparametric test based on ranks can be done at the same time, or instead of ANOVA. In this case the same conclusion is reached but, of course, it is up to you which result to rely on. Also, you can interactively suppress or restore columns in the data set, and you can select variance stabilizing transformations if necessary (page 292). These can automatically divide sample values by 100 if your data are as percentages rather than proportions, and a square root, arc sine, logit or similar transformation is called for.
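Outside SIMFIT, the same pair of tests can be run side by side with scipy on hypothetical groups:

# One-way ANOVA and Kruskal-Wallis on the same three hypothetical samples.
from scipy.stats import f_oneway, kruskal

g1 = [48.1, 45.2, 50.0, 47.3, 46.8]
g2 = [40.6, 41.2, 39.8, 42.4, 40.1]
g3 = [44.0, 43.1, 45.5, 42.9, 44.8]
print(f_oneway(g1, g2, g3))   # parametric test of equal means
print(kruskal(g1, g2, g3))    # nonparametric test of equal medians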
3.9.6.2 ANOVA (1): Tukey Q test (n samples or library file)

This post-ANOVA procedure is used when you have k normal samples with the same variance and 1-way ANOVA suggests that at least one pair of columns differ significantly. For example, after analyzing tukeyq.tf1 as just described, then selecting the Tukey Q test, table 3.53 will be displayed. Note that the means are ranked, and columns with means between those of extreme columns that differ significantly are not tested, according to the protocol that is recommended for this test. This involves a systematic procedure where the largest mean is compared to the smallest, then the largest mean is compared with the second largest, and so on. If no difference is found between two means then it is concluded that no difference exists between any means enclosed by these two, and so no testing is done. Evidently, for these data, column 5 differs significantly from columns 1, 2, 3, and 4, and column 3 differs significantly from column 1. The test statistic Q for comparing columns A and B with sample sizes n_A and n_B is

Q = \frac{\bar{y}_B - \bar{y}_A}{SE},

where SE = \sqrt{s^2/n}, if n_A = n_B = n,
      SE = \sqrt{\frac{s^2}{2}\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}, if n_A \ne n_B,
      s^2 = error MS,

and the significance level for Q is calculated as a studentized range.


Tukey Q-test with 5 means and 10 comparisons
5% point = 4.189E+00, 1% point = 5.125E+00
Columns  Q             p         5%       1%       NB  NA
5 1      2.055E+01     0.0001    *        *        6   6
5 2      1.416E+01     0.0001    *        *        6   6
5 4      1.348E+01     0.0001    *        *        6   6
5 3      1.114E+01     0.0001    *        *        6   6
3 1      9.406E+00     0.0001    *        *        6   6
3 2      3.018E+00     0.2377    NS       NS       6   6
3 4      [[2.338E+00   0.4792]]  No-Test  No-Test  6   6
4 1      7.068E+00     0.0005    *        *        6   6
4 2      [[6.793E-01   0.9885]]  No-Test  No-Test  6   6
2 1      6.388E+00     0.0013    *        *        6   6
[ 5%] and/or [[ 1%]] No-Test results given for reference only

Table 3.53: ANOVA example 1(b): 1-way and the Tukey Q test
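The significance level for Q can be obtained from the studentized range distribution, available in scipy 1.7 or later; in this sketch the two means are hypothetical, while s^2 and the degrees of freedom echo the error mean square of table 3.52.

# Tukey Q for one pair of equal-size groups, p from the studentized range.
from math import sqrt
from scipy.stats import studentized_range

k, n, dof = 5, 6, 25          # number of means, group size, error dof
s2 = 9.765                    # error MS (cf. the residual MSQ of table 3.52)
ybarA, ybarB = 35.9, 55.4     # hypothetical pair of ranked means
Q = (ybarB - ybarA) / sqrt(s2 / n)
print(Q, studentized_range.sf(Q, k, dof))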
3.9.6.3 ANOVA (1): Plotting 1-way data

After analyzing a selected subset of data, possibly transformed, it is useful to be able to inspect the columns of data so as to identify obvious differences. This can be done by plotting the selected columns as a scattergram, or by displaying the selected columns as a box and whisker plot, with medians, quartiles and ranges. Alternatively, a bar chart can be constructed with the means of selected columns, and with error bars calculated for 95% confidence limits, or as selected multiples of the sample standard errors or sample standard deviations.
3.9.6.4 ANOVA (2): 2-way and the Friedman test (one matrix)

This procedure is used when you want to include row and column effects in a completely randomized design, i.e., assuming no interaction and one replicate per cell, so that the appropriate linear model is

y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}, where \sum_{i=1}^{r} \alpha_i = 0 and \sum_{j=1}^{c} \beta_j = 0,

for a data matrix with r rows and c columns, i.e. n = rc. The mean sums of squares and degrees of freedom for row and column effects are worked out, then the appropriate F and p values are calculated. Using R_i for the row sums, C_j for the column sums, and T = \sum_{i=1}^{r} R_i = \sum_{j=1}^{c} C_j for the sum of observations, these are

Row SSQ = \sum_{i=1}^{r} R_i^2/c - T^2/n, with DF = r - 1

Column SSQ = \sum_{j=1}^{c} C_j^2/r - T^2/n, with DF = c - 1

Total SSQ = \sum_{i=1}^{r}\sum_{j=1}^{c} y_{ij}^2 - T^2/n, with DF = n - 1

Residual SSQ = Total SSQ - Row SSQ - Column SSQ, with DF = (r - 1)(c - 1),

where Row SSQ is the between-rows sum of squares, Column SSQ is the between-columns sum of squares, Total SSQ is the total sum of squares, and Residual SSQ is the residual, or error, sum of squares. Now two F


statistics can be calculated from the mean sums of squares as

F_R = Rows MS / Residual MS
F_C = Columns MS / Residual MS.

The statistic F_R is compared with F(r - 1, (r - 1)(c - 1)) to test

H_R: \alpha_i = 0, i = 1, 2, \dots, r,

i.e., absence of row effects, while F_C is compared with F(c - 1, (r - 1)(c - 1)) to test

H_C: \beta_j = 0, j = 1, 2, \dots, c,

i.e., absence of column effects.

If the data matrix represents scores etc., rather than normally distributed variables with identical variances, then the matrix can be analyzed as a two-way table with k rows and l columns using the nonparametric Friedman 2-way ANOVA procedure, which is an analogue of the sign test (page 104) for multiple matched samples designed to test H0: all medians are equal, against the alternative, H1: they come from different populations. The procedure ranks column scores as r_{ij} for row i and column j, assigning average ranks for ties, works out rank sums as t_i = \sum_{j=1}^{l} r_{ij}, then calculates F_R given by

F_R = \frac{12}{kl(k+1)} \sum_{i=1}^{k} \left(t_i - l(k+1)/2\right)^2.

For small samples, exact significance levels are calculated, while for large samples it is assumed that F_R follows a \chi^2_{k-1} distribution. For practice you should try the test file anova2.tf1, which is analyzed as shown in table 3.54.

2-Way Analysis of Variance: (Grand mean 2.000E+00)
Source           SSQ       NDOF  MSSQ      F         p
Between rows     0.000E+00 17    0.000E+00 0.000E+00 1.0000
Between columns  8.583E+00  2    4.292E+00 5.421E+00 0.0090
Residual         2.692E+01 34    7.917E-01
Total            3.550E+01 53

Friedman Nonparametric Two-Way Analysis of Variance
Test Statistic = 8.583E+00
No. Deg. Free. = 2
Significance   = 0.0137

Table 3.54: ANOVA example 2: 2-way and the Friedman test

Note that there are now two p values for the two independent significance tests, and observe that, as in the previous 1-way ANOVA test, the corresponding nonparametric (Friedman) test can be done at the same time, or instead of the parametric test if required.
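scipy's Friedman test takes one argument per treatment, each measured on the same blocks; the scores below are hypothetical.

# Friedman nonparametric two-way ANOVA on three hypothetical treatments.
from scipy.stats import friedmanchisquare

t1 = [1, 2, 1, 1, 3, 2, 1, 2]
t2 = [2, 3, 3, 2, 1, 3, 3, 3]
t3 = [3, 1, 2, 3, 2, 1, 2, 1]
print(friedmanchisquare(t1, t2, t3))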


3.9.6.5 ANOVA (3): 3-way and Latin Square design (one matrix)

The linear model for an m by m Latin Square ANOVA is

y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + e_{ijk},
where \sum_{i=1}^{m} \alpha_i = 0, \sum_{j=1}^{m} \beta_j = 0, and \sum_{k=1}^{m} \gamma_k = 0,

where \alpha_i, \beta_j and \gamma_k represent the row, column and treatment effects, and e_{ijk} is assumed to be normally distributed with zero mean and variance \sigma^2. The sum of squares partition is now

Total SSQ = Row SSQ + Column SSQ + Treatment SSQ + Residual SSQ,

where the m^2 observations are arranged in the form of an m by m matrix so that every treatment occurs once in each row and column. This design, which is used for economical reasons to account for row, column, and treatment effects, leads to the three variance ratios

F_R = Row MS / Residual MS
F_C = Column MS / Residual MS
F_T = Treatment MS / Residual MS

to use in F tests with m - 1 and (m - 1)(m - 2) degrees of freedom. Note that SIMFIT data files for Latin square designs with m treatment levels have 2m rows and m columns, where the first m by m block identifies the treatments, and the next m by m block of data holds the observations. When designing such experiments, the particular Latin square used should be chosen randomly if possible, as described on page 197. For instance, try the test file anova3.tf1, which should be consulted for details, noting that integers (1, 2, 3, 4, 5) are used instead of the usual letters (A, B, C, D, E) in the data file header to indicate the position of the treatments. Note that, in table 3.55, there are now three p values for significance testing between rows, columns, and treatments.

Three Way Analysis of Variance: (Grand mean 7.186E+00)
Source      NDOF  SSQ       MSQ       F         p
Rows         4    2.942E+01 7.356E+00 9.027E+00 0.0013
Columns      4    2.299E+01 5.749E+00 7.055E+00 0.0037
Treatments   4    5.423E-01 1.356E-01 1.664E-01 0.9514
Error       12    9.779E+00 8.149E-01
Total       24    6.274E+01 2.614E+00
Row means:       8.136E+00 6.008E+00 8.804E+00 6.428E+00 6.552E+00
Column means:    5.838E+00 6.322E+00 7.462E+00 7.942E+00 8.364E+00
Treatment means: 7.318E+00 7.244E+00 7.206E+00 6.900E+00 7.260E+00

Table 3.55: ANOVA example 3: 3-way and Latin square design
3.9.6.6 ANOVA (4): Groups and subgroups (one matrix)

The linear models for ANOVA are easy to manipulate mathematically and trivial to implement in computer programs, and this has led to a vast number of possible designs for ANOVA procedures. This situation is likely to bewilder users, and may easily mislead the unwary, as it stretches credulity to the limit to believe that experiments, which almost invariably reflect nonlinear non-normal phenomena, can be analyzed in a meaningful way by such elementary models. Nevertheless, ANOVA remains valuable for preliminary data exploration, or in situations like clinical or agricultural trials, where only gross effects are of interest and precise modelling is out of the question, so a further versatile and flexible ANOVA technique is provided by SIMFIT for two-way hierarchical classification with subgroups of possibly unequal size, assuming a fixed effects model. Suppose, for instance, that there are k \ge 2 treatment groups, with group i subdivided into l_i treatment subgroups, where subgroup j contains n_{ij} observations. That is, observation y_{mij} is observation m in subgroup j of group i, where

1 \le i \le k, 1 \le j \le l_i, 1 \le m \le n_{ij}.


The between-groups, between-subgroups-within-groups, and residual sums of squares are

Group SSQ = \sum_{i=1}^{k} n_{i.}(\bar{y}_{.i.} - \bar{y}_{...})^2

Subgroup SSQ = \sum_{i=1}^{k}\sum_{j=1}^{l_i} n_{ij}(\bar{y}_{.ij} - \bar{y}_{.i.})^2

Residual SSQ = \sum_{i=1}^{k}\sum_{j=1}^{l_i}\sum_{m=1}^{n_{ij}} (y_{mij} - \bar{y}_{.ij})^2,

which, using l = \sum_{i=1}^{k} l_i and n = \sum_{i=1}^{k} n_{i.}, and normalizing, give the variance ratios

F_G = \frac{\text{Group SSQ}/(k-1)}{\text{Residual SSQ}/(n-l)}

F_S = \frac{\text{Subgroup SSQ}/(l-k)}{\text{Residual SSQ}/(n-l)}

to test for between-groups and between-subgroups effects. For practice, an appropriate test file is anova4.tf1, which should be consulted for details, and the results are shown in table 3.56. Of course, there are now two p values for significance testing and, also, note that, because this technique allows for many designs that cannot be represented by rectangular matrices, the data files must have three columns and n rows: column one contains the group numbers, column two contains the subgroup numbers, and column three contains the observations as a vector in the order of groups, and subgroups within groups. By defining groups and subgroups correctly, a large number of ANOVA techniques can be done using this procedure.

Groups/Subgroups 2-Way ANOVA
Transformation = x (untransformed data)
Source          SSQ       NDOF  F         p
Between Groups  4.748E-01  1    1.615E+01 0.0007
Subgroups       8.162E-01  6    4.626E+00 0.0047
Residual        5.587E-01 19
Total           1.850E+00 26

Group  Subgroup  Mean
1      1         2.100E+00
1      2         2.233E+00
1      3         2.400E+00
1      4         2.433E+00
1      5         1.800E+00
2      1         1.867E+00
2      2         1.860E+00
2      3         2.133E+00
Group 1 mean = 2.206E+00 (16 Observations)
Group 2 mean = 1.936E+00 (11 Observations)
Grand   mean = 2.096E+00 (27 Observations)

Table 3.56: ANOVA example 4: arbitrary groups and subgroups
3.9.6.7 ANOVA (5): Factorial design (one matrix)

Factorial ANOVA is employed when two or more factors are used together at more than one level, possibly with blocking, and the technique is best illustrated by a simple example. For instance, table 3.57 shows the results from analyzing data in the test file anova5.tf1, which has two factors, A and B say, but no blocking. The appropriate linear model is

y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk},

where there are a levels of factor A, b levels of factor B and n replicates per cell, that is, n observations at each fixed pair of i and j values. As usual, \mu is the mean, \alpha_i is the effect of A at level i, \beta_j is the effect of B at level j, (\alpha\beta)_{ij} is the effect of the interaction between A and B at levels i and j, and e_{ijk} is the random error



Table 3.56: ANOVA example 4: arbitrary groups and subgroups Factorial ANOVA Transformation = x (untransformed data) Source SSQ NDOF MS Blocks 0.000E+00 0 0.000E+00 Effect 1 (A) 1.386E+03 1 1.386E+03 Effect 2 (B) 7.031E+01 1 7.031E+01 Effect 3 (A*B) 4.900E+00 1 4.900E+00 Residual 3.664E+02 16 2.290E+01 Total 1.828E+03 19 Overall mean 2.182E+01 Treatment means Effect 1 1.350E+01 3.015E+01 Std.Err. of difference Effect 2 2.370E+01 1.995E+01 Std.Err. of difference Effect 3 1.488E+01 1.212E+01 Std.Err. of difference

F 0.000E+00 6.053E+01 3.071E+00 2.140E-01

p 0.0000 0.0000 0.0989 0.6499

in means = 2.140E+00

in means = 2.140E+00 3.252E+01 2.778E+01 in means = 3.026E+00

Table 3.57: ANOVA example 5: factorial design

component at replicate k. Also there are the necessary constraints that \sum_{i=1}^{a} \alpha_i = 0, \sum_{j=1}^{b} \beta_j = 0, \sum_{i=1}^{a} (\alpha\beta)_{ij} = 0, and \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0. The null hypotheses would be

H_0: \alpha_i = 0, for i = 1, 2, \dots, a

to test for the effects of factor A,

H_0: \beta_j = 0, for j = 1, 2, \dots, b

to test for the effects of factor B, and

H_0: (\alpha\beta)_{ij} = 0, for all i, j

to test for possible AB interactions. The analysis of variance table is based upon calculating F statistics as ratios of sums of squares that arise from the partitioning of the total corrected sum of squares as follows:

\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{...})^2
  = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} [(\bar{y}_{i..} - \bar{y}_{...}) + (\bar{y}_{.j.} - \bar{y}_{...}) + (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}) + (y_{ijk} - \bar{y}_{ij.})]^2
  = bn\sum_{i=1}^{a} (\bar{y}_{i..} - \bar{y}_{...})^2 + an\sum_{j=1}^{b} (\bar{y}_{.j.} - \bar{y}_{...})^2
    + n\sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij.})^2.

Factorial ANOVA
Transformation = x (untransformed data)
Source         SSQ       NDOF  MS        F         p
Blocks         0.000E+00  0    0.000E+00 0.000E+00 0.0000
Effect 1 (A)   1.386E+03  1    1.386E+03 6.053E+01 0.0000
Effect 2 (B)   7.031E+01  1    7.031E+01 3.071E+00 0.0989
Effect 3 (A*B) 4.900E+00  1    4.900E+00 2.140E-01 0.6499
Residual       3.664E+02 16    2.290E+01
Total          1.828E+03 19

Overall mean   2.182E+01
Treatment means
Effect 1       1.350E+01 3.015E+01
  Std.Err. of difference in means = 2.140E+00
Effect 2       2.370E+01 1.995E+01
  Std.Err. of difference in means = 2.140E+00
Effect 3       1.488E+01 1.212E+01 3.252E+01 2.778E+01
  Std.Err. of difference in means = 3.026E+00

Table 3.57: ANOVA example 5: factorial design

It is clear from the F statistics and significance levels p in table 3.57 that, with these data, A has a large effect, B has a small effect, and there is no significant interaction. Figure 3.14 illustrates a graphical technique for studying interactions in factorial ANOVA that can be very useful with limited data sets, say with only two factors.

[Figure 3.14: Plotting interactions in Factorial ANOVA. Marginal mean values for the cells A1B1, A1B2, A2B1, and A2B2 are plotted against the levels of factor A, indicating the effects of A and B.]

First of all, note that the factorial ANOVA table outputs results in standard order, e.g. A_1B_1, A_1B_2, A_2B_1, A_2B_2, and so on, while the actual coefficients \alpha_i, \beta_j, (\alpha\beta)_{ij} in the model can be estimated by subtracting the grand mean from the corresponding treatment means. In the marginals plot, the line connecting the circles is for observations with B at level 1 and the line connecting the triangles is for observations with B at level 2. The squares are the overall means of observations with factor A at level 1 (13.5) and level 2 (30.15), while the diamonds are the overall means of observations with factor B (i.e. 23.7 and 19.95) from table 3.57. Parallel lines indicate the lack of interaction between factors A and B, while the larger shift for variation in A, as opposed to the much smaller effect of changes in levels of B, merely reinforces the conclusions reached previously from the p values in table 3.57. If the data set contains blocking, as with test files anova5.tf2 and anova5.tf4, then there will be extra information in the ANOVA table corresponding to the blocks, e.g., to replace the values shown as zero in table 3.57, as there is no blocking with the data in anova5.tf1.
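The partitioning displayed above is easy to verify numerically; this sketch computes each term for a hypothetical 2 by 2 design with three replicates per cell and confirms that the four components sum to the total corrected sum of squares.

# Factorial ANOVA sums of squares for hypothetical y[i, j, k] data.
import numpy as np

y = np.array([[[11.0, 13.2, 12.4], [14.1, 15.9, 15.0]],
              [[28.6, 30.9, 29.5], [32.2, 34.0, 33.1]]])
a, b, n = y.shape
grand = y.mean()
yA = y.mean(axis=(1, 2))       # marginal means over levels of A
yB = y.mean(axis=(0, 2))       # marginal means over levels of B
yAB = y.mean(axis=2)           # cell means
ssA = b * n * np.sum((yA - grand) ** 2)
ssB = a * n * np.sum((yB - grand) ** 2)
ssAB = n * np.sum((yAB - yA[:, None] - yB[None, :] + grand) ** 2)
ssE = np.sum((y - yAB[:, :, None]) ** 2)
print(ssA + ssB + ssAB + ssE, np.sum((y - grand) ** 2))  # equal totals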
3.9.6.8 ANOVA (6): Repeated measures (one matrix)

This procedure is used when you have paired measurements and wish to test for absence of treatment effects. With two samples it is equivalent to the two-sample paired t test (page 93), so it can be regarded as an extension of this test to cases with more than two columns. If the rows of a data matrix represent the effects of different column-wise treatments on the same subjects, so that the values are serially correlated, and it is wished to test for significant treatment effects irrespective of differences between subjects, then a repeated-measurements design is appropriate. The simplest, model-free, approach is to treat this as a special case of 2-way ANOVA where only between-column effects are considered, and between-row effects, i.e., between-subject variances, are expected to be appreciable but are not considered. Many further specialized techniques are also possible when it is reasonable to attempt to model the treatment effects, e.g., when the columns represent observations in sequence of, say, time or drug concentration, but often such effects are best fitted by nonlinear rather than linear models. A useful way to visualize repeated-measurements ANOVA data with small samples (\le 12 subjects) is to input the matrix into the exhaustive analysis of a matrix procedure and plot the matrix with rows identified by different symbols. Table 3.58 shows the results from analyzing data in the test file anova6.tf1, which consists of three sections, a Mauchly sphericity test, the ANOVA table, and a Hotelling T^2 test, all of which will now be discussed.

In order for the normal two-way univariate ANOVA to be appropriate, sphericity of the covariance matrix of orthonormal contrasts is required. The test is based on an orthonormal contrast matrix, for example a Helmert matrix of the form

C = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} & 0 & 0 & \dots \\ 1/\sqrt{6} & 1/\sqrt{6} & -2/\sqrt{6} & 0 & \dots \\ 1/\sqrt{12} & 1/\sqrt{12} & 1/\sqrt{12} & -3/\sqrt{12} & \dots \\ \dots & \dots & \dots & \dots & \dots \end{pmatrix}

which, for m columns, has dimensions m - 1 by m, and where every row sum is zero, every row has length unity, and all the rows are orthogonal. Such Helmert contrasts compare each successive column mean with the average of the preceding (or following) column means but, in the subsequent discussion, any orthonormal contrast matrix leads to the same end result, namely, when the covariance matrix of orthonormal contrasts satisfies the sphericity condition, then the sums of squares used to construct the F test statistics will be independent chi-square variables, and the two-way univariate ANOVA technique will be the most powerful technique to test for equality of column means. The sphericity test uses the sample covariance matrix S to construct the Mauchly W statistic given by

W = \frac{|CSC^T|}{[\mathrm{Tr}(CSC^T)/(m-1)]^{m-1}}.

If S is estimated with \nu degrees of freedom then

\chi^2 = -\left(\nu - \frac{2m^2 - 3m + 3}{6(m-1)}\right) \log W

is approximately distributed as chi-square with m(m - 1)/2 - 1 degrees of freedom. Clearly, the results in table 3.58 show that the hypothesis of sphericity cannot be rejected, and the results from two-way ANOVA can be tentatively accepted. However, in some instances it may be necessary to alter the degrees of freedom for the F statistics, as discussed next.


Sphericity test on CV of Helmert orthonormal contrasts
H0: Covariance matrix = k*Identity (for some k > 0)
No. small eigenvalues  = 0 (i.e. < 1.00E-07)
No. of variables (k)   = 4
Sample size (n)        = 5
Determinant of CV      = 1.549E+02
Trace of CV            = 2.820E+01
Mauchly W statistic    = 1.865E-01
LRTS (-2*log(lambda))  = 4.572E+00
Degrees of Freedom     = 5
P(chi-square >= LRTS)  = 0.4704
e (Geisser-Greenhouse) = 0.6049
e (Huynh-Feldt)        = 1.0000
e (lower bound)        = 0.3333

Repeat-measures ANOVA: (Grand mean 2.490E+01)
Source      SSQ       NDOF  MSSQ      F         p
Subjects    6.808E+02  4
Treatments  6.982E+02  3    2.327E+02 2.476E+01 0.0000
                                                0.0006 (Greenhouse-Geisser)
                                                0.0000 (Huynh-Feldt)
                                                0.0076 (Lower-bound)
Remainder   1.128E+02 12    9.400E+00
Total       1.492E+03 19

Friedman Nonparametric Two-Way Analysis of Variance
Test Statistic = 1.356E+01
No. Deg. Free. = 3
Significance   = 0.0036

Hotelling one sample T-square test
H0: Column means are all equal
No. rows = 5, No. columns = 4
Hotelling T-square = 1.705E+02
F Statistic (FTS)  = 2.841E+01
Deg. Free. (d1,d2) = 3, 2
P(F(d1,d2) >= FTS) = 0.0342   Reject H0 at 5% sig.level

Table 3.58: ANOVA example 6: repeated measures

The model for univariate repeated measures with m treatments used once on each of n subjects is a mixed model of the form

y_{ij} = \mu + \tau_i + \beta_j + e_{ij},

where \tau_i is the fixed effect of treatment i so that \sum_{i=1}^{m} \tau_i = 0, and \beta_j is the random effect of subject j with


mean zero, and \sum_{j=1}^{n} \beta_j = 0. Hence the decomposition of the sum of squares is

\sum_{i=1}^{m}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{.j})^2 = n\sum_{i=1}^{m} (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^{m}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})^2,

that is,

SS(Within subjects) = SS(Treatments) + SS(Error),

with degrees of freedom

n(m - 1) = (m - 1) + (m - 1)(n - 1).

To test the hypothesis of no treatment effect, that is H_0: \tau_i = 0 for i = 1, 2, \dots, m, the appropriate test statistic would be

F = \frac{SS(\text{Treatments})/(m-1)}{SS(\text{Error})/[(m-1)(n-1)]},

but, to make this test more robust, it may be necessary to adjust the degrees of freedom when calculating critical levels. In fact the degrees of freedom should be taken as

Numerator degrees of freedom = \epsilon(m - 1)
Denominator degrees of freedom = \epsilon(m - 1)(n - 1),

where there are four possibilities for the correction factor \epsilon, all with 0 < \epsilon \le 1.

1. The default epsilon. This is \epsilon = 1, which is the correct choice if the sphericity criterion is met.

2. The Greenhouse-Geisser epsilon. This is

\epsilon = \frac{\left(\sum_{i=1}^{m-1} \lambda_i\right)^2}{(m-1)\sum_{i=1}^{m-1} \lambda_i^2},

where \lambda_i are the eigenvalues of the covariance matrix of orthonormal contrasts, and it could be used if the sphericity criterion is not met, although some argue that it is an ultraconservative estimate.

3. The Huynh-Feldt epsilon. This can also be used when the sphericity criterion is not met, and it is constructed from the Greenhouse-Geisser estimate \hat{\epsilon} as follows

a = n(m - 1)\hat{\epsilon} - 2
b = (m - 1)(n - G - (m - 1)\hat{\epsilon})
\epsilon = \min(1, a/b),

where G is the number of groups. It is generally recommended to use this estimate if the ANOVA probabilities given by the various adjustments differ appreciably.

4. The lower bound epsilon. This is defined as

\epsilon = 1/(m - 1),

which is the smallest value and results in using the F statistic with 1 and n - 1 degrees of freedom.


If the sphericity criterion is not met, then it is possible to use multivariate techniques such as MANOVA, as long as n > m, as these do not require sphericity, but they will always be less powerful than the univariate ANOVA just discussed. One possibility is to use the Hotelling T^2 test to see if the column means differ significantly, and the results displayed in table 3.58 were obtained in this way. Again a matrix C of orthonormal contrasts is used together with the vector of column means

\bar{y} = (\bar{y}_1, \bar{y}_2, \dots, \bar{y}_m)^T

to construct the statistic

T^2 = n(C\bar{y})^T (CSC^T)^{-1} (C\bar{y}),

since

\frac{(n - m + 1)T^2}{(n-1)(m-1)} \sim F(m - 1, n - m + 1)

if all column means are equal.
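To show how these quantities fit together, the sketch below builds orthonormal Helmert contrasts for a hypothetical 5 by 4 data matrix, then evaluates the Mauchly W statistic and the Hotelling T^2 test of equal column means; the data are invented, not those of anova6.tf1.

# Mauchly W and Hotelling T-square from Helmert orthonormal contrasts.
import numpy as np
from scipy.stats import f

Y = np.array([[30., 28., 16., 34.],
              [14., 18., 10., 22.],
              [24., 20., 18., 30.],
              [38., 34., 20., 44.],
              [26., 28., 14., 30.]])       # hypothetical subjects x treatments
n, m = Y.shape
C = np.zeros((m - 1, m))
for i in range(1, m):                      # orthonormal Helmert contrasts
    C[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
    C[i - 1, i] = -i / np.sqrt(i * (i + 1))
S = np.cov(Y, rowvar=False)                # sample covariance matrix
M = C @ S @ C.T
W = np.linalg.det(M) / (np.trace(M) / (m - 1)) ** (m - 1)
ybar = Y.mean(axis=0)
T2 = n * (C @ ybar) @ np.linalg.solve(M, C @ ybar)
FTS = (n - m + 1) * T2 / ((n - 1) * (m - 1))
print(W, T2, FTS, f.sf(FTS, m - 1, n - m + 1))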


3.9.7 Analysis of proportions


Suppose that a total of N observations can be classified into k categories with frequencies consisting of y_i observations in category i, so that 0 \le y_i \le N and \sum_{i=1}^{k} y_i = N; then there are k proportions defined as

p_i = y_i/N,

of which only k - 1 are independent due to the fact that

\sum_{i=1}^{k} p_i = 1.

If these proportions are then interpreted as estimates of the multinomial probabilities (page 284) and it is wished to make inferences about these probabilities, then we are in a situation that loosely can be described as analysis of proportions, or analysis of categorical data. Since the observations are integer counts and not measurements, they are not normally distributed, so techniques like ANOVA should not be used and specialized methods to analyze frequencies must be employed.
3.9.7.1 Dichotomous data

If there are only two categories, such as success or failure, male or female, dead or alive, etc., the data are referred to as dichotomous, and there is only one parameter to consider. So the analysis of two-category data is based on the binomial distribution (page 283), which is required when y successes have been recorded in N trials and it is wished to explore possible variations in the binomial parameter estimate

\hat{p} = y/N,

and its unsymmetrical confidence limits (see page 190), possibly as ordered by an indexing parameter x. The SIMFIT analysis of proportions procedure accepts a matrix of such y, N data, then calculates the binomial parameters and derived parameters such as the odds

Odds = \hat{p}/(1 - \hat{p}), where 0 < \hat{p} < 1,

and log(Odds), along with standard errors and confidence limits. It also does a chi-square contingency table test and a likelihood ratio test for common binomial parameters. Sometimes the proportions of successes in sample groups are in arbitrary order, but sometimes an actual indexing parameter is required, as when proportions in the same groups are evolving in time. As an example, read in binomial.tf2, which has (y, N) data, to see how a parameter x is added equal to the order in the data file. It will be seen from the results in table 3.59 that confidence limits are calculated for the parameter estimates and for the differences between parameter estimates, giving some idea which parameters differ significantly when compared independently. Logs of differences and odds with confidence limits can also be tabulated. You could then read in binomial.tf3 as an example of (y, N, x) data, to see what to do if a parameter has to be set. Note that tests are done on the data without referencing the indexing parameter x, but plotting the estimates of proportions with confidence limits depends upon the parameter x, if only for spacing out. Experiment with the various ways of plotting the proportions to detect significant differences visually, as when confidence limits do not overlap, indicating statistically significant differences. Figure 3.15 shows the data from binomial.tf2 plotted as binomial parameter estimates with the overall 95% confidence limits, along with the same data in Log-Odds format, obtained by transferring the data as Y = \hat{p}/(1 - \hat{p}) and X = x directly from the Log-Odds plot into the advanced graphics option, then choosing the reverse semi-log transformation, i.e., where x is mapped to y and log y is mapped to x. Note that the error bars are exact in all plots and are therefore unsymmetrical. Observe that, in figure 3.15, estimates are both above and below the overall mean (solid circle) and overall 95% confidence limits (dotted lines), indicating a significant partitioning into two groups, while the same conclusion can be reached by observing the error bar overlaps in the Log-Odds plot. If it is suspected that the parameter estimate is varying as a function of x or several independent variables, then logistic regression using the GLM option can be used. For further details of the error bars and more advanced plotting see page 242.


To test H0: equal binomial p-values
Sample-size/no. pairs = 5
Overall sum of Y      = 202
Overall sum of N      = 458
Overall estimate of p = 0.4410
Lower 95% con. limit  = 0.3950
Upper 95% con. limit  = 0.4879
-2 log lambda (-2LL)  = 1.183E+02, NDOF = 4
P(chi-sq. >= -2LL)    = 0.0000    Reject H0 at 1% s-level
Chi-sq. test stat (C) = 1.129E+02, NDOF = 4
P(chi-sq. >= C)       = 0.0000    Reject H0 at 1% s-level

 y    N   lower-95%  p-hat    upper-95%
 23   84  0.18214    0.27381  0.38201
 12   78  0.08210    0.15385  0.25332
 31  111  0.19829    0.27928  0.37241
 65   92  0.60242    0.70652  0.79688
 71   93  0.66404    0.76344  0.84542

Difference d(i,j) = p_hat(i) - p_hat(j)
Row(i) Row(j) lower-95%  d(i,j)    upper-95%                   Var(d(i,j))
1      2      -0.00455    0.11996   0.24448  not significant   0.00404
1      3      -0.13219   -0.00547   0.12125  not significant   0.00418
1      4      -0.56595   -0.43271  -0.29948  p( 1) < p( 4)     0.00462
1      5      -0.61829   -0.48963  -0.36097  p( 1) < p( 5)     0.00431
2      3      -0.24109   -0.12543  -0.00977  p( 2) < p( 3)     0.00348
2      4      -0.67543   -0.55268  -0.42992  p( 2) < p( 4)     0.00392
2      5      -0.72737   -0.60959  -0.49182  p( 2) < p( 5)     0.00361
3      4      -0.55224   -0.42724  -0.30225  p( 3) < p( 4)     0.00407
3      5      -0.60427   -0.48416  -0.36405  p( 3) < p( 5)     0.00376
4      5      -0.18387   -0.05692   0.07004  not significant   0.00420

Table 3.59: Analysis of proportions: dichotomous data


[Figure 3.15: Plotting analysis of proportions data. Left: p(x) with confidence limits plotted against the control variable x; right: the same data as a Log Odds plot of log10[p/(1 - p)] against the control parameter x.]


3.9.7.2 Confidence limits for analysis of two proportions

Given two proportions p_i and p_j estimated as

\hat{p}_i = y_i/N_i, \quad \hat{p}_j = y_j/N_j,


it is often wished to estimate confidence limits for the relative risk RR_{ij}, the difference between proportions DP_{ij}, and the odds ratio OR_{ij}, defined as

RR_{ij} = \hat{p}_i/\hat{p}_j
DP_{ij} = \hat{p}_i - \hat{p}_j
OR_{ij} = \frac{\hat{p}_i(1 - \hat{p}_j)}{\hat{p}_j(1 - \hat{p}_i)}.

First of all note that, for small proportions, the odds ratios and relative risks are similar in magnitude. Then it should be recognized that, unlike the case of single proportions, exact confidence limits cannot easily be estimated. However, approximate central 100(1 - \alpha)% confidence limits can be calculated using

\log(RR_{ij}) \pm Z_{\alpha/2}\sqrt{\frac{1 - \hat{p}_i}{N_i \hat{p}_i} + \frac{1 - \hat{p}_j}{N_j \hat{p}_j}}

DP_{ij} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}_i(1 - \hat{p}_i)}{N_i} + \frac{\hat{p}_j(1 - \hat{p}_j)}{N_j}}

\log(OR_{ij}) \pm Z_{\alpha/2}\sqrt{\frac{1}{y_i} + \frac{1}{N_i - y_i} + \frac{1}{y_j} + \frac{1}{N_j - y_j}},

provided \hat{p}_i and \hat{p}_j are not too close to 0 or 1. Here Z_{\alpha/2} is the upper 100(1 - \alpha/2) percentage point for the standard normal distribution, and confidence limits for RR_{ij} and OR_{ij} can be obtained using the exponential function.
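These three intervals are straightforward to evaluate; the sketch below uses the first two rows of table 3.59 (y = 23 of 84 and y = 12 of 78) as input, and the difference limits agree with the first row of the differences in that table.

# Approximate 95% limits for relative risk, difference, and odds ratio.
import numpy as np
from scipy.stats import norm

y1, N1, y2, N2 = 23, 84, 12, 78
p1, p2 = y1 / N1, y2 / N2
z = norm.ppf(0.975)
pm = np.array([-1.0, 1.0])                    # lower/upper multipliers

se_lrr = np.sqrt((1 - p1) / (N1 * p1) + (1 - p2) / (N2 * p2))
print((p1 / p2) * np.exp(pm * z * se_lrr))    # relative risk limits

se_dp = np.sqrt(p1 * (1 - p1) / N1 + p2 * (1 - p2) / N2)
print(p1 - p2 + pm * z * se_dp)               # difference limits

se_lor = np.sqrt(1/y1 + 1/(N1 - y1) + 1/y2 + 1/(N2 - y2))
print((p1 * (1 - p2)) / (p2 * (1 - p1)) * np.exp(pm * z * se_lor))  # odds ratio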
3.9.7.3 Meta analysis

A pair of success/failure classifications with y successes in N trials, i.e. with frequencies n_{11} = y_1, n_{12} = N_1 - y_1, n_{21} = y_2, and n_{22} = N_2 - y_2, results in a 2 by 2 contingency table, and meta analysis is used for exploring k sets of such 2 by 2 contingency tables. That is, each row of each table is a pair of numbers of successes and failures, so that the odds ratio in contingency table k can be defined as

\text{Odds ratio}_k = \frac{y_{1k}/(N_{1k} - y_{1k})}{y_{2k}/(N_{2k} - y_{2k})} = \frac{n_{11k} n_{22k}}{n_{12k} n_{21k}}.

Typically, the individual contingency tables would be for partitioning of groups before and after treatment, and a common situation would be where the aim of the meta analysis would be to assess differences between the results summarized in the individual contingency tables, or to construct a best possible odds ratio taking into account the sample sizes for appropriate weighting. Suppose, for instance, that contingency table number k is

n_{11k}  n_{12k}  n_{1+k}
n_{21k}  n_{22k}  n_{2+k}
n_{+1k}  n_{+2k}  n_{++k},

where the marginals are indicated by plus signs in the usual way. Then, assuming conditional independence and a hypergeometric distribution (page 284), the mean and variance of n_{11k} are given by

E(n_{11k}) = n_{1+k} n_{+1k}/n_{++k}
V(n_{11k}) = \frac{n_{1+k} n_{2+k} n_{+1k} n_{+2k}}{n_{++k}^2 (n_{++k} - 1)},

and, to test for significant differences between m contingency tables, the Cochran-Mantel-Haenszel test statistic CMH, given by

CMH = \frac{\left[\left|\sum_{k=1}^{m} (n_{11k} - E(n_{11k}))\right| - \frac{1}{2}\right]^2}{\sum_{k=1}^{m} V(n_{11k})},


can be regarded as an approximately chi-square variable with one degree of freedom. Some authors omit the continuity correction, and sometimes the variance estimate is taken to be

\hat{V}(n_{11k}) = n_{1+k} n_{2+k} n_{+1k} n_{+2k}/n_{++k}^3.

As an example, read in meta.tf1 and observe the calculation of the test statistic as shown in table 3.60.

No. of 2 by 2 tables = 8
To test H0: equal binomial p-values
Sample-size/no. pairs = 16
Overall sum of Y      = 4081
Overall sum of N      = 8419
Overall estimate of p = 0.4847
Lower 95% con. limit  = 0.4740
Upper 95% con. limit  = 0.4955
-2 log lambda (-2LL)  = 3.109E+02, NDOF = 15
P(chi-sq. >= -2LL)    = 0.0000   Reject H0 at 1% s-level
Chi-sq. test stat (C) = 3.069E+02, NDOF = 15
P(chi-sq. >= C)       = 0.0000   Reject H0 at 1% s-level

Cochran-Mantel-Haenszel 2 x 2 x k Meta Analysis
   y     N   Odds Ratio  E[n(1,1)]   Var[n(1,1)]
 126   226   2.19600     113.00000    16.89720
  35    96
 908  1596   2.14296     773.23448   179.30144
 497  1304
 913  1660   2.17526     799.28296   149.27849
 336   934
 235   407   2.85034     203.50000    31.13376
  58   179
 402   710   2.31915     355.00000    57.07177
 121   336
 182   338   1.58796     169.00000    28.33333
  72   170
  60   159   2.36915      53.00000     9.00000
  11    54
 104   193   2.00321      96.50000    11.04518
  21    57
H0: conditional independence (all odds ratios = 1)
CMH Test Statistic = 2.794E+02
P(chi-sq. >= CMH)  = 0.0000   Reject H0 at 1% s-level
Common Odds Ratio  = 2.174E+00, 95%cl = (1.914E+00, 2.471E+00)

Overall 2 by 2 table
   y     N - y
 2930   2359
 1151   1979
Overall Odds Ratio = 2.136E+00, 95%cl = (1.950E+00, 2.338E+00)

Table 3.60: Analysis of proportions: meta analysis


The estimated common odds ratio \hat{\theta}_{MH} presented in table 3.60 is calculated allowing for random effects using

\hat{\theta}_{MH} = \frac{\sum_{k=1}^{m} (n_{11k} n_{22k}/n_{++k})}{\sum_{k=1}^{m} (n_{12k} n_{21k}/n_{++k})},

while the variance is used to construct the confidence limits from

\hat{\sigma}^2[\log \hat{\theta}_{MH}] = \frac{\sum_{k=1}^{m} (n_{11k} + n_{22k}) n_{11k} n_{22k}/n_{++k}^2}{2\left(\sum_{k=1}^{m} n_{11k} n_{22k}/n_{++k}\right)^2}
  + \frac{\sum_{k=1}^{m} [(n_{11k} + n_{22k}) n_{12k} n_{21k} + (n_{12k} + n_{21k}) n_{11k} n_{22k}]/n_{++k}^2}{2\left(\sum_{k=1}^{m} n_{11k} n_{22k}/n_{++k}\right)\left(\sum_{k=1}^{m} n_{12k} n_{21k}/n_{++k}\right)}
  + \frac{\sum_{k=1}^{m} (n_{12k} + n_{21k}) n_{12k} n_{21k}/n_{++k}^2}{2\left(\sum_{k=1}^{m} n_{12k} n_{21k}/n_{++k}\right)^2}.
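As a check on these expressions, the sketch below computes the continuity-corrected CMH statistic and the Mantel-Haenszel common odds ratio for a list of 2 by 2 tables; the two strata used are the first and seventh tables of table 3.60, written out as (successes, failures) rows.

# Cochran-Mantel-Haenszel statistic and common odds ratio for 2x2xk tables.
import numpy as np
from scipy.stats import chi2

def cmh(tables):
    num = var = rs = ss = 0.0
    for t in map(np.asarray, tables):
        ntot = t.sum()
        r1, r2 = t[0].sum(), t[1].sum()
        c1, c2 = t[:, 0].sum(), t[:, 1].sum()
        num += t[0, 0] - r1 * c1 / ntot                     # n11k - E(n11k)
        var += r1 * r2 * c1 * c2 / (ntot**2 * (ntot - 1))   # V(n11k)
        rs += t[0, 0] * t[1, 1] / ntot                      # for the common OR
        ss += t[0, 1] * t[1, 0] / ntot
    stat = (abs(num) - 0.5) ** 2 / var
    return stat, chi2.sf(stat, 1), rs / ss                  # CMH, p, OR_MH

print(cmh([[[126, 100], [35, 61]],        # stratum 1 of table 3.60
           [[60, 99], [11, 43]]]))        # stratum 7 of table 3.60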

Also, in table 3.60, the overall 2 by 2 contingency table using the pooled sample assuming a fixed effects model is listed for reference, along with the overall odds ratio and estimated confidence limits calculated using the expressions presented previously for an arbitrary log odds ratio (page 127).

Table 3.61 illustrates another technique to study sets of 2 by 2 contingency tables. SIMFIT can calculate all the standard probability statistics for sets of paired experiments. In this case the pairwise differences are illustrated along with the number needed to treat, i.e. NNT = 1/d, but it should be remembered that such estimates have to be interpreted with care. For instance, the differences and log ratios change sign when the rows are interchanged.

Difference d(i,j) = p_hat(i) - p_hat(j)
Row(i) Row(j) d(i,j)   lower-95%  upper-95%                 Var(d)   NNT=1/d
 1      2     0.19294  0.07691    0.30897  p( 1) > p( 2)    0.00350  5
 3      4     0.18779  0.15194    0.22364  p( 3) > p( 4)    0.00033  5
 5      6     0.19026  0.15127    0.22924  p( 5) > p( 6)    0.00040  5
 7      8     0.25337  0.16969    0.33706  p( 7) > p( 8)    0.00182  4
 9     10     0.20608  0.14312    0.26903  p( 9) > p(10)    0.00103  5
11     12     0.11493  0.02360    0.20626  p(11) > p(12)    0.00217  9
13     14     0.17365  0.04245    0.30486  p(13) > p(14)    0.00448  6
15     16     0.17044  0.02682    0.31406  p(15) > p(16)    0.00537  6

Table 3.61: Analysis of proportions: risk difference

Figure 3.16 shows the Log-Odds-Ratio plot with confidence limits resulting from this analysis, after transferring to advanced graphics as just described for ordinary analysis of proportions. The relative position of the data with respect to the line Log-Odds-Ratio = 0 clearly indicates a shift from 50:50, but non-disjoint confidence limits do not suggest statistically significant differences. For further details of the error bars and more advanced plotting see page 242.

[Figure 3.16: Meta analysis and log odds ratios. log10[Odds Ratios] with confidence limits plotted against the control parameter x.]

Contingency table analysis is compromised when cells have zero frequencies, as many of the usual summary statistics become undefined. Structural zeros are handled by applying loglinear GLM analysis, but sampling zeros presumably arise from small samples with extreme probabilities. Such tables can be analyzed by exact methods, but usually a positive constant is added to all the frequencies to avoid the problems.


Figure 3.16: Meta analysis and log odds ratios (log10[Odds Ratios] with confidence limits plotted against the control parameter x)

Such tables can be analyzed by exact methods, but usually a positive constant is added to all the frequencies to avoid the problems. Table 3.62 illustrates how this problem is handled in SIMFIT when analyzing data in the test file meta.tf4, the correction of adding 0.01 to all contingency table frequencies being indicated.

Cochran-Mantel-Haenszel 2 x 2 x k Meta Analysis
  y    N    Odds Ratio    E[n(1,1)]    Var[n(1,1)]
*** 0.01 added to all cells for next calculation
  0    6       0.83361      0.01091       0.00544
  0    5
*** 0.01 added to all cells for next calculation
  3    6     601.00000      1.51000       0.61686
  0    6
*** 0.01 added to all cells for next calculation
  6    6    1199.00995      4.01000       0.73008
  2    6
*** 0.01 added to all cells for next calculation
  5    6       0.00825      5.51000       0.25454
  6    6
*** 0.01 added to all cells for next calculation
  2    2       0.40120      2.01426       0.00476
  5    5
H0: conditional independence (all odds ratios = 1)
CMH Test Statistic = 3.862E+00
P(chi-sq. >= CMH)  = 0.0494     Reject H0 at 5% s-level
Common Odds Ratio  = 6.749E+00, 95%cl = (1.144E+00, 3.981E+01)

Table 3.62: Analysis of proportions: meta analysis with zero frequencies

Values ranging from 0.00000001 to 0.5 have been suggested elsewhere for this purpose, but all such choices are a compromise and, if possible, sampling should be continued until all frequencies are nonzero.
3.9.7.4 Bioassay, estimating percentiles

Where it is required to construct a dose response curve from sets of (y, N ) data at different levels of an independent variable, x, it is sometimes useful to apply probit analysis or logistic regression to estimate percentiles, like LD50 (page 74) using generalized linear models (page 50). To observe how this works, read in the test le ld50.tf1 and try the various options for choosing models, plotting graphs and examining residuals.
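The GLM route to a percentile can be sketched outside SIMFIT as well. The following is a minimal illustration using the Python statsmodels package with made-up (y, N, x) values, not the contents of ld50.tf1; the probit link on log10(dose) and the -b0/b1 percentile formula are the stated assumptions here.

import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])   # dose levels (hypothetical)
y = np.array([2, 8, 15, 23, 27])           # responders out of N at each dose
N = np.array([30, 30, 30, 30, 30])

X = sm.add_constant(np.log10(x))           # probit regression on log10(dose)
model = sm.GLM(np.column_stack([y, N - y]), X,
               family=sm.families.Binomial(link=sm.families.links.Probit()))
fit = model.fit()
b0, b1 = fit.params
ld50 = 10 ** (-b0 / b1)                    # dose where P(response) = 0.5
print(f"LD50 estimate: {ld50:.3f}")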
3.9.7.5 Trichotomous data

This procedure is used when an experiment has three possible outcomes, e.g., an egg can fail to hatch, hatch male, or hatch female, and you wish to compare the outcome from a set of experiments. For example, read in trinom.tf1 then trinom.tf2 to see how to detect signicant differences graphically (i.e., where there are non-overlapping condence regions) in trinomial proportions, i.e., where groups can be split into three categories. For details of the trinomial distribution see page 284 and for plotting contours see page 257.
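A rough sketch (not SIMFIT code) of one such confidence region follows, using the large-sample normal approximation with the multinomial covariance matrix and the chi-square(2) 95% point; the counts are hypothetical and the normal approximation is the assumption being made.

import numpy as np
import matplotlib.pyplot as plt

counts = np.array([70, 110, 20])          # e.g. failed, male, female (hypothetical)
n = counts.sum()
px, py = counts[0] / n, counts[1] / n

cov = np.array([[px * (1 - px), -px * py],
                [-px * py,      py * (1 - py)]]) / n
chi2_95 = 5.991                            # chi-square(2) 95% point
vals, vecs = np.linalg.eigh(cov)
t = np.linspace(0, 2 * np.pi, 200)
circle = np.stack([np.cos(t), np.sin(t)])
ellipse = vecs @ (np.sqrt(vals * chi2_95)[:, None] * circle)
plt.plot(px + ellipse[0], py + ellipse[1])  # approximate 95% region for (px, py)
plt.plot(px, py, "k+")
plt.xlabel("px"); plt.ylabel("py"); plt.show()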


3.9.8 Multivariate statistics


3.9.8.1 Correlation: parametric (Pearson product moment)

Given any set of n nonsingular $(x_i, y_i)$ pairs, a correlation coefficient r can be calculated as

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where $-1 \le r \le 1$ and, using $b_{xy}$ for the slope of the regression of X on Y, and $b_{yx}$ for the slope of the regression of Y on X, $r^2 = b_{yx}b_{xy}$. However, only when X is normally distributed given Y, and Y is normally distributed given X, can simple statistical tests be used for significant linear correlation. Figure 3.17 illustrates how the elliptical contours of constant probability for a bivariate normal distribution discussed on page 287 are aligned with the X and Y axes when X and Y are uncorrelated, i.e., $\rho = 0$, but are inclined otherwise. In this example $\mu_X = \mu_Y = 0$ and $\sigma_X = \sigma_Y = 1$, but in the upper figure $\rho = 0$, while in the lower figure $\rho = 0.9$.
Figure 3.17: Bivariate density surfaces and contours (upper: bivariate normal with rho = 0; lower: bivariate normal with rho = 0.9; each shown as a density surface over X and Y together with a keyed contour plot)
The Pearson product moment correlation coefficient r is an estimator of $\rho$, and it can be used to test for independence of X and Y. For instance, when the $(x_i, y_i)$ pairs are from such a bivariate normal distribution, the statistic

$$t = r\sqrt{\frac{n-2}{1-r^2}}$$


has a Student's t-distribution with n - 2 degrees of freedom. The SIMFIT product moment correlation procedure can be used when you have a data matrix X consisting of m > 1 columns of n > 1 measurements (not counts or categorical data) and wish to test for pairwise linear correlations, i.e., where pairs of columns can be regarded as consistent with a joint normal distribution. In matrix notation, the relationships between such a n by m data matrix X, the same matrix Y after centering by subtracting each column mean from the corresponding column, the sum of squares and products matrix C, the covariance matrix S, the correlation matrix R, and the diagonal matrix D of standard deviations are

$$C = Y^T Y$$
$$S = \frac{1}{n-1}C$$
$$D = \mathrm{diag}(\sqrt{s_{11}}, \sqrt{s_{22}}, \ldots, \sqrt{s_{mm}})$$
$$R = D^{-1}SD^{-1}$$
$$S = DRD.$$

So, for all pairs of columns, the sample correlation coefficients $r_{jk}$ are given by

$$r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}s_{kk}}}, \quad \text{where } s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k),$$

and the corresponding $t_{jk}$ values and significance levels $p_{jk}$ are calculated, then output in matrix format with the correlations as a strict upper triangular matrix and the significance levels as a strict lower triangular matrix. Table 3.63 shows the results from analyzing the test file g02baf.tf1, which refers to a set of 3 column vectors of length 5. To be more precise, the values $a_{ij}$ for matrix A in table 3.63 are interpreted as now described.

Matrix A, Pearson correlation results
Upper triangle = r, Lower = corresponding two-tail p values
 .....   -0.5704   0.1670
0.3153    .....   -0.7486
0.7883    0.1455    .....
Test for absence of any significant correlations
H0: correlation matrix is the identity matrix
Determinant         = 2.290E-01
Test statistic (TS) = 3.194E+00
Degrees of freedom  = 3
P(chi-sq >= TS)     = 0.3627

Table 3.63: Correlation: Pearson product moment analysis

For j > i in the upper triangle, $a_{ij} = r_{ij} = r_{ji}$ are the correlation coefficients, while for i > j in the lower triangle $a_{ij} = p_{ij} = p_{ji}$ are the corresponding two-tail probabilities. The self-correlations are all 1, of course, and so they are represented by dotted lines. Table 3.63 indicates that none of the correlations are significant in this case, that is, the probability of obtaining such pairwise linearity in a random swarm of points is not low. After the correlation matrix, the results of a likelihood ratio test for the absence of significant correlations are displayed. To test the hypothesis of no significant correlations, i.e. H0: the covariance matrix is diagonal, or equivalently H0: the correlation matrix R is the identity matrix, the statistic

$$-2\log\lambda = -\left(n - \frac{2m+11}{6}\right)\log|R|$$


is used, which has the asymptotic chi-square distribution with m(m - 1)/2 degrees of freedom. After the results have been calculated you can choose pairs of columns for further analysis, as shown for the test file cluster.tf1 in table 3.64, where there seem to be significant correlations. First the test for significant correlation was done, then columns 1 and 2 were selected for further analysis, consisting of all the statistics necessary to study the regression of column 1 on column 2 and vice versa.

Test for absence of any significant correlations
H0: correlation matrix is the identity matrix
Determinant         = 2.476E-03
Test statistic (TS) = 4.501E+01
Degrees of freedom  = 28
P(chi-sq >= TS)     = 0.0220     Reject H0 at 5% sig.level

For the next analysis: X is column 1, Y is column 2
Unweighted linear regression for y = A + B*x and x = C + D*y
mean of 12 x-values = 8.833E+00, std. dev. x = 5.781E+00
mean of 12 y-values = 9.917E+00, std. dev. y = 7.597E+00
Parameter   Estimate    Std.Err.    Est./Std.Err.   p
B (slope)   6.958E-01   3.525E-01   1.974E+00       0.0766
A (const)   3.770E+00   3.675E+00   1.026E+00       0.3291
r (Ppmcc)   5.295E-01                               0.0766
r-squared   2.804E-01, y-variation due to x = 28.04%
z(Fisher)   5.895E-01
Note: z = (1/2)log[(1 + r)/(1 - r)], r^2 = B*D, and
sqrt[(n-2)/(1-r^2)]r = Est./Std.Err. for B and D
The Pearson product-moment corr. coeff. r estimates rho and
95% conf. limits using z are -0.0771 =< rho =< 0.8500
Source             Sum of squares  ndof  Mean square  F-value
due to regression  1.780E+02        1    1.780E+02    3.896E+00
about regression   4.569E+02       10    4.569E+01
total              6.349E+02       11
Conclusion: m is not significantly different from zero (p > 0.05)
            c is not significantly different from zero (p > 0.05)
The two best-fit unweighted regression lines are:
y = 3.770E+00 + 6.958E-01*x,  x = 4.838E+00 + 4.029E-01*y

Table 3.64: Correlation: analysis of selected columns

Various graphical techniques are then possible to visualize correlation between the columns selected by superimposing best-fit lines or confidence region ellipses (page 244). Highly significant linear correlation is indicated by best-fit lines with similar slopes as well as r values close to 1 and small p values. Note that, after a correlation analysis, a line can be added to the scattergram to indicate the extent of rotation of the axes of the ellipses from coincidence with the X, Y axes. You can plot either both unweighted regression lines, the unweighted reduced major axis line, or the unweighted major axis line. Plotting both lines is the most useful and least controversial; plotting the reduced major axis line, which minimizes the sum of the areas of triangles between data and best-fit line, is favored by some for allometry; while the major axis minimizes the sum of squared differences between the data and line and should only be used when both variables have similar ranges and units of measurement. If a single line must be plotted to summarize the overall correlation, it should be either the reduced or major axis line, since these allow for uncertainty in both variables. It should not be either of


the usual regression lines, since the line plotted should be independent of which variable is regarded as x and which is regarded as y.
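For readers who want to check such results independently, here is a small sketch (not SIMFIT code) computing r, the t statistic defined above, and Fisher z confidence limits for rho with scipy; the paired data are hypothetical.

import numpy as np
from scipy import stats

x = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 9.0])   # hypothetical paired data
y = np.array([2.0, 3.0, 5.0, 5.0, 9.0, 11.0])

r, p = stats.pearsonr(x, y)
n = len(x)
t = r * np.sqrt((n - 2) / (1 - r * r))          # the t statistic above

z = 0.5 * np.log((1 + r) / (1 - r))             # Fisher transformation
se = 1 / np.sqrt(n - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"r = {r:.4f}, t = {t:.4f}, p = {p:.4f}, "
      f"95% c.l. for rho: ({lo:.4f}, {hi:.4f})")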
3.9.8.2 Correlation: nonparametric (Kendall tau and Spearman rank)

These nonparametric procedures can be used when the data matrix does not consist of columns of normally distributed measurements, but may contain counts or categorical variables, etc., so that the conditions for Pearson product moment correlation are not satisfied and ranks have to be used. Suppose, for instance, that the data matrix has n rows (observations) and m columns (variables) with n > 1 and m > 1; then the $x_{ij}$ are replaced by the corresponding column-wise ranks $y_{ij}$, where groups of tied values are replaced by the average of the ranks that would have been assigned in the absence of ties. Kendall's tau $\tau_{jk}$ for variables j and k is then defined as

$$\tau_{jk} = \frac{\sum_{h=1}^{n}\sum_{i=1}^{n} f(y_{hj} - y_{ij})\,f(y_{hk} - y_{ik})}{\sqrt{[n(n-1) - T_j][n(n-1) - T_k]}}$$

where f(u) = 1 if u > 0, = 0 if u = 0, = -1 if u < 0, and $T_j = \sum t_j(t_j - 1)$. Here $t_j$ is the number of ties at successive tied values of variable j, and the summation is over all tied values. For large samples $\tau_{jk}$ is approximately normally distributed with

$$\mu = 0, \quad \sigma^2 = \frac{4n + 10}{9n(n-1)},$$

which can be used as a test for the absence of correlation. Another alternative is to calculate Spearman's rank coefficient $c_{jk}$, defined as

$$c_{jk} = \frac{n(n^2 - 1) - 6\sum_{i=1}^{n}(y_{ij} - y_{ik})^2 - (T_j + T_k)/2}{\sqrt{[n(n^2 - 1) - T_j][n(n^2 - 1) - T_k]}}$$

where now $T_j = \sum t_j(t_j^2 - 1)$, and a test can be based on the fact that, for large samples, the statistic

$$t_{jk} = c_{jk}\sqrt{\frac{n-2}{1-c_{jk}^2}}$$

is approximately t-distributed with n - 2 degrees of freedom. For example, read in and analyze the test file npcorr.tf1 as previously to obtain table 3.65. To be more precise, matrices A and B in table 3.65 are to be interpreted as follows. In the first matrix A, for j > i in the upper triangle, $a_{ij} = c_{ij} = c_{ji}$ are Spearman correlation coefficients, while for i > j in the lower triangle $a_{ij} = \tau_{ij} = \tau_{ji}$ are the corresponding Kendall coefficients. In the second matrix B, for j > i in the upper triangle, $b_{ij} = p_{ij} = p_{ji}$ are two-tail probabilities for the corresponding $c_{ij}$ coefficients, while for i > j in the lower triangle $b_{ij} = p_{ij} = p_{ji}$ are the corresponding two-tail probabilities for the corresponding $\tau_{ij}$. Note that, from these matrices, $\tau_{jk}$, $c_{jk}$ and $p_{jk}$ values are given for all possible correlations j, k. Also, note that these nonparametric correlation tests are tests for monotonicity rather than linear correlation but, as with the previous parametric test, the columns of data must be of the same length and the values must be ordered according to some correlating influence such as multiple responses on the same animals. If the number of categories is small or there are many ties, then Kendall's tau is to be preferred, and conversely. Since you are not testing for linear correlation you should not add regression lines when plotting such correlations.


Nonparametric correlation results
Matrix A: Upper triangle = Spearman's, Lower = Kendall's tau
 .....   0.2246   0.1186
0.0294   .....    0.3814
0.1176   0.2353   .....
Matrix B: Two tail p-values
 .....   0.5613   0.7611
0.9121   .....    0.3112
0.6588   0.3772   .....

Table 3.65: Correlation: Kendall-tau and Spearman-rank
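Both statistics are also available in scipy, so a brief sketch (not SIMFIT output, and with a hypothetical data matrix) of the same pairwise analysis is:

import numpy as np
from scipy import stats

data = np.array([[1.7, 1.0, 0.5],
                 [2.8, 2.0, 1.0],
                 [0.6, 3.0, 2.5],
                 [1.9, 4.0, 1.5],
                 [3.3, 5.0, 2.0]])   # hypothetical n by m data matrix

m = data.shape[1]
for j in range(m):
    for k in range(j + 1, m):
        tau, p_tau = stats.kendalltau(data[:, j], data[:, k])
        rho, p_rho = stats.spearmanr(data[:, j], data[:, k])
        print(f"cols {j+1},{k+1}: tau = {tau:.4f} (p = {p_tau:.4f}), "
              f"Spearman = {rho:.4f} (p = {p_rho:.4f})")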

3.9.8.3 Correlation: partial

Partial correlations are useful when it is believed that some subset of the variables in a multivariate data set can realistically be regarded as normally distributed random variables, and correlation analysis is required for this subset of variables, conditional upon the remaining variables being regarded as fixed at their current values. This is most easily illustrated in the case of three variables, and table 3.66 illustrates the calculation of partial correlation coefficients, together with significance tests and confidence limits, for the correlation matrix in pacorr.tf1.

Partial correlation data: 1=Intelligence, 2=Weight, 3=Age
1.0000   0.6162   0.8267
         1.0000   0.7321
                  1.0000
No. variables = 3, sample size = 30
r(1,2) = 0.6162, r(1,3) = 0.8267, r(2,3) = 0.7321
r(1,2|3) = 0.0286 (95% c.l. = -0.3422, 0.3918)
  t = 1.488E-01, ndof = 27, p = 0.8828
r(1,3|2) = 0.7001 (95% c.l. = 0.4479, 0.8490)
  t = 5.094E+00, ndof = 27, p = 0.0000, Reject H0 at 1% sig.level
r(2,3|1) = 0.5025 (95% c.l. = 0.1659, 0.7343)
  t = 3.020E+00, ndof = 27, p = 0.0055, Reject H0 at 1% sig.level

Table 3.66: Correlation: partial

Assuming a multivariate normal distribution and linear correlations, the partial correlations between any two variables from the set i, j, k conditional upon the third can be calculated using the usual correlation coefficients as

$$r_{i,j|k} = \frac{r_{ij} - r_{ik}r_{jk}}{\sqrt{(1 - r_{ik}^2)(1 - r_{jk}^2)}}.$$

If there are p variables in all but p - q are fixed, then the sample size n can be replaced by n - (p - q) in the usual significance tests and estimation of confidence limits, e.g. n - (p - q) - 2 for a t test. From table 3.66 it is clear that when variable 3 is regarded as fixed, the correlation between variables 1 and 2 is not significant


but, when either variable 1 or variable 2 is regarded as fixed, there is evidence for significant correlation between the other variables. The situation is more involved when there are more than three variables, say $n_x$ X variables which can be regarded as fixed, and the remaining $n_y$ Y variables for which partial correlations are required conditional on the fixed variables. Then the variance-covariance matrix can be partitioned as in

$$\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}$$

when the variance-covariance of Y conditional upon X is given by

$$\Sigma_{y|x} = \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy},$$

while the partial correlation matrix R is calculated by normalizing as

$$R = \mathrm{diag}(\Sigma_{y|x})^{-\frac{1}{2}}\,\Sigma_{y|x}\,\mathrm{diag}(\Sigma_{y|x})^{-\frac{1}{2}}.$$

This analysis requires a technique for indicating that a full correlation matrix is required for all the variables, but then in a subsequent step some variables are to be regarded as X variables, and others as Y variables. All this can be done interactively, but SIMFIT provides a convenient method for doing this directly from the data file. For instance, at the end of the test file g02byf.tf1, which is a full data set, not a correlation matrix, will be found the additional lines

begin{indicators}
-1 -1 1
end{indicators}

and the indicator variables have the following significance. A value of -1 indicates that the corresponding variable is to be used in the calculation of the full correlation matrix, but then this variable is to be regarded as a Y variable when the partial correlation matrix is calculated. A value of 1 indicates that the variable is to be included in the calculation of the full correlation matrix, then regarded as an X variable when the partial correlation matrix is to be calculated. Any values of 0 indicate that the corresponding variables are to be suppressed. Table 3.67 illustrates the successive results for test file g02byf.tf1 when Pearson correlation is performed, followed by partial correlation with variables 1 and 2 regarded as Y and variable 3 regarded as X.

Matrix A, Pearson product moment correlation results:
Upper triangle = r, Lower = corresponding two-tail p values
 .....   0.7560   0.8309
0.0011   .....    0.9876
0.0001   0.0000   .....
Test for absence of any significant correlations
H0: correlation matrix is the identity matrix
Determinant         = 3.484E-03
Test statistic (TS) = 6.886E+01
Degrees of freedom  = 3
P(chi-sq >= TS)     = 0.0000     Reject H0 at 1% sig.level
Matrix B, partial correlation results for variables: yyx
Upper triangle: partial r, Lower: corresponding 2-tail p values
 ...    -0.7381
0.0026   ...

Table 3.67: Correlation: partial correlation matrix

Exactly as for the full correlation matrix, the strict upper triangle of the output from the partial correlation


analysis contains the partial correlation coefficients $r_{ij}$, while the strict lower triangle holds the corresponding two-tail probabilities $p_{ij}$, where

$$p_{ij} = P\left(t_{n-n_x-2} \le -|r_{ij}|\sqrt{\frac{n-n_x-2}{1-r_{ij}^2}}\right) + P\left(t_{n-n_x-2} \ge |r_{ij}|\sqrt{\frac{n-n_x-2}{1-r_{ij}^2}}\right).$$

To be more precise, the values $a_{ij}$ and $b_{ij}$ in the matrices A and B of table 3.67 are interpreted as now described. In the first matrix A, for j > i in the upper triangle, $a_{ij} = r_{ij} = r_{ji}$ are full correlation coefficients, while for i > j in the lower triangle $a_{ij} = p_{ij} = p_{ji}$ are the corresponding two-tail probabilities. In the second matrix B, for j > i in the upper triangle, $b_{ij} = r_{ij} = r_{ji}$ are partial correlation coefficients, while for i > j in the lower triangle $b_{ij} = p_{ij} = p_{ji}$ are the corresponding two-tail probabilities.
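The three-variable formula is easy to verify by hand. Here is a minimal sketch using the pacorr.tf1 correlation matrix quoted in table 3.66; the only assumption beyond the text is the 1.488E-01 check value printed at the end.

import math

r12, r13, r23 = 0.6162, 0.8267, 0.7321   # from table 3.66
n = 30

def partial(r_ij, r_ik, r_jk):
    # r(i,j|k) as defined above
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))

r12_3 = partial(r12, r13, r23)
t = r12_3 * math.sqrt((n - 1 - 2) / (1 - r12_3**2))   # ndof = n - 1 - 2 = 27
print(f"r(1,2|3) = {r12_3:.4f}, t = {t:.4f}")          # approx. 0.0286, 0.1488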
3.9.8.4 Correlation: canonical

This technique is employed when a n by m data matrix includes at least two groups of variables, say $n_x$ variables of type X, and $n_y$ variables of type Y, measured on the same n subjects, so that $m \ge n_x + n_y$. The idea is to find two transformations, one for the X variables to generate new variables V, and one for the Y variables to generate new variables U, with l components each for $l \le \min(n_x, n_y)$, such that the canonical variates $u_1, v_1$ calculated from the data using these transformations have maximum correlation, then $u_2, v_2$, and so on. Now the variance-covariance matrix of the X and Y data can be partitioned as

$$\begin{pmatrix} S_{xx} & S_{xy} \\ S_{yx} & S_{yy} \end{pmatrix}$$

and it is required to find transformations that maximize the correlations between the X and Y data sets. Actually, the equations

$$(S_{xy}S_{yy}^{-1}S_{yx} - R^2 S_{xx})a = 0$$
$$(S_{yx}S_{xx}^{-1}S_{xy} - R^2 S_{yy})b = 0$$

have the same nonzero eigenvalues as the matrices $S_{xx}^{-1}S_{xy}S_{yy}^{-1}S_{yx}$ and $S_{yy}^{-1}S_{yx}S_{xx}^{-1}S_{xy}$, and the square roots of these eigenvalues are the canonical correlations, while the eigenvectors of the two above equations define the canonical coefficients, i.e. loadings. Table 3.68 shows the results from analyzing data in the test file g03adf.tf1, which has 9 rows and 4 columns.

Variables: yxxy
No. x = 2, No. y = 2, No. unused = 0
Minimum of rank of x and rank of y = 2
Correlations  Eigenvalues  Proportions  Chi-sq.      NDOF  p
0.9570        9.1591E-01   0.8746       1.4391E+01   4     0.0061
0.3624        1.3133E-01   0.1254       7.7438E-01   1     0.3789
CVX: Canonical coefficients for centralized X
-4.261E-01   1.034E+00
-3.444E-01  -1.114E+00
CVY: Canonical coefficients for centralized Y
-1.415E-01   1.504E-01
-2.384E-01  -3.424E-01

Table 3.68: Correlation: canonical

Users of this technique should note that the columns of the data matrix must be indicated by setting an n by 1 integer vector with values of 1 for X, 0 for variable suppressed, or -1 for Y. Such variable indicators can be initialized from the trailing section of the data file, by using the special token begin{indicators}, as will be seen at the end of g03adf.tf1, where the following indicator variables are appended


begin{indicators}
-1 1 1 -1
end{indicators}

indicating that variables 1 and 4 are to be considered as Y variables, while variables 2 and 3 are to be regarded as X variables. However, the assignment of data columns to groups of type X, suppressed, or Y can also be adjusted interactively if required. Note that the eigenvalues are proportional to the correlation explained by the corresponding canonical variates, so a scree diagram can be plotted to determine the minimum number of canonical variates needed to adequately represent the data. This diagram plots the eigenvalues together with the average eigenvalue, and the canonical variates with eigenvalues above the average should be retained. Alternatively, assuming multivariate normality, the likelihood ratio test statistics

$$-2\log\lambda = -\left(n - (k_x + k_y + 3)/2\right)\sum_{j=i+1}^{l}\log(1 - R_j^2)$$

can be calculated for i = 0, 1, ..., l - 1, where $k_x \le n_x$ and $k_y \le n_y$ are the ranks of the X and Y data sets and $l = \min(k_x, k_y)$. These are asymptotically chi-square distributed with $(k_x - i)(k_y - i)$ degrees of freedom, so that the case i = 0 tests that none of the l correlations are significant, the case i = 1 tests that none of the remaining l - 1 correlations are significant, and so on. If any of these tests in sequence are not significant, then the remaining tests should, of course, be ignored. Figure 3.18 illustrates two possible graphical displays for the canonical variates defined by matrix.tf5, where columns 1 and 2 are designated the Y sub-matrix, while columns 3 and 4 hold the X matrix.
Figure 3.18: Canonical correlations for two groups (left: canonical variable u1 plotted against canonical variable v1; right: canonical variable u2 against canonical variable v2)
The canonical variates for X are constructed from the $n_x$ by ncv loading or coefficient matrix CVX, where CVX(i, j) contains the loading coefficient for the ith x variable on the jth canonical variate $u_j$. Similarly, CVY is the $n_y$ by ncv loading coefficient matrix for the ith y variable on the jth canonical variate $v_j$. More precisely, if $cvx_j$ is column j of CVX, and $cvy_j$ is column j of CVY, while x(k) is the vector of centralized X observations for case k, and y(k) is the vector of centralized Y observations for case k, then the components $u(k)_j$ and $v(k)_j$ of the n vector canonical variates $u_j$ and $v_j$ are

$$v(k)_j = cvx_j^T x(k), \quad k = 1, 2, \ldots, n$$
$$u(k)_j = cvy_j^T y(k), \quad k = 1, 2, \ldots, n.$$

It is important to realize that the canonical variates for U and V do not represent any sort of regression of Y on X, or X on Y; they are just new coordinates chosen to present the existing correlations between the original X and Y in a new space, where the correlations are then ordered for convenience as

$$R^2(u_1, v_1) \ge R^2(u_2, v_2) \ge \ldots \ge R^2(u_l, v_l).$$


Clearly, the left hand plot shows the highest correlation, that is, between u1 and v1 , whereas the right hand plot illustrates weaker correlation between u2 and v2 . Note that further linear regression and correlation analysis can also be performed on the canonical variates if required, and also the loading matrices can be saved to construct canonical variates using the SIMFIT matrix multiplication routines, and vectors of canonical variates can be saved directly from plots like those in gure 3.18.
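The eigenvalue route described above is easy to reproduce outside SIMFIT. A rough numpy sketch follows, using a hypothetical data set rather than matrix.tf5, and taking the nonzero eigenvalues of $S_{xx}^{-1}S_{xy}S_{yy}^{-1}S_{yx}$ as the squared canonical correlations.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 2))               # n = 9 cases, 2 X variables
Y = 0.5 * X + rng.normal(size=(9, 2))     # 2 Y variables, correlated with X

Z = np.hstack([X, Y])
Z = Z - Z.mean(axis=0)
S = Z.T @ Z / (len(Z) - 1)                # partitioned variance-covariance matrix
Sxx, Sxy = S[:2, :2], S[:2, 2:]
Syx, Syy = S[2:, :2], S[2:, 2:]

M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Syx)
eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]
print("canonical correlations:", np.sqrt(np.clip(eigvals, 0, None)))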
3.9.8.5 Cluster analysis: multivariate dendrograms

The idea is, as in data mining, where you have a n by m matrix $a_{ij}$ of m variables (columns) for each of n cases (rows) and wish to explore clustering, that is, groupings together of like entities. To do this, you choose an appropriate pre-analysis transformation of the data, a suitable distance measure, a meaningful scaling procedure and a sensible linkage function. SIMFIT will then calculate a distance matrix, or a similarity matrix, and plot the clusters as a dendrogram. The shape of the dendrogram depends on the choice of analytical techniques, and the order of objects plotted is arbitrary: groups at a given fixed distance can be rotated and displayed in either orientation. As an example, analyze the test file cluster.tf1, giving the results displayed in table 3.69. The test file cluster.tf1 should be examined to see how to provide labels, as in figure 3.19, and further details about plotting dendrograms will be found on pages 247, 248 and 249.

Variables included: 1 2 3 4 5 6 7 8
Transformation = Untransformed
Distance       = Euclidean distance
Scaling        = Unscaled
Linkage        = Group average
Weighting      [weights r not used]
Distance matrix (strict lower triangle) is:
 2) 2.20E+01
 3) 3.62E+01 2.88E+01
 4) 2.29E+01 2.97E+01 3.66E+01
 5) 1.95E+01 1.66E+01 3.11E+01 2.45E+01
 6) 3.98E+01 3.27E+01 4.06E+01 3.18E+01 2.61E+01
 7) 2.17E+01 2.83E+01 3.82E+01 2.13E+01 1.93E+01 3.62E+01
 8) 1.41E+01 2.41E+01 4.26E+01 1.88E+01 1.89E+01 3.42E+01 1.85E+01
 9) 3.27E+01 2.30E+01 4.54E+01 4.49E+01 2.36E+01 3.87E+01 3.66E+01 3.34E+01
10) 3.16E+01 2.39E+01 3.72E+01 4.10E+01 2.22E+01 4.39E+01 3.35E+01 3.39E+01 2.47E+01
11) 3.22E+01 2.44E+01 3.91E+01 4.18E+01 2.02E+01 4.14E+01 3.13E+01 3.34E+01 1.99E+01 8.25E+00
12) 2.99E+01 2.27E+01 3.77E+01 3.90E+01 1.72E+01 3.84E+01 2.92E+01 3.14E+01 1.81E+01 1.14E+01 6.24E+00

Table 3.69: Cluster analysis: distance matrix

The distance $d_{jk}$ between objects j and k is just a chosen variant of the weighted $L_p$ norm

$$d_{jk} = \left\{\sum_{i=1}^{m} w_{ijk}\, D(a_{ji}/s_i, a_{ki}/s_i)\right\}^p, \text{ for some } D, \text{ e.g.,}$$

(a) The Euclidean distance $D(\alpha, \beta) = (\alpha - \beta)^2$ with p = 1/2 and $w_{ijk} = 1$

(b) The Euclidean squared difference $D(\alpha, \beta) = (\alpha - \beta)^2$ with p = 1 and $w_{ijk} = 1$

(c) The absolute distance $D = |\alpha - \beta|$ with p = 1 and $w_{ijk} = 1$, otherwise known as the Manhattan or city block metric.


Figure 3.19: Dendrograms and multivariate cluster analysis (cluster analysis dendrogram of distance against the labeled cases A-1 to L-12)
However, as the values of the variables may differ greatly in size, so that large values would dominate the analysis, it is usual to subject the data to a preliminary transformation or to apply a suitable weighting. Often it is best to transform the data to standardized (0, 1) form before constructing the dendrogram, or at least to use some sort of scaling procedure such as:

(i) use the sample standard deviation as $s_i$ for variable i,
(ii) use the sample range as $s_i$ for variable i, or
(iii) supply precalculated values of $s_i$ for variable i.

Bray-Curtis dissimilarity uses the absolute distance, except that the weighting factor is given by

$$w_{ijk} = \frac{1}{\sum_{i=1}^{m}(a_{ji}/s_i + a_{ki}/s_i)}$$

which is independent of the variables i and only depends on the cases j and k, and distances are usually multiplied by 100 to represent percentage differences. Bray-Curtis similarity is the complement, i.e., 100 minus the dissimilarity. The Canberra distance measure, like the Bray-Curtis one, also derives from the absolute distance, except that the weighting factor is now

$$w_{ijk} = \frac{1}{\lambda(a_{ji}/s_i + a_{ki}/s_i)}.$$

There are various conventions for defining $\lambda$ and deciding what to do when values or denominators are zero with the Bray-Curtis and Canberra distance measures, and the scheme used by SIMFIT is as follows. If any data values are negative the calculation is terminated. If any Bray-Curtis denominator is zero the calculation is terminated. If there are no zero values, then $\lambda$ is equal to the number of variables in the Canberra measure. If both members of a pair are zero, then $\lambda$ is decreased by one for each occurrence of such a pair, and the pairs are ignored.


If one member of a pair is zero, then it is replaced by the smallest non-zero value in the data set divided by five, then scaled if required.

Another choice which will affect the dendrogram shape is the method used to recalculate distances after each merge has occurred. Suppose there are three clusters i, j, k with $n_i$, $n_j$, $n_k$ objects in each cluster, and let clusters j and k be merged to give cluster jk. Then the distance from cluster i to cluster jk can be calculated in several ways.

[1] Single link: $d_{i,jk} = \min(d_{ij}, d_{ik})$
[2] Complete link: $d_{i,jk} = \max(d_{ij}, d_{ik})$
[3] Group average: $d_{i,jk} = (n_j d_{ij} + n_k d_{ik})/(n_j + n_k)$
[4] Centroid: $d_{i,jk} = (n_j d_{ij} + n_k d_{ik} - n_j n_k d_{jk}/(n_j + n_k))/(n_j + n_k)$
[5] Median: $d_{i,jk} = (d_{ij} + d_{ik} - d_{jk}/2)/2$
[6] Minimum variance: $d_{i,jk} = \{(n_i + n_j)d_{ij} + (n_i + n_k)d_{ik} - n_i d_{jk}\}/(n_i + n_j + n_k)$

An important application of distance matrices and dendrograms is in partial clustering. Unlike the situation with full clustering, where we start with n groups, each containing a single case, and finish with just one group containing all the cases, in partial clustering the clustering process is not allowed to be completed. There are two distinct ways to arrest the clustering procedure.

1. A number, K, between 1 and n is chosen, and clustering is allowed to proceed until just K subgroups have been formed. It may not always be possible to satisfy this requirement, e.g. if there are ties in the data.

2. A threshold, D, is set somewhere between the first clustering distance and the last clustering distance, and clustering terminates when this threshold is reached. The position of such clustering thresholds will be plotted on the dendrogram, unless D is set equal to zero.

As an example of this technique consider the results in table 3.70. This resulted from analysis of the famous Fisher iris data set in iris.tf1 when K = 3 subgroups were requested. We note that groups 1 (setosa) and 2 (versicolor) contained all the cases from the known classification, but most of the known group 3 (virginica) cases (those identified by asterisks) were also assigned to subgroup 2. This table should also be compared to table 3.73, resulting from K-means clustering analysis of the same data set. From the SIMFIT dendrogram partial clustering procedure it is also possible to create a SIMFIT MANOVA type file for any type of subsequent MANOVA analysis and, to aid in the use of dendrogram clusters as training sets for allocating new observations to groups, the subgroup centroids are also appended to such files. Finally, attention should be drawn to the advanced techniques for plotting dendrogram thresholds and subgroups illustrated on page 249.
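The same pipeline (group-average linkage, dendrogram, partial clustering into K subgroups) can be sketched with scipy; the following is not SIMFIT code, and the standardized data matrix here is hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(size=(12, 8))                 # 12 cases, 8 variables
data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

Z = linkage(data, method="average", metric="euclidean")  # group average linkage
groups = fcluster(Z, t=3, criterion="maxclust")          # partial clustering, K = 3
print("subgroup assignments:", groups)

dendrogram(Z, labels=[chr(65 + i) for i in range(12)])   # labels A..L
plt.ylabel("Distance")
plt.show()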
3.9.8.6 Cluster analysis: classical metric scaling

Scaling techniques provide various alternatives to dendrograms for visualizing distances between cases, so facilitating the recognition of potential groupings in a space of lower dimension than the number of variables. For instance, once a distance matrix $D = (d_{ij})$ has been calculated for m > 2 variables, as described for dendrograms (page 140), it may be possible to calculate principal coordinates. This involves constructing a matrix E defined by

$$e_{ij} = -\frac{1}{2}\left(d_{ij}^2 - \bar{d}_{i.}^2 - \bar{d}_{.j}^2 + \bar{d}_{..}^2\right),$$

where $\bar{d}_{i.}^2$ is the average of $d_{ij}^2$ over the suffix j, etc., in the usual way. The idea is to choose an integer k, where 1 < k < m - 1, so that the data can be represented approximately in a space of dimension less than the number of variables, but in such a way that the distances between the points in that space correspond to the distances represented by the $d_{ij}$ of the distance matrix as far as possible.


Title: Fisher's Iris data, 3 groups, 4 variables
Variables included: 1 2 3 4
Transformation = Untransformed, Distance = Euclidean distance,
Scaling = Unscaled, Linkage = Group average, [weights r not used]
Dendrogram sub-clusters for K = 3
Odd rows: case numbers ... Even rows: corresponding group number
  1   2   3   4   5   6   7   8   9  10  11  12
  1   1   1   1   1   1   1   1   1   1   1   1
 13  14  15  16  17  18  19  20  21  22  23  24
  1   1   1   1   1   1   1   1   1   1   1   1
 25  26  27  28  29  30  31  32  33  34  35  36
  1   1   1   1   1   1   1   1   1   1   1   1
 37  38  39  40  41  42  43  44  45  46  47  48
  1   1   1   1   1   1   1   1   1   1   1   1
 49  50  51  52  53  54  55  56  57  58  59  60
  1   1   2   2   2   2   2   2   2   2   2   2
 61  62  63  64  65  66  67  68  69  70  71  72
  2   2   2   2   2   2   2   2   2   2   2   2
 73  74  75  76  77  78  79  80  81  82  83  84
  2   2   2   2   2   2   2   2   2   2   2   2
 85  86  87  88  89  90  91  92  93  94  95  96
  2   2   2   2   2   2   2   2   2   2   2   2
 97  98  99 100 101 102 103 104 105 106 107 108
  2   2   2   2   2*  2*  3   2*  2*  3   2*  3
109 110 111 112 113 114 115 116 117 118 119 120
  2*  3   2*  2*  2*  2*  2*  2*  2*  3   3   2*
121 122 123 124 125 126 127 128 129 130 131 132
  2*  2*  3   2*  2*  3   2*  2*  2*  3   3   3
133 134 135 136 137 138 139 140 141 142 143 144
  2*  2*  2*  3   2*  2*  2*  2*  2*  2*  2*  2*
145 146 147 148 149 150
  2*  2*  2*  2*  2*  2*

Table 3.70: Cluster analysis: partial clustering for Iris data

If E is positive definite, then the ordered eigenvalues $\lambda_i > 0$ of E will be nonnegative, and the proportionality expression

$$P = \sum_{i=1}^{k}\lambda_i \Big/ \sum_{i=1}^{m-1}\lambda_i$$

will show how well the data of dimension m are represented in this subspace of dimension k. The most useful case is when k = 2, or k = 3, and the $d_{ij}$ satisfy

$$d_{ij} \le d_{ik} + d_{jk},$$

so that a two or three dimensional plot will display distances corresponding to the $d_{ij}$. If this analysis is carried out but some relatively large negative eigenvalues result, then the proportion P may not adequately represent the success in capturing the values in the distance matrix in a subspace of lower dimension that can be plotted meaningfully. It should be pointed out that the principal coordinates will actually be the same as the principal components scores when the distance matrix is based on Euclidean norms. Further, where metrical scaling succeeds, the distances between points plotted in say two or three dimensions will obey the triangle inequality and so correspond reasonably closely to the distances in the dissimilarity matrix, but if it fails it could be useful to proceed to non-metrical scaling, which is discussed next.
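Classical metric scaling is compact enough to sketch directly. The following numpy fragment (not the SIMFIT implementation, and with a hypothetical distance matrix) forms E by double-centering, which is equivalent to the $e_{ij}$ definition above, and reports the proportion P for k = 2.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # Euclidean distances

D2 = D ** 2
J = np.eye(len(D)) - np.ones_like(D) / len(D)
E = -0.5 * J @ D2 @ J                # double-centering, e_ij as in the text

vals, vecs = np.linalg.eigh(E)
idx = np.argsort(vals)[::-1]         # order eigenvalues descending
vals, vecs = vals[idx], vecs[:, idx]
k = 2
coords = vecs[:, :k] * np.sqrt(vals[:k])   # principal coordinates
P = vals[:k].sum() / vals[: len(D) - 1].sum()
print(f"proportion represented in {k} dimensions: P = {P:.4f}")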


3.9.8.7 Cluster analysis: non-metric (ordinal) scaling

Often a distance matrix is calculated where some or all of the variables are ordinal, so that only the relative order is important, not the actual distance measure. Non-metric (i.e. ordinal) scaling is similar to the metric scaling previously discussed, except that the representation in a space of dimension 1 < k < m - 1 is sought in such a way as to attempt to preserve the relative orders, but not the actual distances. The closeness of a fitted distance matrix to the observed distance matrix can be estimated as either STRESS, or SSTRESS, given by

$$STRESS = \sqrt{\frac{\sum_{i=1}^{m}\sum_{j=1}^{i-1}(d_{ij} - \tilde{d}_{ij})^2}{\sum_{i=1}^{m}\sum_{j=1}^{i-1} d_{ij}^2}}$$

$$SSTRESS = \sqrt{\frac{\sum_{i=1}^{m}\sum_{j=1}^{i-1}(d_{ij}^2 - \tilde{d}_{ij}^2)^2}{\sum_{i=1}^{m}\sum_{j=1}^{i-1} d_{ij}^4}}$$

where $d_{ij}$ is the Euclidean squared distance between points i and j, and $\tilde{d}_{ij}$ is the fitted distance when the $d_{ij}$ are monotonically regressed on the $d_{ij}$. This means that $\tilde{d}_{ij}$ is monotonic relative to $d_{ij}$ and is obtained from $d_{ij}$ with the smallest number of changes. This is a nonlinear optimization problem which may depend critically on starting estimates, and so can only be relied upon to locate a local, not a global solution. For this reason, starting estimates can be obtained in SIMFIT by a preliminary metric scaling, or alternatively the values from such a scaling can be randomly perturbed before the optimization, in order to explore possible alternative solution points. Note that SIMFIT can save distance matrices to files, so that dendrogram creation, classical metric, and non-metric scaling can be carried out retrospectively, without the need to generate distance matrices repeatedly from multivariate data matrices. Such distance matrices will be stored as vectors, corresponding to the strict lower triangle of the distance matrix packed by rows (i.e. the strict upper triangle packed by columns). Table 3.71 tabulates the results from analyzing the distance matrix stored in the test file g03faf.tf1 by the metric, and also both non-metric, techniques.

Eigenvalues from classical metric scaling
 0.7871   0.2808   0.1596   0.0748   0.0316   0.0207   0.0000
-0.0122  -0.0137  -0.0305  -0.0455  -0.0562  -0.0792  -0.1174
[Sum 1 to 2]/[sum 1 to 13] = 1.0680 (106.80%)
STRESS   = 1.2557E-01 (start = Metric 0%)
S-STRESS = 1.4962E-01 (start = Metric 0%)

Table 3.71: Cluster analysis: metric and non-metric scaling

This table first lists the eigenvalues from classical metric scaling, where each eigenvalue has been normalized by dividing by the sum of all the eigenvalues, then the STRESS and SSTRESS values are listed. Note that the type of starting estimates used,


together with the percentages of the metric values used in any random starts, are output by SIMFIT, and it will be seen that, with this distance matrix, there are small but negative eigenvalues, and hence the proportion actually exceeds unity; in addition, two-dimensional plotting could be misleading. However, it is usual to consider such small negative eigenvalues as being effectively zero, so that metric scaling in two dimensions is probably justified in this case. Figure 3.20 confirms this by showing considerable agreement between the two dimensional plots from metric scaling, and also non-metric scaling involving the STRESS calculation.

Figure 3.20: Classical metric and non-metric scaling (left panel: Classical Metric Scaling; right panel: Non-Metric Scaling; in each case component 2 is plotted against component 1 for the 14 numbered cases)
Note that the default labels in such plots may be integers corresponding to the case numbers, and not case labels, but such plot labels can be edited interactively, or overwritten from a labels file if required.
3.9.8.8 Cluster analysis: K-means

Once a n by m matrix of values $a_{ij}$ for n cases and m variables has been provided, the cases can be subdivided into K non-empty clusters where K < n, provided that a K by m matrix of starting estimates $b_{ij}$ has been specified. The procedure is iterative, and proceeds by moving objects between clusters to minimize the objective function

$$\sum_{k=1}^{K}\sum_{i \in S_k}\sum_{j=1}^{m} w_i (a_{ij} - \bar{a}_{kj})^2$$

where $S_k$ is the set of objects in cluster k and $\bar{a}_{kj}$ is the weighted sample mean for variable j in cluster k. The weighting factors $w_i$ can allow for situations where the objects may not be of equal value, e.g., if replicates have been used to determine the $a_{ij}$. As an example, analyze the data in test file g03eff.tf1 using the starting coordinates appended to this file, which are identical to the starting clusters in test file g03eff.tf2, to see the results displayed in table 3.72. Note that the final cluster centroids minimizing the objective function, given the starting estimates supplied, are calculated, and the cases are assigned to these final clusters. Plots of the clusters and final cluster centroids can be created as in figure 3.21 for variables $x_1$ and $x_2$, with optional labels if these are supplied on the data file (as for dendrograms). With two dimensional data representing actual distances, outline maps can be added and other special effects can be created, as shown on page 250. Table 3.73 illustrates analysis of the Fisher Iris data set in the file iris.tf1, using starting clusters in iris.tf2. It should be compared with table 3.70. The data were maintained in the known group order (as in manova1.tf5), and the clusters assigned are seen to be identical to the known classification for group 1 (setosa), while limited misclassification has occurred for groups 2 (versicolor, 2 assigned to group 3) and 3 (virginica, 14 assigned to group 2), as shown by the starred values.


Variables included: 1 2 3 4 5
No. clusters   = 3
Transformation = Untransformed
Weighting      = Unweighted for replicates
Cases (odd rows) and Clusters (even rows)
  1   2   3   4   5   6   7   8   9  10  11  12
  1   1   3   2   3   1   1   2   2   3   3   3
 13  14  15  16  17  18  19  20
  3   3   3   3   3   1   1   3
Final cluster centroids
8.1183E+01  1.1667E+01  7.1500E+00  2.0500E+00  6.6000E+00
4.7867E+01  3.5800E+01  1.6333E+01  2.4000E+00  6.7333E+00
6.4045E+01  2.5209E+01  1.0745E+01  2.8364E+00  6.6545E+00

Table 3.72: Cluster analysis: K-means clustering

Figure 3.21: K-means clustering: example 1 (the 20 labeled cases plotted as variable 2 against variable 1, together with the three final cluster centroids)

Clearly group 1 is distinct from groups 2 and 3, which show some similarities to each other, a conclusion also illustrated in figure 3.22, which should be compared with figure 3.27 using principal components (page 148) and canonical variates (page 158). It must be emphasized that in figure 3.22 the groups were generated by K-means clustering, while figure 3.27 was created using pre-assigned groups. Another difference is that the graph from clusters generated by K-means clustering is in the actual coordinates (or a transformation of the original coordinates), while the graphs in principal components or canonical variates are not in the original variables, but special linear combinations of the physical variables, chosen to emphasize features of the total data set. Also note that in figure 3.22 there are data assigned to groups with centroids that are not nearest neighbors in the space plotted. This is because the clusters are assigned using the distances from cluster centroids when all dimensions are taken into account, not just the two plotted, which explains this apparent anomaly. Certain other aspects of the SIMFIT implementation of K-means clustering should be made clear.

1. If variables differ greatly in magnitude, data should be transformed before cluster analysis but note that, if this is done interactively, the same transformation will be applied to the starting clusters. If a transformation cannot be applied to data, clustering will not be allowed at all, but if a starting estimate


Cluster assignments for the 150 cases (asterisks mark cases assigned against
the known classification):
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 3* 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3* 2 2 2 2 2 2 2 2 2 2 2 2
3 2* 3 3 3 3 3 3 2* 2* 3 2* 3 2* 3 3 2* 3 2 3 3 3 3 2* 3 3 2* 3 3 2*
1 1 1 1 2 2 2 2 3 3 2* 3
1 1 1 1 2 2 2 2 3 3 3 3
1 1 1 1 2 2 2 2 3 3 3 3
1 1 1 1 2 2 2 2 2* 3 3 2*
1 1 1 1 2 2 2 2 3 2* 3 3
Cluster   Size   WSSQ        Sum of weights
1         50     1.515E+01   5.000E+01
2         62     3.982E+01   6.200E+01
3         38     2.388E+01   3.800E+01
Final cluster centroids
5.0060E+00   3.4280E+00   1.4620E+00   2.4600E-01
5.9016E+00   2.7484E+00   4.3935E+00   1.4339E+00
6.8500E+00   3.0737E+00   5.7421E+00   2.0711E+00

Table 3.73: K-means clustering for Iris data

[Plot: Technique: K-means Clustering, Data: Fisher Iris Measurements; Sepal Width plotted against Sepal Length]
Figure 3.22: K-means clustering: example 2

cannot be transformed (e.g., square root of a negative number), then that particular value will remain untransformed.


2. If, after initial assignment of data to the starting clusters, some are empty, clustering will not start, and a warning will be issued to decrease the number of clusters requested, or edit the starting clusters.

3. Clustering is an iterative procedure, and different starting clusters may lead to different final cluster assignments. So, to explore the stability of a cluster assignment, you can perturb the starting clusters by adding or multiplying by a random factor, or you can even generate a completely random starting set. For instance, if the data have been normalized to zero mean and unit variance, then choosing uniform random starting clusters from U(-1, 1), or normally distributed values from N(0, 1), might be considered.

4. After clusters have been assigned you may wish to pursue further analysis, say using the groups for canonical variate analysis, or as training sets for allocation of new observations to groups. To do this, you can create a SIMFIT MANOVA type file with group indicator in column 1. Such files also have the centroids appended, and these can be overwritten by new observations (not forgetting to edit the extra line counter following the last line of data) for allocating to the groups as training sets.

5. If weighting, variable suppression, or interactive transformation is used when assigning K-means clusters, all results tables, plots and MANOVA type files will be expressed in coordinates of the transformed space.
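For reference, iterating K-means from user-supplied starting centroids can be sketched with scipy; this is not SIMFIT code, and the three loose groups below are hypothetical.

import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(loc, 0.5, size=(20, 2))
                  for loc in ([0, 0], [3, 3], [0, 3])])   # three loose groups
start = np.array([[0.5, 0.5], [2.5, 2.5], [0.5, 2.5]])    # K by m starting matrix

centroids, labels = kmeans2(data, start, minit="matrix")  # iterate from 'start'
print("final cluster centroids:\n", centroids)
print("cluster sizes:", np.bincount(labels))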
3.9.8.9 Principal components analysis

In the principal components analysis of a n by m data matrix, new coordinates y are selected by rotation of the original coordinates x so that the proportion of the variance projected onto the new axes decreases in the order $y_1, y_2, \ldots, y_m$. The hope is that most of the variance can be accounted for by a subset of the data in y coordinates, so reducing the number of dimensions required for data analysis. It is usual to scale the original data so that the variables are all of comparable dimensions and have similar variances, otherwise the analysis will be dominated by variables with large values. Basing principal components analysis on the correlation matrix, rather than the covariance or sum of squares and cross product matrices, is often recommended as it also prevents the analysis being unduly dominated by variables with large values. The data format for principal components analysis is exactly the same as for cluster analysis; namely a data matrix with n rows (cases) and m columns (variables). If the data matrix is X with covariance, correlation or scaled sum of squares and cross products matrix S, then the quadratic form

$$a_1^T S a_1$$

is maximized subject to the normalization $a_1^T a_1 = 1$ to give the first principal component

$$c_1 = \sum_{i=1}^{m} a_{1i} x_i.$$

Similarly, the quadratic form

$$a_2^T S a_2$$

is maximized, subject to the normalization and orthogonality conditions $a_2^T a_2 = 1$ and $a_2^T a_1 = 0$, to give the second principal component

$$c_2 = \sum_{i=1}^{m} a_{2i} x_i,$$

and so on. The vectors $a_i$ are the eigenvectors of S with eigenvalues $\lambda_i^2$, where the proportion of the variation accounted for by the ith principal component can be estimated as

$$\lambda_i^2 \Big/ \sum_{j=1}^{m} \lambda_j^2.$$


Actually SIMFIT uses a singular value decomposition (SVD) of a centered and scaled data matrix, say $X_s = (X - \bar{X})/\sqrt{n-1}$, as in

$$X_s = V \Lambda P^T$$

to obtain the diagonal matrix $\Lambda$ of singular values, the matrix of left singular vectors V as the n by m matrix of scores, and the matrix of right singular vectors P as the m by m matrix of loadings. Table 3.74 shows analysis of the data in g03aaf.tf1.

Variables included: 1 2 3
Transformation: Untransformed
Matrix type:    Variance-covariance matrix
Score type:     Score variance = eigenvalue
Replicates:     Unweighted for replicates
Eigenvalues  Proportion  Cumulative  chi-sq      DOF  p
8.274E+00    0.6515      0.6515      8.613E+00   5    0.1255
3.676E+00    0.2895      0.9410      4.118E+00   2    0.1276
7.499E-01    0.0590      1.0000      0.000E+00   0    0.0000
Principal Component loadings (by column)
-1.38E-01   6.99E-01   7.02E-01
-2.50E-01   6.61E-01  -7.07E-01
 9.58E-01   2.73E-01  -8.42E-02
Principal Component scores (by column)
-2.15E+00  -1.73E-01  -1.07E-01
 3.80E+00  -2.89E+00  -5.10E-01
 1.53E-01  -9.87E-01  -2.69E-01
-4.71E+00   1.30E+00  -6.52E-01
 1.29E+00   2.28E+00  -4.49E-01
 4.10E+00   1.44E-01   8.03E-01
-1.63E+00  -2.23E+00  -8.03E-01
 2.11E+00   3.25E+00   1.68E-01
-2.35E-01   3.73E-01  -2.75E-01
-2.75E+00  -1.07E+00   2.09E+00

Table 3.74: Principal components analysis

Column j of the loading matrix contains the coefficients required to express $y_j$ as a linear function of the variables $x_1, x_2, \ldots, x_m$, and row i of the scores matrix contains the values for row i of the original data expressed in variables $y_1, y_2, \ldots, y_m$. In this instance the data were untransformed and the variance-covariance matrix was used to illustrate the statistical test for relevant eigenvalues but, where different types of variables are used with widely differing means and variances, it is more usual to analyze the correlation matrix instead of the covariance matrix, and many other scaling and weighting options are available for more experienced users.



Figure 3.23: Principal component scores and loadings (left: Principal Components for Iris Data, PC 2 against PC 1 with the labeled cases Set/Ver/Vir and a 95% confidence ellipse; right: Loadings for Iris Data, loading 2 against loading 1 for the variables Se-le, Se-wi, Pe-le, Pe-wi)

Figure 3.23 shows the scores and loadings for the data in test file iris.tf1 plotted as a scattergram after analyzing the correlation matrix. The score plot displays the score components for all samples using the selected principal components, so some may prefer to label the legends as principal components instead of scores, and this plot is used to search for possible groupings among the sample. The loading plot displays the coefficients that express the selected principal components $y_j$ as linear functions of the original variables $x_1, x_2, \ldots, x_m$, so this plot is used to observe the contributions of the original variables x to the new ones y. Note that a 95% confidence Hotelling $T^2$ ellipse is also plotted, which assumes a multivariate normal distribution for the original data and uses the F distribution. The confidence ellipse is based on the fact that, if $\bar{y}$ and S are the estimated mean vector and covariance matrix from a sample of size n and, if x is a further independent sample from an assumed p-variate normal distribution, then

$$(x - \bar{y})^T S^{-1} (x - \bar{y}) \sim \frac{p(n^2 - 1)}{n(n - p)} F_{p,\, n-p},$$

i=k+1

log(2 i ) + (m k) log

i=k+1

2 i /(m k)

with (m k 1)(m k + 2)/2 degrees of freedom can be used to test for the equality of the remaining m k eigenvalues. If one of these test statistics, say the k + 1th, is not signicant then it is usual to assume k principal components should be retained and the rest regarded as of little importance. So, if it is concluded that the remaining eigenvalues are of comparable importance, then a decision has to be made whether to eliminate all or preserve all. For instance, from the last column of p values referring to the above chi-square test in table 3.74, it might be concluded that a minimum of four components are required to represent the data

Multivariate statistics

151

Principal Components Scree Diagram


3

Eigenvalues and Average

10

13

16

Number
Figure 3.24: Principal components scree diagram

in cluster.tf1 adequately. The common practise of always using two or three components just because these can be visualized is to be deplored.
3.9.8.10 Procrustes analysis

This technique is useful when there are two matrices X and Y with the same dimensions, and it wished to see how closely the X matrix can be made to t the target matrix Y using only distance preserving transformations, like translation and rotation. For instance, X could be a matrix of loadings, and the target matrix Y could be a reference matrix of loadings from another data set. Table 3.75 illustrates the outcome from analyzing data X-data for rotation: g03bcf.tf1 Y-data for target: g03bcf.tf2 No. of rows 3, No. of columns 2 Type: To origin then Y-centroid Scaling: Scaling Alpha = 1.5563E+00 Residual sum of squares = 1.9098E-02 Residuals from Procrustes rotation 9.6444E-02 8.4554E-02 5.1449E-02 Rotation matrix from Procrustes rotation 9.6732E-01 2.5357E-01 -2.5357E-01 9.6732E-01 Y-hat matrix from Procrustes rotation -9.3442E-02 2.3872E-02 1.0805E+00 2.5918E-02 1.2959E-02 1.9502E+00 Table 3.75: Procrustes analysis

152

S IMFIT reference manual: Part 3

in the test les g03bcf.tf1 with X data to be rotated, and g03bcf.tf2 containing the target matrix. First the centroids of X and Y are translated to the origin to give Xc and Yc . Then the matrix of rotations R that minimize the sum of squared residuals is found from the singular value decomposition as
T Xc Yc = UDV T

R = UV T , and after rotation a dilation factor can be estimated by least squares, if required, to give the estimate c = Xc R. Y Additional options include normalizing both matrices to have unit sums of squares, normalizing the X matrix to have the same sum of squares as the Y matrix, and translating to the original Y centroid after rotation. Also, as well as displaying the residuals, the sum of squares, the rotation and best t matrices, options are provided to plot arbitrary rows or columns of these matrices.
3.9.8.11 Varimax and Quartimax rotation

Generalized orthomax rotation techniques can be used to simplify the interpretation of loading matrices, e.g. from canonical variates or factor analysis. These are only unique up to rotation so, by applying rotations according to stated criteria, different contributions of the original variables can be assessed. Table 3.76 illustrates how this analysis is performed using the test le g03baf.tf1. The input loading matrix has m No. of rows 10, No. of columns 3 Type: Unstandardised, Scaling: Varimax, Gamma = 1 Data for G03BAF 7.8800E-01 -1.5200E-01 -3.5200E-01 8.7400E-01 3.8100E-01 4.1000E-02 8.1400E-01 -4.3000E-02 -2.1300E-01 7.9800E-01 -1.7000E-01 -2.0400E-01 6.4100E-01 7.0000E-02 -4.2000E-02 7.5500E-01 -2.9800E-01 6.7000E-02 7.8200E-01 -2.2100E-01 2.8000E-02 7.6700E-01 -9.1000E-02 3.5800E-01 7.3300E-01 -3.8400E-01 2.2900E-01 7.7100E-01 -1.0100E-01 7.1000E-02 Rotation matrix ... Varimax 6.3347E-01 -5.3367E-01 -5.6029E-01 7.5803E-01 5.7333E-01 3.1095E-01 1.5529E-01 -6.2169E-01 7.6772E-01 Rotated matrix ... Varimax 3.2929E-01 -2.8884E-01 -7.5901E-01 8.4882E-01 -2.7348E-01 -3.3974E-01 4.4997E-01 -3.2664E-01 -6.3297E-01 3.4496E-01 -3.9651E-01 -6.5659E-01 4.5259E-01 -2.7584E-01 -3.6962E-01 2.6278E-01 -6.1542E-01 -4.6424E-01 3.3219E-01 -5.6144E-01 -4.8537E-01 4.7248E-01 -6.8406E-01 -1.8319E-01 2.0881E-01 -7.5370E-01 -3.5429E-01 4.2287E-01 -5.1350E-01 -4.0888E-01 Table 3.76: Varimax rotation rows and k columns and results from the analysis of an original data matrix with n rows (i.e. cases) and m


The input loading matrix has m rows and k columns and results from the analysis of an original data matrix with n rows (i.e. cases) and m columns (i.e. variables), where k factors have been calculated for k <= m. If the input loading matrix is not standardized to unit length rows, this can be done interactively. The rotated matrix Lambda* is calculated so that the elements lambda*_ij are either relatively large or small. This involves minimizing the generalized orthomax objective function

$$V = \sum_{j=1}^{k}\sum_{i=1}^{m}(\lambda^*_{ij})^4 - \frac{\gamma}{m}\sum_{j=1}^{k}\left[\sum_{i=1}^{m}(\lambda^*_{ij})^2\right]^2$$

for one of two cases as follows:

Varimax rotation: gamma = 1
Quartimax rotation: gamma = 0.

The resulting rotation matrix R satisfies Lambda* = Lambda R and, when the matrices have been calculated, they can be viewed, written to the results log file, saved to a text file, or plotted.
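A minimal sketch of the classical orthomax iteration is given below (an assumption on my part that the well known SVD-based algorithm is acceptable here; numpy is assumed, the function name is illustrative, and column signs may differ from table 3.76 since rotations are only unique up to reflection):

import numpy as np

def orthomax(L, gamma=1.0, max_iter=100, tol=1e-8):
    # Rotate an m by k loading matrix L (gamma=1 varimax, gamma=0
    # quartimax); returns the rotated loadings L @ R and the rotation R.
    m, k = L.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        U, S, Vt = np.linalg.svd(
            L.T @ (Lr**3 - (gamma / m) * Lr @ np.diag((Lr**2).sum(axis=0))))
        R = U @ Vt
        if S.sum() - crit < tol:
            break
        crit = S.sum()
    return L @ R, R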
3.9.8.12 Multivariate analysis of variance (MANOVA)

Sometimes a designed experiment is conducted in which more than one response is measured at each treatment, so that there are two possible courses of action.

1. Do a separate ANOVA analysis for each variable. The disadvantages of this approach are that it is tedious, and also it relies upon the questionable assumption that each variable is statistically independent of every other variable, with a fixed variance for each variable. The advantages are that the variance ratio tests are intuitive and unambiguous, and also there is no requirement that sample size per group should be greater than the number of variables.

2. Do an overall MANOVA analysis for all variables simultaneously. The disadvantages of this technique are that it relies on the assumption of a multivariate normal distribution with identical covariance matrices across groups, it requires a sample size per group greater than the number of variables, and also there is no unique and intuitive best test statistic. Further, the power will tend to be lower than the power of the corresponding ANOVA. The advantages are that analysis is compact, and several useful options are available which simplify situations like the analysis of repeated measurements.

Central to a MANOVA analysis are the assumptions that there are n observations of a random m dimensional vector divided into g groups, each with n_i observations, so that n = sum of the n_i over the g groups, where n_i > m for i = 1, 2, ..., g. If y_ij is the m vector for individual j of group i, then the sample mean, corrected sum of squares and products matrix C_i, and covariance matrix S_i for group i are

$$\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}y_{ij}$$
$$C_i = \sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)(y_{ij}-\bar{y}_i)^T$$
$$S_i = \frac{C_i}{n_i-1}.$$

For each ANOVA design there will be a corresponding MANOVA design in which corrected sums of squares and product matrices replace the ANOVA sums of squares, but where other test statistics are required in place of the ANOVA F distributed variance ratios. This will be clarified by dealing with typical MANOVA procedures, such as testing for equality of means and equality of covariance matrices across groups.

MANOVA example 1. Testing for equality of all means


If all groups have the same multivariate normal distribution, then estimates for the mean and covariance matrix can be obtained from the overall sample statistics

$$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{g}\sum_{j=1}^{n_i}y_{ij}$$
$$\hat{\Sigma} = \frac{1}{n-1}\sum_{i=1}^{g}\sum_{j=1}^{n_i}(y_{ij}-\hat{\mu})(y_{ij}-\hat{\mu})^T$$

obtained by ignoring group means and summing across all groups. Alternatively, the pooled between-groups B, within-groups W, and total sum of squares and products matrices T can be obtained along with the within-groups covariance matrix S using the group mean estimates as

$$B = \sum_{i=1}^{g}n_i(\bar{y}_i-\bar{y})(\bar{y}_i-\bar{y})^T$$
$$W = \sum_{i=1}^{g}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)(y_{ij}-\bar{y}_i)^T = \sum_{i=1}^{g}(n_i-1)S_i = (n-g)S$$
$$T = B + W = (n-1)\hat{\Sigma}.$$

Table 3.77 is typical, and clearly strong differences between groups will be indicated if B is much larger than W.

Source of variation   d.f.    ssp matrix
Between groups        g - 1   B
Within groups         n - g   W
Total                 n - 1   T

Table 3.77: MANOVA example 1a. Typical one way MANOVA layout

The usual likelihood ratio test statistic is Wilks' lambda defined as

$$\Lambda = \frac{|W|}{|B + W|}$$

but other statistics can also be defined as functions of the eigenvalues of BW^{-1}. Unlike B and W separately, the matrix BW^{-1} is not symmetric and positive definite but, if the m eigenvalues of BW^{-1} are theta_i, then Wilks' lambda, Roy's largest root R, the Lawley-Hotelling trace T, and the Pillai trace P can be defined as

$$\Lambda = \prod_{i=1}^{m}\frac{1}{1+\theta_i}$$
$$R = \max(\theta_i)$$
$$T = \sum_{i=1}^{m}\theta_i$$
$$P = \sum_{i=1}^{m}\frac{\theta_i}{1+\theta_i}.$$
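These four statistics are easy to compute once B and W are available; the sketch below is an illustration assuming numpy, not SIMFIT code, and it uses the fact that the eigenvalues of BW^{-1} equal those of W^{-1}B since the two matrices are similar:

import numpy as np

def manova_statistics(B, W):
    # Eigenvalues theta_i of B W^{-1} give the four MANOVA test statistics.
    theta = np.linalg.eigvals(np.linalg.solve(W, B)).real
    wilks = np.prod(1.0 / (1.0 + theta))
    roy = theta.max()
    lawley_hotelling = theta.sum()
    pillai = (theta / (1.0 + theta)).sum()
    return wilks, roy, lawley_hotelling, pillai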

Table 3.78 resulted when manova1.tf3 was analyzed, and the methods used to calculate the significance levels will be outlined. Table 3.79 indicates conditions on the number of groups g, variables m, and total number of observations n that lead to exact F variables for appropriate transforms of Wilks' Lambda.


MANOVA H0: all mean vectors are equal
No. groups       = 3
No. variables    = 2
No. observations = 15

Statistic            Value       Transform   deg.free.   p
Wilks lambda         1.917E-01   7.062E+00   4, 22       0.0008  Reject H0 at 1%
Roys largest root    2.801E+00
Lawley-Hotelling T   3.173E+00   8.727E+00   4, 11       0.0017  Reject H0 at 1%
Pillais trace        1.008E+00

Table 3.78: MANOVA example 1b. Test for equality of all means

Parameters      F statistic                                  Degrees of freedom
g = 2, any m    (n - m - 1)(1 - Lambda)/(m Lambda)           m, n - m - 1
g = 3, any m    (n - m - 2)(1 - sqrt(Lambda))/(m sqrt(Lambda))   2m, 2(n - m - 2)
m = 1, any g    (n - g)(1 - Lambda)/((g - 1)Lambda)          g - 1, n - g
m = 2, any g    (n - g - 1)(1 - sqrt(Lambda))/((g - 1)sqrt(Lambda))   2(g - 1), 2(n - g - 1)

Table 3.79: MANOVA example 1c. The distribution of Wilks' Lambda

For other conditions the asymptotic expression

$$-\left(\frac{2n-2-m-g}{2}\right)\log\Lambda \approx \chi^2_{m(g-1)}$$

is generally used. The Lawley-Hotelling trace is a generalized Hotelling's T_0^2 statistic, and so the null distribution of this can be approximated as follows. Defining the degrees of freedom and multiplying factors alpha and beta by

$$\nu_1 = g - 1$$
$$\nu_2 = n - g$$
$$\alpha = \frac{m\nu_1(\nu_2-m)}{\nu_1+\nu_2-m-1}$$
$$\beta = \frac{(\nu_2-1)(\nu_1+\nu_2-m-1)}{(\nu_2-m)(\nu_2-m-1)(\nu_2-m-3)},$$

then the case beta > 0 leads to the approximation

$$\beta T \approx F_{\alpha,\nu_2-m+1},$$

otherwise the alternative approximation

$$\beta T \approx \chi^2_f$$

is employed, where f = m nu_1/{beta(nu_2 - m - 1)}. The null distributions for Roy's largest root and Pillai's trace are more complicated to approximate, which is one reason why Wilks' Lambda is the most widely used test statistic.


MANOVA example 2. Testing for equality of selected means

Table 3.80 resulted when groups 2 and 3 were tested for equality, another example of a Hotelling's T^2 test.

MANOVA H0: selected group means are equal
First group      = 2 ( 5 cases)
Second group     = 3 ( 5 cases)
No. observations = 15 (to estimate CV)
No. variables    = 2
Hotelling T2     = 1.200E+01
Test statistic S = 5.498E+00
Numerator DOF    = 2
Denominator DOF  = 11
P(F >= S)        = 0.0221   Reject H0 at 5% sig.level

MANOVA H0: selected group means are equal
First group      = 2 ( 5 cases)
Second group     = 3 ( 5 cases)
No. observations = 10 (to estimate CV)
No. variables    = 2
Hotelling T2     = 1.518E+01
Test statistic S = 6.640E+00
Numerator DOF    = 2
Denominator DOF  = 7
P(F >= S)        = 0.0242   Reject H0 at 5% sig.level

Table 3.80: MANOVA example 2. Test for equality of selected means

The first result uses the difference vector d_{2,3} between the means estimated from groups 2 and 3 with the matrix W = (n - g)S estimated using the pooled sum of squares and products matrix to calculate and test T^2 according to

$$T^2 = \left(\frac{(n-g)n_2 n_3}{n_2+n_3}\right)d_{2,3}^T W^{-1}d_{2,3}$$
$$\frac{n-g-m+1}{m(n-g)}T^2 \sim F_{m,n-g-m+1},$$

while the second result uses the data from samples 2 and 3 as if they were the only groups as follows

$$S_{2,3} = \frac{(n_2-1)S_2+(n_3-1)S_3}{n_2+n_3-2}$$
$$T^2 = \left(\frac{n_2 n_3}{n_2+n_3}\right)d_{2,3}^T S_{2,3}^{-1}d_{2,3}$$
$$\frac{n_2+n_3-m-1}{m(n_2+n_3-2)}T^2 \sim F_{m,n_2+n_3-m-1}.$$

The first method could be used if all covariance matrices are equal (see next) but the second might be preferred if it was only likely that the selected covariance matrices were identical.
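The second method can be sketched as follows (an illustration assuming numpy and scipy, not SIMFIT code; the function name is hypothetical):

import numpy as np
from scipy.stats import f as f_dist

def hotelling_two_groups(Y2, Y3):
    # Two sample Hotelling T^2 test using the covariance pooled from
    # the two selected groups only (the second method in the text).
    n2, m = Y2.shape
    n3 = Y3.shape[0]
    d = Y2.mean(axis=0) - Y3.mean(axis=0)
    S23 = ((n2 - 1) * np.cov(Y2, rowvar=False) +
           (n3 - 1) * np.cov(Y3, rowvar=False)) / (n2 + n3 - 2)
    T2 = (n2 * n3 / (n2 + n3)) * d @ np.linalg.solve(S23, d)
    S = (n2 + n3 - m - 1) / (m * (n2 + n3 - 2)) * T2
    return T2, S, f_dist.sf(S, m, n2 + n3 - m - 1)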


MANOVA H0: all covariance matrices are equal
No. groups       = 3
No. observations = 21
No. variables    = 2
Test statistic C = 1.924E+01
No. Deg. freedom = 6
P(chi-sqd. >= C) = 0.0038   Reject H0 at 1% sig.level

Table 3.81: MANOVA example 3. Test for equality of all covariance matrices

MANOVA example 3. Testing for equality of all covariance matrices

Table 3.81 shows the results from using Box's test to analyze manova1.tf2 for equality of covariance matrices. This depends on the likelihood ratio test statistic C defined by

$$C = M\left\{(n-g)\log|S| - \sum_{i=1}^{g}(n_i-1)\log|S_i|\right\},$$

where the multiplying factor M is

$$M = 1 - \frac{2m^2+3m-1}{6(m+1)(g-1)}\left(\sum_{i=1}^{g}\frac{1}{n_i-1} - \frac{1}{n-g}\right)$$

and, for large n, C is approximately distributed as chi-square with m(m+1)(g-1)/2 degrees of freedom. Just as tests for equality of variances are not very robust, this test should be used with caution, and then only with large samples, i.e. n_i much greater than m.
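Box's test statistic C and multiplying factor M can be sketched directly from these formulas (numpy and scipy assumed; not SIMFIT code):

import numpy as np
from scipy.stats import chi2

def box_test(groups):
    # Box's test for equality of covariance matrices across g groups,
    # where each group is an (n_i by m) data matrix.
    g = len(groups)
    m = groups[0].shape[1]
    ni = np.array([G.shape[0] for G in groups])
    n = ni.sum()
    Si = [np.cov(G, rowvar=False) for G in groups]
    S = sum((k - 1) * Sk for k, Sk in zip(ni, Si)) / (n - g)   # pooled
    M = 1.0 - (2*m**2 + 3*m - 1) / (6.0*(m + 1)*(g - 1)) * \
        ((1.0/(ni - 1)).sum() - 1.0/(n - g))
    C = M * ((n - g)*np.log(np.linalg.det(S)) -
             sum((k - 1)*np.log(np.linalg.det(Sk)) for k, Sk in zip(ni, Si)))
    dof = m*(m + 1)*(g - 1)//2
    return C, dof, chi2.sf(C, dof)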


Figure 3.25: MANOVA profile analysis


MANOVA example 4. Profile analysis

Figure 3.25 illustrates the results from plotting the group means from manova1.tf1 using the profile analysis option, noting that error bars are not added as a multivariate distribution is assumed, while table 3.82 shows the results of the statistical analysis.

MANOVA H0: selected group profiles are equal
First group      = 1 ( 5 cases)
Second group     = 2 ( 5 cases)
No. observations = 10 (to estimate CV)
No. variables    = 5
Hotelling T2     = 3.565E+01
Test statistic S = 5.570E+00
Numerator DOF    = 4
Denominator DOF  = 5
P(F >= S)        = 0.0438   Reject H0 at 5% sig.level

Table 3.82: MANOVA example 4. Profile analysis

Profile analysis attempts to explore a common question that often arises in repeated measurements ANOVA, namely, can two profiles be regarded as parallel. This amounts to testing if the sequential differences between adjacent means for groups i and j are equal, that is, if the slopes between adjacent treatments are constant across the two groups, so that the two profiles represent a common shape. To do this, we first define the m - 1 by m transformation matrix K by

$$K = \begin{pmatrix} -1 & 1 & 0 & 0 & 0 & \ldots \\ 0 & -1 & 1 & 0 & 0 & \ldots \\ 0 & 0 & -1 & 1 & \ldots & \\ \ldots & \ldots & \ldots & \ldots & \ldots & \ldots \end{pmatrix}.$$

Then a Hotelling's T^2 test is conducted using the pooled estimate for the covariance matrix S_ij = [(n_i - 1)S_i + (n_j - 1)S_j]/(n_i + n_j - 2) and mean difference vector d_ij between the two group means, according to

$$T^2 = \left(\frac{n_i n_j}{n_i+n_j}\right)(Kd_{ij})^T(KS_{ij}K^T)^{-1}(Kd_{ij})$$

and comparing the transformed statistic

$$\frac{n_i+n_j-m}{(n_i+n_j-2)(m-1)}T^2 \sim F_{m-1,n_1+n_2-m}$$

to the corresponding F distribution. Clearly, from table 3.82, the profiles are not parallel for the data in test file manova1.tf3.
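A minimal sketch of this parallelism test, assuming numpy and scipy (the function name is illustrative and not part of SIMFIT), is:

import numpy as np
from scipy.stats import f as f_dist

def profile_parallelism(Y1, Y2):
    # Test whether two group profiles are parallel using the K matrix of
    # sequential differences and a Hotelling T^2 on the transformed means.
    n1, m = Y1.shape
    n2 = Y2.shape[0]
    K = np.zeros((m - 1, m))
    np.fill_diagonal(K, -1.0)
    np.fill_diagonal(K[:, 1:], 1.0)
    d = Y1.mean(axis=0) - Y2.mean(axis=0)
    Sij = ((n1 - 1) * np.cov(Y1, rowvar=False) +
           (n2 - 1) * np.cov(Y2, rowvar=False)) / (n1 + n2 - 2)
    Kd = K @ d
    T2 = (n1 * n2 / (n1 + n2)) * Kd @ np.linalg.solve(K @ Sij @ K.T, Kd)
    S = (n1 + n2 - m) / ((n1 + n2 - 2) * (m - 1)) * T2
    return T2, S, f_dist.sf(S, m - 1, n1 + n2 - m)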
3.9.8.13 Comparing groups: canonical variates (discriminant functions)


If MANOVA investigation suggests that at least one group mean vector differs from the rest, it is usual to proceed to canonical variates analysis, although this technique can also be used for data exploration when the assumption of multivariate normality with equal covariance matrices is not justified. Transforming multivariate data using canonical variates is a technique for highlighting differences between groups. Table 3.83 shows the results from analyzing data in the test file manova1.tf4 which has three groups, each of size three. The most useful application of this technique is to plot the group means together with the data and 95% confidence regions in canonical variate space in order to visualize how close or how far apart the groups are. This is done for the first two canonical variates in figure 3.26, which requires some explanation. First of all, note that canonical variates, unlike principal components, are not simply obtained by a distance preserving rotation: the transformation is non-orthogonal and best represents the Mahalanobis distance between groups. In figure 3.26 we see the group means identified by the filled symbols labelled as 1, 2 and 3, each surrounded by a 95% confidence region, which in this case is circular as equally scaled physical distances are plotted along the axes.


Rank = 3
Correlations   Eigenvalues   Proportions   Chi-sq.   NDOF   p
0.8826         3.5238        0.9795        7.9032    6      0.2453
0.2623         0.0739        0.0205        0.3564    2      0.8368
Canonical variate means
 9.841E-01  2.797E-01
 1.181E+00 -2.632E-01
-2.165E+00 -1.642E-02
Canonical coefficients
-1.707E+00  7.277E-01
-1.348E+00  3.138E-01
 9.327E-01  1.220E+00

Table 3.83: Comparing groups: canonical variates

Figure 3.26: Comparing groups: canonical variates and confidence regions

The canonical variates are uncorrelated and have unit variance so, assuming normality, the 100(1 - alpha)% confidence region for the population mean is a circle of radius

$$r = \sqrt{\chi^2_{\alpha,2}/n_i},$$

where group i has n_i observations and chi-square(alpha,2) is the value exceeded by 100 alpha% of a chi-square distribution with 2 degrees of freedom. Note that a circle of radius sqrt(chi-square(alpha,2)) defines a tolerance region, i.e. the region within which 100(1 - alpha)% of the whole population is expected to lie. Also, the test file manova1.tf4 has three other observations appended which are to be compared with the main groups in order to assign group membership, that is, to see to which of the main groups 1, 2 and 3 the extra observations should be assigned. The half-filled diamonds representing these are identified by the labels A, B and C which, like the identifying numbers 1, 2, and 3, are plotted automatically by SIMFIT to identify group means and extra data. In this case, as the data sets are small, the transformed observations from groups 1, 2 and 3 are also shown as circles, triangles


and squares respectively, which is easily done by saving the coordinates from the plotted transforms of the observations in ASCII text files which are then added interactively as extra data files to the means plot. The aim of canonical variate analysis is to find the transformations a_i that maximize F_i, the ratios of B (the between group sum of squares and products matrix) to W (the within-group sum of squares and products matrix), i.e.

$$F_i = \frac{a_i^T B a_i/(g-1)}{a_i^T W a_i/(n-g)}$$

where there are g groups and n observations with m covariates each, so that i = 1, 2, ..., l where l is the lesser of the number of groups minus one and the rank of the data matrix. The canonical variates are obtained by solving the symmetric eigenvalue problem

$$(B - \lambda^2 W)x = 0,$$

where the eigenvalues lambda_i^2 define the ratios F_i, and the eigenvectors a_i corresponding to the lambda_i^2 define the transformations. So, just as with principal components, a scree diagram of the eigenvalues in decreasing order indicates the proportion of the ratio of between-group to within-group variance captured by the canonical variates. Note that table 3.83 lists the rank k of the data matrix, the number of canonical variates l = min(k, g - 1), the eigenvalues lambda_i^2, the canonical correlations

$$\sqrt{\lambda_i^2/(1+\lambda_i^2)},$$

the proportions

$$\lambda_i^2\bigg/\sum_{j=1}^{l}\lambda_j^2,$$

the group means, the loadings, and the results of a chi-square test. If the data are assumed to be from a common multivariate distribution, then to test for a significant dimensionality greater than some level i, the statistic

$$\chi^2 = (n - 1 - g - (k-g)/2)\sum_{j=i+1}^{l}\log(1+\lambda_j^2)$$

has an asymptotic chi-square distribution with (k - i)(g - 1 - i) degrees of freedom. If the test is not significant for some level h, then the remaining tests for i > h should be ignored. It should be noted that the group means and loadings are calculated for data after column centering and the canonical variates have within group variance equal to unity. Also, if the covariance matrices Sigma_B = B/(g - 1) and Sigma_W = W/(n - g) are used, then Sigma_W^{-1}Sigma_B = (n - g)W^{-1}B/(g - 1), so eigenvectors of W^{-1}B are the same as those of Sigma_W^{-1}Sigma_B, but eigenvalues of W^{-1}B are (g - 1)/(n - g) times the corresponding eigenvalues of Sigma_W^{-1}Sigma_B. Figure 3.27 illustrates the famous Fisher Iris data set contained in manova1.tf5 and shown in table 3.73, using the first two principal components and also the first two canonical variates.

Figure 3.27: Comparing groups: principal components and canonical variates

In this instance there are only two canonical variates, so the canonical variates diagram is fully representative of the data set, and both techniques illustrate the distinct separation of group 1 (circles = setosa) from groups 2 (triangles = versicolor) and 3 (squares = virginica), and the lesser separation between groups 2 and 3. Users of these techniques


should always remember that, as eigenvectors are only defined up to an arbitrary scalar multiple and different matrices may be used in the principal component calculation, principal components and canonical variates may have to be reversed in sign and re-scaled to be consistent with calculations reported using software other than SIMFIT. To see how to compare extra data to groups involved in the calculations, the test file manova1.tf4 should be examined.
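The generalized eigenproblem above can be sketched with scipy (an illustration under the assumption that B and W are formed from grouped data as defined earlier; the function name is hypothetical):

import numpy as np
from scipy.linalg import eigh

def canonical_variates(groups):
    # Solve (B - lambda^2 W)x = 0 for grouped data; the columns of A are
    # the canonical variate loadings, ordered by decreasing eigenvalue.
    allY = np.vstack(groups)
    ybar = allY.mean(axis=0)
    B = sum(G.shape[0] * np.outer(G.mean(0) - ybar, G.mean(0) - ybar)
            for G in groups)
    W = sum((G.shape[0] - 1) * np.cov(G, rowvar=False) for G in groups)
    lam2, A = eigh(B, W)              # generalized symmetric eigenproblem
    order = np.argsort(lam2)[::-1]
    return lam2[order], A[:, order]

As noted in the text, the signs and scaling of the eigenvectors returned by any such routine may differ from those reported by SIMFIT.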
3.9.8.14 Comparing groups: Mahalanobis distances (discriminant analysis)

Discriminant analysis can be performed for grouped multivariate data as in table 3.84 for test file g03daf.tf1, by calculating Mahalanobis distances between group means, or between group means and samples.

D2 for all groups assuming unequal CV
 0.0000E+00  9.5570E+00  5.1974E+01
 8.5140E+00  0.0000E+00  2.5297E+01
 2.5121E+01  4.7114E+00  0.0000E+00
D2 for samples/groups assuming unequal CV
 3.3393E+00  7.5213E-01  5.0928E+01
 2.0777E+01  5.6559E+00  5.9653E-02
 2.1363E+01  4.8411E+00  1.9498E+01
 7.1841E-01  6.2803E+00  1.2473E+02
 5.5000E+01  8.8860E+01  7.1785E+01
 3.6170E+01  1.5785E+01  1.5749E+01

Table 3.84: Comparing groups: Mahalanobis distances

The squared Mahalanobis distance D^2_ij between two group means can be defined as either

$$D_{ij}^2 = (\bar{x}_i-\bar{x}_j)^T S^{-1}(\bar{x}_i-\bar{x}_j)$$
$$\text{or } D_{ij}^2 = (\bar{x}_i-\bar{x}_j)^T S_j^{-1}(\bar{x}_i-\bar{x}_j)$$

depending on whether the covariance matrices are assumed to be equal, when the pooled estimate S is used, or unequal when the group estimate S_j is used. This distance is a useful quantitative measure of similarity between groups, but often there will be extra measurements which can then be appended to the data file, as with manova1.tf2, so that the distance between measurement k and group j can be calculated as either

$$D_{kj}^2 = (x_k-\bar{x}_j)^T S^{-1}(x_k-\bar{x}_j)$$
$$\text{or } D_{kj}^2 = (x_k-\bar{x}_j)^T S_j^{-1}(x_k-\bar{x}_j).$$

From table 3.81 on page 157 we see that, for these data, the covariances must be regarded as unequal, so from table 3.84 we conclude that the groups are similarly spaced but, whereas extra data points 1 to 4 seem to belong to group 2, extra data points 5 and 6 cannot be allocated so easily.
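The distance itself is a one-line calculation (a sketch assuming numpy; the function name is illustrative):

import numpy as np

def mahalanobis_d2(xk, xbar_j, S_j):
    # Squared Mahalanobis distance between an observation (or group mean)
    # xk and the group mean xbar_j, using covariance estimate S_j.
    d = np.asarray(xk, float) - np.asarray(xbar_j, float)
    return float(d @ np.linalg.solve(S_j, d))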
3.9.8.15 Comparing groups: Assigning new observations

Assigning new observations to groups defined by training sets can be made more objective by employing Bayesian techniques than by simply using distance measures, but only if a multivariate normal distribution can be assumed. For instance, table 3.85 displays the results from assigning the six observations appended to g03dcf.tf1 to groups defined by using the data as a training set, under the assumption of unequal variance-covariance matrices and equal priors. The calculation is for g groups, each with n_j observations on m variables, and it is necessary to make assumptions about the identity or otherwise of the variance-covariance matrices, as well as assigning prior probabilities. Then Bayesian arguments lead to expressions for posterior probabilities q_j, under a variety of assumptions, given prior probabilities pi_j as follows.


Size of training set = 21
Number of groups     = 3
Method: Predictive
CV-mat: Unequal
Priors: Equal
Observation   Group-allocated
     1              2
     2              3
     3              2
     4              1
     5              3
     6              3
Posterior probabilities
0.0939  0.9046  0.0015
0.0047  0.1682  0.8270
0.0186  0.9196  0.0618
0.6969  0.3026  0.0005
0.3174  0.0130  0.6696
0.0323  0.3664  0.6013
Atypicality indices
0.5956  0.2539  0.9747
0.9519  0.8360  0.0184
0.9540  0.7966  0.9122
0.2073  0.8599  0.9929
0.9908  0.9999  0.9843
0.9807  0.9779  0.8871

Table 3.85: Comparing groups: Assigning new observations

Estimative with equal variance-covariance matrices (Linear discrimination)

$$\log q_j \propto -\tfrac{1}{2}D_{kj}^2 + \log\pi_j$$

Estimative with unequal variance-covariance matrices (Quadratic discrimination)

$$\log q_j \propto -\tfrac{1}{2}D_{kj}^2 + \log\pi_j - \tfrac{1}{2}\log|S_j|$$

Predictive with equal variance-covariance matrices

$$q_j \propto \frac{\pi_j}{((n_j+1)/n_j)^{m/2}\{1+[n_j/((n-g)(n_j+1))]D_{kj}^2\}^{(n-g+1)/2}}$$

Predictive with unequal variance-covariance matrices

$$q_j \propto \frac{\pi_j\,\Gamma(n_j/2)}{\Gamma((n_j-m)/2)((n_j^2-1)/n_j)^{m/2}|S_j|^{1/2}\{1+(n_j/(n_j^2-1))D_{kj}^2\}^{n_j/2}}$$

Subsequently the posterior probabilities are normalized so that the q_j sum to one over the g groups, and the new observations are assigned to the groups with the greatest posterior probabilities. In this analysis the priors can be assumed to be all equal, proportional to sample size, or user defined. Also, atypicality indices I_j are computed to estimate how well an observation fits into an assigned group. These are

Estimative with equal or unequal variance-covariance matrices

$$I_j = P(D_{kj}^2/2, m/2)$$


Predictive with equal variance-covariance matrices

$$I_j = R(D_{kj}^2/(D_{kj}^2+(n-g)(n_j+1)/n_j),\ m/2,\ (n-g-m+1)/2)$$

Predictive with unequal variance-covariance matrices

$$I_j = R(D_{kj}^2/(D_{kj}^2+(n_j^2-1)/n_j),\ m/2,\ (n_j-m)/2),$$

where P(x, a) is the incomplete gamma function (page 293), and R(x, a, b) is the incomplete beta function (page 290). Values of atypicality indices close to one for all groups suggest that the corresponding new observation does not fit well into any of the training sets, since one minus the atypicality index can be interpreted as the probability of encountering an observation as or more extreme than the one in question given the training set. As before, observations 5 and 6 do not seem to fit into any of the groups. Note that extra observations can be edited interactively or supplied independently in addition to the technique of appending to the data file as with manova1.tf2. However, the assignment of extra observations to the training sets depends on the data transformation selected and variables suppressed or included in the analysis, and this must be considered when supplying extra observations interactively. Finally, once extra observations have been assigned, you can generate an enhanced training set, by creating a SIMFIT MANOVA type file in which the new observations have been appended to the groups to which they have been assigned.
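The estimative quadratic rule, the simplest of the four, can be sketched as follows (numpy assumed; function name and the log-domain normalization trick are illustrative choices, not SIMFIT code):

import numpy as np

def quadratic_posteriors(x, means, covs, priors):
    # Estimative quadratic discrimination: log q_j is proportional to
    # -D2/2 + log(pi_j) - log|S_j|/2, then normalized to sum to one.
    logq = []
    for mu, S, pi in zip(means, covs, priors):
        d = np.asarray(x, float) - mu
        D2 = d @ np.linalg.solve(S, d)
        logq.append(-0.5 * D2 + np.log(pi) - 0.5 * np.linalg.slogdet(S)[1])
    logq = np.array(logq)
    q = np.exp(logq - logq.max())        # stabilized exponentiation
    return q / q.sum()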
3.9.8.16 Factor analysis

This technique is used when it is wished to express a multivariate data set in m manifest, or observed variables, in terms of k latent variables, where k < m. Latent variables are variables that by definition are unobservable, such as social class or intelligence, and thus cannot be measured but must be inferred by estimating the relationship between the observed variables and the supposed latent variables. The statistical treatment is based upon a very restrictive mathematical model that, at best, will only be a very crude approximation and, most of the time, will be quite inappropriate. For instance, Krzanowski (in Principles of Multivariate Analysis, Oxford, revised edition, 2000) explains how the technique is used in the psychological and social sciences, but then goes on to state

At the extremes of, say, Physics or Chemistry, the models become totally unbelievable. p477
It should only be used if a positive answer is provided to the question, Is the model valid? p503

However, despite such warnings, the technique is now widely used, either to attempt to explain observables in terms of hypothetical unobservables, or as just another technique for expressing multivariate data sets in a space of reduced dimension. In this respect it is similar to principal components analysis (page 148), except that the technique attempts to capture the covariances between the variables, not the variances. If the observed variables x can be represented as a linear combination of the unobservable variables or factors f, so that the partial correlation r_{ij.l} between x_i and x_j with f_l fixed is effectively zero, then the correlation between x_i and x_j can be said to be explained by f_l. The idea is to estimate the coefficients expressing the dependence of x on f in such a way that the residual correlation between the x variables is as small as possible, given the value of k. The assumed relationship between the mean-centered observable variables x_i and the factors is

$$x_i = \sum_{j=1}^{k}\lambda_{ij}f_j + e_i \quad\text{for } i = 1, 2, \ldots, m, \text{ and } j = 1, 2, \ldots, k$$

where lambda_ij are the loadings, f_i are independent normal random variables with unit variance, and e_i are independent normal random variables with variances psi_i. If the variance covariance matrix for x is Sigma, defined as

$$\Sigma = \Lambda\Lambda^T + \Psi,$$


where Lambda is the matrix of factor loadings lambda_ij, and Psi is the diagonal matrix of variances psi_i, while the sample covariance matrix is S, then maximum likelihood estimation requires the minimization of

$$F(\Psi) = \sum_{j=k+1}^{m}(\theta_j - \log\theta_j) - (m-k),$$

where theta_j are the eigenvalues of S* = Psi^{-1/2} S Psi^{-1/2}. Finally, the estimated loading matrix is given by

$$\hat{\Lambda} = \Psi^{1/2}V(\Theta - I)^{1/2},$$

where V are the eigenvectors of S*, Theta is the diagonal matrix of theta_i, and I is the identity matrix. Table 3.86 illustrates the analysis of data in g03caf.tf, which contains a correlation matrix for n = 211 and m = 9.

No. variables = 9, Transformation = Untransformed
Matrix type = Input correlation/covariance matrix directly
No. of factors = 3, Replicates = Unweighted for replicates
F(Psi-hat)    = 3.5017E-02
Test stat C   = 7.1494E+00
DegFreedom    = 12 (No. of cases = 211)
P(chisq >= C) = 0.8476
Eigenvalues   Communalities  Psi-estimates
1.5968E+01    5.4954E-01     4.5046E-01
4.3577E+00    5.7293E-01     4.2707E-01
1.8475E+00    3.8345E-01     6.1655E-01
1.1560E+00    7.8767E-01     2.1233E-01
1.1190E+00    6.1947E-01     3.8053E-01
1.0271E+00    8.2308E-01     1.7692E-01
9.2574E-01    6.0046E-01     3.9954E-01
8.9508E-01    5.3846E-01     4.6154E-01
8.7710E-01    7.6908E-01     2.3092E-01
Residual correlations
 0.0004
-0.0128  0.0220
 0.0114 -0.0053  0.0231
-0.0100 -0.0194 -0.0162  0.0033
-0.0046  0.0113 -0.0122 -0.0009 -0.0008
 0.0153 -0.0216 -0.0108  0.0023  0.0294 -0.0123
-0.0011 -0.0105  0.0134  0.0054 -0.0057 -0.0009  0.0032
-0.0059  0.0097 -0.0049 -0.0114  0.0020  0.0074  0.0033 -0.0012
Factor loadings by columns
6.6421E-01 -3.2087E-01  7.3519E-02
6.8883E-01 -2.4714E-01 -1.9328E-01
4.9262E-01 -3.0216E-01 -2.2243E-01
8.3720E-01  2.9243E-01 -3.5395E-02
7.0500E-01  3.1479E-01 -1.5278E-01
8.1870E-01  3.7667E-01  1.0452E-01
6.6150E-01 -3.9603E-01 -7.7747E-02
4.5793E-01 -2.9553E-01  4.9135E-01
7.6567E-01 -4.2743E-01 -1.1701E-02

Table 3.86: Factor analysis 1: calculating loadings

The proportion of variation for each variable x_i accounted for by the k factors is the communality, i.e. the sum over j = 1, ..., k of lambda_ij squared, the Psi-estimates are the variance estimates, and the residual correlations are the off-diagonal


elements of C - (Lambda-hat Lambda-hat^T + Psi-hat), where C is the sample correlation matrix. If a good fit has resulted and sufficient factors have been included, then the off-diagonal elements of the residual correlation matrix should be small with respect to the diagonals (listed with arbitrary values of unity to avoid confusion). Subject to the normality assumptions of the model, the minimum dimension k can be estimated by fitting sequentially with k = 1, k = 2, k = 3, and so on, until the likelihood ratio test statistic

$$\chi^2 = [n - 1 - (2m+5)/6 - 2k/3]F(\hat{\Psi})$$

is not significant as a chi-square variable with [(m - k)^2 - (m + k)]/2 degrees of freedom. Note that data for factor analysis can be input as a general n by m multivariate matrix, or as either a m by m covariance or correlation matrix. However, if a square covariance or correlation matrix is input then there are two further considerations: the sample size must be supplied independently, and it will not be possible to estimate or plot the sample scores in factor space, as the original sample matrix will not be available. It remains to explain the estimation of scores, which requires the original data of course, and not just the covariance or correlation matrix. This involves the calculation of a m by k factor score coefficients matrix Phi, so that the estimated vector of factor scores, given the x vector for an individual, can be calculated from

$$\hat{f} = x^T\Phi.$$

However, when calculating factor scores from the factor score coefficient matrix in this way, the observable variables x_i must be mean centered, and also scaled by the standard deviations if a correlation matrix has been analyzed. The regression method uses

$$\Phi = \Psi^{-1}\Lambda(I + \Lambda^T\Psi^{-1}\Lambda)^{-1},$$

while the Bartlett method uses

$$\Phi = \Psi^{-1}\Lambda(\Lambda^T\Psi^{-1}\Lambda)^{-1}.$$

Table 3.87 shows the analysis of g03ccf.tf1, a correlation matrix for 220 cases, 6 variables and 2 factors, but a further possibility should be mentioned. As the factors are only unique up to rotation, it is possible to perform a Varimax or Quartimax rotation (page 152) to calculate a rotation matrix R before working out the score coefficients, which may simplify the interpretation of the observed variables in terms of the unobservable variables.
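The two score coefficient formulas translate directly into code; the sketch below assumes numpy, with loadings and specific variances taken as given (the function name is illustrative):

import numpy as np

def factor_score_coefficients(L, psi, method="regression"):
    # Factor score coefficient matrix Phi from loadings L (m by k) and
    # specific variances psi (length m): regression or Bartlett method.
    psi = np.asarray(psi, float)
    PsiInvL = L / psi[:, None]               # Psi^{-1} Lambda
    G = L.T @ PsiInvL                        # Lambda^T Psi^{-1} Lambda
    if method == "regression":
        return PsiInvL @ np.linalg.inv(np.eye(L.shape[1]) + G)
    return PsiInvL @ np.linalg.inv(G)        # Bartlett method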
3.9.8.17 Biplots

The biplot is used to explore relationships between the rows and columns of any arbitrary matrix, by projecting the matrix onto a space of smaller dimensions using the singular value decomposition (SVD, page 200). It is based upon the fact that, as a n by m matrix X of rank k can be expressed as a sum of k rank 1 matrices as follows

$$X = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_k u_k v_k^T,$$

then the best fit rank r matrix Y with r < k which minimizes the objective function

$$S = \sum_{i=1}^{n}\sum_{j=1}^{m}(x_{ij}-y_{ij})^2 = \operatorname{trace}[(X-Y)(X-Y)^T]$$

is the sum of the first r of these rank 1 matrices. Further, such a least squares approximation results in the minimum value

$$S_{min} = \sigma_{r+1}^2 + \sigma_{r+2}^2 + \cdots + \sigma_k^2$$


C = 2.3346, P(chisq >= C) = 0.6745, DOF = 4 (n = 220, m = 6, k = 2)
Eigenvalues   Communalities  Psi-estimates
5.6142E+00    4.8983E-01     5.1017E-01
2.1428E+00    4.0593E-01     5.9407E-01
1.0923E+00    3.5627E-01     6.4373E-01
1.0264E+00    6.2264E-01     3.7736E-01
9.9082E-01    5.6864E-01     4.3136E-01
8.9051E-01    3.7179E-01     6.2821E-01
Factor loadings by columns
5.5332E-01 -4.2856E-01
5.6816E-01 -2.8832E-01
3.9218E-01 -4.4996E-01
7.4042E-01  2.7280E-01
7.2387E-01  2.1131E-01
5.9536E-01  1.3169E-01
Factor score coefficients, Method: Regression, Rotation: None
1.9318E-01 -3.9203E-01
1.7035E-01 -2.2649E-01
1.0852E-01 -3.2621E-01
3.4950E-01  3.3738E-01
2.9891E-01  2.2861E-01
1.6881E-01  9.7831E-02

Table 3.87: Factor analysis 2: calculating factor scores

so that the rank r least squares approximation Y accounts for a fraction

$$\frac{\sigma_1^2 + \cdots + \sigma_r^2}{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_k^2}$$

of the total variance, where k is less than or equal to the smaller of n and m, k is greater than or equal to r, and sigma_i = 0 for i > k. Figure 3.28 illustrates a biplot for the data in test file houses.tf1. The technique is based upon creating one of several possible rank-2 representations of a n by m matrix X with rank k of at least two as follows. Let the SVD of X be

$$X = U\Sigma V^T = \sum_{i=1}^{k}\sigma_i u_i v_i^T$$

so that the best fit rank-2 matrix Y to the original matrix X will be

$$Y = \begin{pmatrix} u_{11} & u_{21} \\ u_{12} & u_{22} \\ \vdots & \vdots \\ u_{1n} & u_{2n} \end{pmatrix}\begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}\begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1m} \\ v_{21} & v_{22} & \cdots & v_{2m} \end{pmatrix}.$$

Then Y can be written in several ways as GH^T, where G is an n by 2 matrix and H is an m by 2 matrix as follows.



Figure 3.28: Two dimensional biplot for East Jerusalem Households

1. General representation

$$Y = \begin{pmatrix} u_{11}\sqrt{\sigma_1} & u_{21}\sqrt{\sigma_2} \\ u_{12}\sqrt{\sigma_1} & u_{22}\sqrt{\sigma_2} \\ \vdots & \vdots \\ u_{1n}\sqrt{\sigma_1} & u_{2n}\sqrt{\sigma_2} \end{pmatrix}\begin{pmatrix} v_{11}\sqrt{\sigma_1} & v_{12}\sqrt{\sigma_1} & \cdots & v_{1m}\sqrt{\sigma_1} \\ v_{21}\sqrt{\sigma_2} & v_{22}\sqrt{\sigma_2} & \cdots & v_{2m}\sqrt{\sigma_2} \end{pmatrix}$$

2. Representation with row emphasis

$$Y = \begin{pmatrix} u_{11}\sigma_1 & u_{21}\sigma_2 \\ u_{12}\sigma_1 & u_{22}\sigma_2 \\ \vdots & \vdots \\ u_{1n}\sigma_1 & u_{2n}\sigma_2 \end{pmatrix}\begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1m} \\ v_{21} & v_{22} & \cdots & v_{2m} \end{pmatrix}$$

3. Representation with column emphasis

$$Y = \begin{pmatrix} u_{11} & u_{21} \\ u_{12} & u_{22} \\ \vdots & \vdots \\ u_{1n} & u_{2n} \end{pmatrix}\begin{pmatrix} v_{11}\sigma_1 & v_{12}\sigma_1 & \cdots & v_{1m}\sigma_1 \\ v_{21}\sigma_2 & v_{22}\sigma_2 & \cdots & v_{2m}\sigma_2 \end{pmatrix}$$


4. User-defined representation

$$Y = \begin{pmatrix} u_{11}\sigma_1^\alpha & u_{21}\sigma_2^\alpha \\ u_{12}\sigma_1^\alpha & u_{22}\sigma_2^\alpha \\ \vdots & \vdots \\ u_{1n}\sigma_1^\alpha & u_{2n}\sigma_2^\alpha \end{pmatrix}\begin{pmatrix} v_{11}\sigma_1^\beta & v_{12}\sigma_1^\beta & \cdots & v_{1m}\sigma_1^\beta \\ v_{21}\sigma_2^\beta & v_{22}\sigma_2^\beta & \cdots & v_{2m}\sigma_2^\beta \end{pmatrix}$$

where 0 < alpha < 1, and beta = 1 - alpha.
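A compact sketch of the user-defined factorization, assuming numpy (the function name is illustrative, and no scaling or reflection of the vectors is attempted), is:

import numpy as np

def biplot_vectors(X, alpha=0.5):
    # Rank-2 biplot coordinates: G holds the n row effect vectors and
    # H the m column effect vectors, with Y = G H^T for beta = 1 - alpha.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    G = U[:, :2] * s[:2] ** alpha             # row markers
    H = Vt[:2].T * s[:2] ** (1.0 - alpha)     # column markers
    return G, H

With alpha = 0.5 this is the general representation, while alpha = 1 and alpha = 0 give the row emphasis and column emphasis cases respectively.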

To construct a biplot we take the n row effect vectors g_i and m column effect vectors h_j as vectors with origin at (0, 0) and defined in the general representation as

$$g_i^T = (u_{1i}\sqrt{\sigma_1},\ u_{2i}\sqrt{\sigma_2})$$
$$h_j^T = (v_{1j}\sqrt{\sigma_1},\ v_{2j}\sqrt{\sigma_2})$$

with obvious identities for the alternative row emphasis and column emphasis factorizations. The biplot consists of n vectors with end points at (u_1i sqrt(sigma_1), u_2i sqrt(sigma_2)) and m vectors with end points at (v_1j sqrt(sigma_1), v_2j sqrt(sigma_2)) so that interpretation of the biplot is then in terms of the inner products of vector pairs. That is, vectors with the same direction correspond to proportional rows or columns, while vectors approaching right angles indicate near orthogonality, or small contributions. Another possibility is to display a difference biplot in which a residual matrix R is first created by subtracting the best fit rank-1 matrix so that

$$R = X - \sigma_1 u_1 v_1^T = \sum_{i=2}^{k}\sigma_i u_i v_i^T$$

and this is analyzed, using appropriate vectors calculated with sigma_2 and sigma_3 of course. Again, the row vectors may dominate the column vectors or vice versa whatever representation is used and, to improve readability, additional scaling factors may need to be introduced. For instance, figure 3.28 used the residual matrix and scaling factors of -100 for rows and -1 for columns to reflect and stretch the vectors until comparable size was attained. To do this over-rides the default autoscaling option, which is to scale each set of vectors so that the largest row and largest column vector are of unit length, whatever representation is chosen. Biplots are most useful when the number of rows and columns is not too large, and when the rank-2 approximation is satisfactory as an approximation to the data or residual matrix. Note that biplot labels should be short, and they can be appended to the data file as with houses.tf1, or pasted into the plot as a table of label values. Fine tuning to re-position labels was necessary with figure 3.28, and this can be done by editing the PostScript file in a text editor (page 338), or by using the same techniques described for scattergrams with labels (page 252). Sometimes, as with figure 3.29, it is useful to inspect biplots in three dimensions. This has the advantage that three singular values can be used, but the plot may have to be viewed from several angles to get a good idea of which vectors of like type are approaching a parallel orientation (indicating proportionality of rows or columns) and which pairs of vectors i, j of opposite types are orthogonal (i.e., at right angles, indicating small contributions to x_ij). As with other projection techniques, such as principal components, it is necessary to justify that the number of singular values used to display a biplot does represent the data matrix adequately. To do this, consider table 3.88 from the singular value decomposition of houses.tf1. In this example, it is clear that the first two or three singular values do represent the data adequately, and this is further reinforced by figure 3.30 where the percentage variance represented by the successive singular values is plotted as a function of the singular value index. Here we see the cumulative variance CV(i), defined as

$$CV(i) = \frac{100\sum_{j=1}^{i}\sigma_j^2}{\sum_{j=1}^{k}\sigma_j^2},$$

plotted as a function of the index i.



Figure 3.29: Three dimensional biplot for East Jerusalem Households

Index   Sigma(i)      Fraction  Cumulative    Sigma(i)^2    Fraction  Cumulative: rank = 8
1       4.99393E+02   0.7486    0.7486        2.49394E+05   0.9631    0.9631
2       8.83480E+01   0.1324    0.8811        7.80536E+03   0.0301    0.9933
3       3.36666E+01   0.0505    0.9315        1.13344E+03   0.0044    0.9977
4       1.78107E+01   0.0267    0.9582        3.17222E+02   0.0012    0.9989
5       1.28584E+01   0.0193    0.9775        1.65339E+02   0.0006    0.9995
6       1.04756E+01   0.0157    0.9932        1.09738E+02   0.0004    1.0000
7       3.37372E+00   0.0051    0.9983        1.13820E+01   0.0000    1.0000
8       1.15315E+00   0.0017    1.0000        1.32974E+00   0.0000    1.0000

Table 3.88: Singular values for East Jerusalem Households

Such tables or plots should always be inspected to make sure that CV(i) is greater than some minimum value (say 70 percent, for instance) for i = 2 or i = 3 as appropriate.


Figure 3.30: Percentage variance from singular value decomposition


3.9.9 Time series


A time series is a vector x(t) of n > 1 observations x_i obtained at a sequence of points t_i, e.g., times, distances, etc., at fixed intervals Delta, i.e. Delta = t_{i+1} - t_i, for i = 1, 2, ..., n - 1, and it is assumed that there is some seasonal variation, or other type of autocorrelation to be estimated. A linear trend can be removed by first order differencing

$$\nabla x_t = x_t - x_{t-1},$$

while seasonal patterns of seasonality s can be eliminated by first order seasonal differencing

$$\nabla_s x_t = x_t - x_{t-s}.$$
3.9.9.1 Time series data smoothing

Sometimes it is useful to be able to smooth a time series in order to suppress outliers and reveal trends more clearly. In extreme cases it may even be better to create a smoothed data set for further correlation analysis or model fitting. The obvious way to do this is to apply a moving average of span n, which replaces the data values by the average of n adjacent values to create a smooth set. When n is odd, it is customary to set the new smooth point equal to the mean of the original value and the (n - 1)/2 values on either side but, when n is even, the Hanning filter is used, that is, double averaging or alternatively using an appropriately weighted mean of span (n + 1). Because such moving averages could be unduly influenced by outliers, running medians can also be used; however a very popular smoothing method is the 4253H twice smoother. This starts by applying a span 4 running median centered by 2, followed by span 5 then span 3 running medians, and finally a Hanning filter. The rough (i.e., residuals) are then treated in the same way and the first-pass smooth are re-roughed by adding back the smoothed rough, then finally the rough are re-calculated. Figure 3.31 illustrates the effect of this T4253H smoothing.
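A simplified sketch of the running-median and Hanning building blocks is given below (numpy assumed; this is not the full T4253H smoother, since the span 4 centered step and the re-roughing pass are omitted for brevity):

import numpy as np

def running_median(x, span):
    # Centered running median of the given span (ends left unsmoothed).
    x = np.asarray(x, float)
    h = span // 2
    y = x.copy()
    for i in range(h, len(x) - h):
        y[i] = np.median(x[i - h:i + h + 1])
    return y

def hanning(x):
    # Hanning filter: weighted mean with weights 1/4, 1/2, 1/4.
    x = np.asarray(x, float)
    y = x.copy()
    y[1:-1] = 0.25 * x[:-2] + 0.5 * x[1:-1] + 0.25 * x[2:]
    return y

x = np.sin(np.linspace(0, 3, 60)) + 0.1 * np.random.randn(60)
smooth = hanning(running_median(running_median(x, 5), 3))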

Figure 3.31: The T4253H data smoother


3.9.9.2 Time series lags and autocorrelations

This procedure should be used to explore a time series before fitting an ARIMA model (page 174). The general idea is to observe the autocorrelations and partial autocorrelations in order to identify a suitable differencing scheme. You input a vector of length NX, which is assumed to represent a time series sequence with fixed differences, e.g., every day, at intervals of 10 centimeters, etc. Then you choose the orders of non-seasonal differencing ND, and seasonal differencing NDS, along with seasonality NS, the maximum number of lags required NK, and the maximum number of partial autocorrelations of interest L. All autocorrelations and partial autocorrelations requested are then worked out, and a statistic S to test for the presence of significant autocorrelations in the data is calculated. Table 3.89 shows the results from analysis of times.tf1.

Original dimension (NX)   = 100
After differencing (NXD)  = 99
Non-seasonal order (ND)   = 1
Seasonal order (NDS)      = 0
Seasonality (NS)          = 0
No. of lags (NK)          = 10
No. of PACF (NVL)         = 10
X-mean (differenced)      = 4.283E-01
X-variance (differenced)  = 3.152E-01
Statistic (S)             = 8.313E+01
P(chi-sq >= S)            = 0.0000
Lag      R          PACF          VR           ARP
 1    0.5917    5.917E-01    6.498E-01    3.916E-01
 2    0.5258    2.703E-01    6.024E-01    3.988E-01
 3    0.3087   -1.299E-01    5.922E-01    1.601E-03
 4    0.1536   -1.440E-01    5.799E-01   -1.440E-01
 5    0.0345   -5.431E-02    5.782E-01   -1.365E-01
 6   -0.0297    1.105E-02    5.782E-01   -4.528E-02
 7   -0.0284    7.109E-02    5.752E-01    1.474E-01
 8   -0.0642   -4.492E-02    5.741E-01    1.306E-01
 9   -0.1366   -1.759E-01    5.563E-01   -6.707E-02
10   -0.2619   -2.498E-01    5.216E-01   -2.498E-01

Table 3.89: Autocorrelations and Partial Autocorrelations

Note that differencing of orders d = ND, D = NDS, and seasonality s = NS may be applied repeatedly to a series so that

$$w_t = \nabla^d\nabla_s^D x_t$$

will be shorter, of length NXD = n - d - Ds, and will extend for t = 1 + d + Ds, ..., NX. Non-seasonal differencing up to order d is calculated sequentially using

$$\begin{aligned}
\nabla^1 x_i &= x_{i+1} - x_i &&\text{for } i = 1, 2, \ldots, n-1\\
\nabla^2 x_i &= \nabla^1 x_{i+1} - \nabla^1 x_i &&\text{for } i = 1, 2, \ldots, n-2\\
&\ldots\\
\nabla^d x_i &= \nabla^{d-1} x_{i+1} - \nabla^{d-1} x_i &&\text{for } i = 1, 2, \ldots, n-d
\end{aligned}$$

while seasonal differencing up to order D is calculated by the sequence

$$\begin{aligned}
\nabla^d\nabla_s^1 x_i &= \nabla^d x_{i+s} - \nabla^d x_i &&\text{for } i = 1, 2, \ldots, n-d-s\\
\nabla^d\nabla_s^2 x_i &= \nabla^d\nabla_s^1 x_{i+s} - \nabla^d\nabla_s^1 x_i &&\text{for } i = 1, 2, \ldots, n-d-2s\\
&\ldots\\
\nabla^d\nabla_s^D x_i &= \nabla^d\nabla_s^{D-1} x_{i+s} - \nabla^d\nabla_s^{D-1} x_i &&\text{for } i = 1, 2, \ldots, n-d-Ds.
\end{aligned}$$


Note that, as indicated in table 3.89, either the original sample X of length NX, or a differenced series XD of length NXD, can be analyzed interactively, by simply adjusting ND, NDS, or NS. Also the maximum number of autocorrelations NK < NXD and maximum number of partial autocorrelations L <= NK, can be controlled, although the maximum number of valid partial autocorrelations NVL may turn out to be less than L. Now, defining either x = X, and n = NX, or else x = XD and n = NXD as appropriate, and using K = NK, the mean and variance are recorded, plus the autocorrelation function R, comprising the autocorrelation coefficients of lag k according to

$$r_k = \frac{\sum_{i=1}^{n-k}(x_i-\bar{x})(x_{i+k}-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}.$$
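This definition translates directly into code; the following is an illustrative sketch assuming numpy (not SIMFIT code):

import numpy as np

def autocorrelations(x, K):
    # Sample autocorrelation coefficients r_1 ... r_K as defined above.
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    denom = (d ** 2).sum()
    return np.array([(d[:n - k] * d[k:]).sum() / denom
                     for k in range(1, K + 1)])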

If n is large and much larger than K, then the S statistic

$$S = n\sum_{k=1}^{K}r_k^2$$

has a chi-square distribution with K degrees of freedom under the hypothesis of zero autocorrelation, and so it can be used to test that all correlations are zero. The partial autocorrelation function PACF has coefficients at lag k corresponding to p_{k,k} in the autoregression

$$x_t = c_k + p_{k,1}x_{t-1} + p_{k,2}x_{t-2} + \cdots + p_{k,k}x_{t-k} + e_{k,t}$$

where e_{k,t} is the predictor error, and the p_{k,k} estimate the correlation between x_t and x_{t+k} conditional upon the intermediate values x_{t+1}, x_{t+2}, ..., x_{t+k-1}. Note that the parameters change as k increases, and so k = 1 is used for p_{1,1}, k = 2 is used for p_{2,2}, and so on. These parameters are determined from the Yule-Walker equations

$$r_i = p_{k,1}r_{i-1} + p_{k,2}r_{i-2} + \cdots + p_{k,k}r_{i-k},\quad i = 1, 2, \ldots, k$$

where r_j = r_{|j|} when j < 0, and r_0 = 1. An iterative technique is used and it may not always be possible to solve for all the partial autocorrelations requested. This is because the predictor error variance ratios VR are defined as

$$v_k = \mathrm{Var}(e_{k,t})/\mathrm{Var}(x_t) = 1 - p_{k,1}r_1 - p_{k,2}r_2 - \cdots - p_{k,k}r_k,$$

unless |p_{k,k}| >= 1 is encountered at some k = L_0, when the iteration terminates, with NVL = L_0 - 1. The autoregressive parameters of maximum order ARP are the final parameters p_{L,j} for j = 1, 2, ..., NVL where NVL is the number of valid partial autocorrelation values, and L is the maximum number of partial autocorrelation coefficients requested, or else L = L_0 - 1 as before in the event of premature termination of the algorithm. Figure 3.32 shows the data in test file times.tf1 before differencing and after first order non-seasonal differencing has been applied to remove the linear trend. Note that, to obtain hardcopy of any differenced series,

Figure 3.32: Time series before and after differencing


a file containing the t values and corresponding differenced values can be saved from the graph as an ASCII coordinate file, then column 1 can be discarded using edit. A valuable way to detect significant autocorrelations is to plot the autocorrelation coefficients, or the partial autocorrelation coefficients, as in figure 3.33. The statistical significance of autocorrelations at specified lags can be judged by plotting the approximate 95% confidence limits or, as in this case, by plotting plus and minus 2/sqrt(n), where n is the sample size (after differencing, if any). Note that in plotting time series data you can always choose the starting value and the increment between observations, otherwise defaults starting at 1 with an increment of 1 will be assumed.

Figure 3.33: Time series autocorrelations and partial autocorrelations

3.9.9.3 Autoregressive integrated moving average models (ARIMA)

It must be stressed that fitting an ARIMA model is a very specialized iterative technique that does not yield unique solutions. So, before using this procedure, you must have a definite idea, by using the autocorrelation and partial autocorrelation options (page 172), or by knowing the special features of the data, exactly what differencing scheme to adopt and which parameters to fit. Users can select the way that starting estimates are estimated, they can monitor the optimization, and they can alter the tolerances controlling the convergence, but only expert users should alter the default settings. It is assumed that the time series data x_1, x_2, ..., x_n follow an ARIMA model so that a differenced series given by

$$w_t = \nabla^d\nabla_s^D x_t - c$$

can be fitted, where c is a constant, d is the order of non-seasonal differencing, D is the order of seasonal differencing and s is the seasonality. The method estimates the expected value c of the differenced series in terms of an uncorrelated series a_t and an intermediate series e_t using parameters phi, theta, Phi, Theta as follows. The seasonal structure is described by

$$w_t = \Phi_1 w_{t-s} + \Phi_2 w_{t-2s} + \cdots + \Phi_P w_{t-Ps} + e_t - \Theta_1 e_{t-s} - \Theta_2 e_{t-2s} - \cdots - \Theta_Q e_{t-Qs}$$

while the non-seasonal structure is assumed to be

$$e_t = \phi_1 e_{t-1} + \phi_2 e_{t-2} + \cdots + \phi_p e_{t-p} + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q}.$$

The model parameters phi_1, phi_2, ..., phi_p, theta_1, theta_2, ..., theta_q and Phi_1, Phi_2, ..., Phi_P, Theta_1, Theta_2, ..., Theta_Q are estimated by nonlinear optimization, the success of which is heavily dependent on choosing an appropriate differencing scheme, starting estimates and convergence criteria. After fitting an ARIMA model, forecasts can be estimated along with 95% confidence limits. For example, table 3.90 shows the results from fitting the data in times.tf1 with a non-seasonal order of one and no seasonality, along with forecasts and associated standard errors, while figure 3.34 illustrates the fit. On the first graph the original time series data are plotted along with forecasts and 95% confidence limits for the predictions. However it should be realized that only the differenced time series has been fitted, that is, after first order differencing to remove the linear trend. So in the second plot the best fit ARIMA model is shown as a continuous line, while the differenced data are plotted as symbols.
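A fit of the same general kind can be sketched with the statsmodels package (an assumption on my part, since SIMFIT uses its own optimizer, so estimates will not agree exactly; the data are assumed to have been exported as a plain single-column file, which may require editing the SIMFIT test file header first):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# ARIMA(p, d, q) = (1, 1, 0): one autoregressive parameter after first
# order non-seasonal differencing, as for times.tf1 in table 3.90.
x = np.loadtxt("times.tf1")          # assumed plain one-column file
# A deterministic linear trend in levels ('t') corresponds to a
# constant in the first-differenced series.
fit = ARIMA(x, order=(1, 1, 0), trend="t").fit()
print(fit.params)                    # trend and phi(1) estimates
print(fit.forecast(steps=3))         # three forecasts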


Original dimension (NX)   = 100
After differencing (NXD)  = 99
Non-seasonal order (ND)   = 1
Seasonal order (NDS)      = 0
Seasonality (NS)          = 0
No. of forecasts (NF)     = 3
No. of parameters (NP)    = 1
No. of iterations (ITC)   = 2
Sum of squares (SSQ)      = 0.199E+02
Parameter   Value        Std. err.    Type
phi( 1)     6.0081E-01   8.215E-02    Autoregressive
C( 0)       4.2959E-01   1.124E-01    Constant term
pred( 1)    4.3935E+01   4.530E-01    Forecast
pred( 2)    4.4208E+01   8.551E-01    Forecast
pred( 3)    4.4544E+01   1.233E+00    Forecast

Table 3.90: Fitting an ARIMA model to time series data

Figure 3.34: Fitting an ARIMA model to time series data


3.9.10 Survival analysis


3.9.10.1 Fitting one set of survival times

Figure 3.35: Analyzing one set of survival times

The idea is that you have one or more samples of survival times (page 285) with possible censoring, but no covariates, that you wish to analyze and compare for differences, using gcfit in mode 3, or simstat. In other words, you observe one or more groups and, at known times, you record the frequency of failure or censoring. You would want to calculate a nonparametric Kaplan-Meier estimate for the survivor function, as well as a maximum likelihood estimate for some supposed probability density function, such as the Weibull distribution. Finally, you would want to compare survivor functions in groups by comparing parameter estimates, or using the Mantel-Haenszel log rank test, or by resorting to some model such as the proportional hazards model, i.e. Cox regression by generalized linear modelling, particularly if covariates have to be taken into account. For example, figure 3.35 shows the result from analyzing the test file survive.tf4, which contains times for both failure and right censoring. Note that more advanced versions of such plots are described on page 255. Also, the parameter estimates from fitting the Weibull model (page 290) by maximum likelihood can be seen in table 3.91. To understand these results, note that if the times t_i are distinct and ordered failure times, i.e. t_{i-1} < t_i, and the number in the sample that have not failed by time t_i is n_i, while the number that do fail is d_i, then the estimated probabilities of failure and survival at time t_i are given by

$$\hat{p}(\text{failure}) = d_i/n_i$$
$$\hat{p}(\text{survival}) = (n_i - d_i)/n_i.$$

The Kaplan-Meier product limit nonparametric estimate of the survivor function (page 285) is defined as a step function which is given in the interval t_i to t_{i+1} by the product of survival probabilities up to time t_i, that is


Alternative MLE Weibull parameterizations
S(t) = exp[-{exp(beta)}t^B]
     = exp[-{lambda}t^B]
     = exp[-{A*t}^B]
Parameter   Value         Std. err.   ..95% conf. lim. ..     p
B           1.371E+00     2.38E-01     8.40E-01  1.90E+00     0.000
beta       -3.083E+00     6.46E-01    -4.52E+00 -1.64E+00     0.001
lambda      4.583E-02     2.96E-02    -2.01E-02  1.12E-01     0.153 *
A           1.055E-01     1.77E-02     6.60E-02  1.45E-01     0.000
t-half      7.257E+00     1.36E+00     4.22E+00  1.03E+01     0.000
Correlation coefficient(beta,B) = -0.9412

Table 3.91: Survival analysis: one sample

$$\hat{S}(t) = \prod_{j=1}^{i}\left(\frac{n_j-d_j}{n_j}\right),\quad t_i \le t < t_{i+1},$$

with variance estimated by Greenwood's formula as

$$\hat{V}(\hat{S}(t)) = \hat{S}(t)^2\sum_{j=1}^{i}\frac{d_j}{n_j(n_j-d_j)}.$$

It is understood in this calculation that, if failure and censoring occur at the same time, the failure is regarded as having taken place just before that time and the censoring just after it. To understand fitting the Weibull distribution, note that maximum likelihood parameter and standard error estimates are reported for three alternative parameterizations, namely

$$S(t) = \exp(-\exp(\beta)t^B) = \exp(-\lambda t^B) = \exp(-(At)^B).$$

Since the density and survivor function are

$$f(t) = \lambda B t^{B-1}\exp(-\lambda t^B)$$
$$S(t) = \exp(-\lambda t^B),$$

and there are d failures and n - d right censored observations, the likelihood function l(B, lambda) is proportional to the product of the d densities for the failures in the overall set of n observations and the survivor functions, that is

$$l(B,\lambda) \propto (\lambda B)^d\left(\prod_{i\in D}t_i^{B-1}\right)\exp\left(-\lambda\sum_{i=1}^{n}t_i^B\right)$$

where D is the set of failure times. Actually, the log-likelihood objective function

$$L(B,\beta) = d\log(B) + d\beta + (B-1)\sum_{i\in D}\log(t_i) - \exp(\beta)\sum_{i=1}^{n}t_i^B$$


with lambda = exp(beta) is better conditioned, so it is maximized and the partial derivatives

$$L_1 = \partial L/\partial\beta,\ L_2 = \partial L/\partial B,\ L_{11} = \partial^2 L/\partial\beta^2,\ L_{12} = \partial^2 L/\partial\beta\partial B,\ L_{22} = \partial^2 L/\partial B^2$$

are used to form the standard errors and correlation coefficient according to

$$\hat{se}(\hat{\beta}) = \sqrt{-L_{22}/(L_{11}L_{22}-L_{12}^2)}$$
$$\hat{se}(\hat{B}) = \sqrt{-L_{11}/(L_{11}L_{22}-L_{12}^2)}$$
$$\mathrm{corr}(\hat{\beta},\hat{B}) = L_{12}/\sqrt{L_{11}L_{22}}.$$
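To make the Kaplan-Meier and Greenwood formulas of this subsection concrete, here is a minimal sketch (numpy assumed; the function name is hypothetical, the convention that failures precede censoring at tied times is followed, and no guard is included for the degenerate case where everyone at risk fails at the same time):

import numpy as np

def kaplan_meier(times, censored):
    # Kaplan-Meier product limit estimate with Greenwood variances.
    # times: observation times; censored: 1 = right censored, 0 = failed.
    t, c = np.asarray(times, float), np.asarray(censored, int)
    S, varsum, out = 1.0, 0.0, []
    for ti in np.unique(t[c == 0]):            # distinct failure times
        n_i = np.sum(t >= ti)                  # number still at risk
        d_i = np.sum((t == ti) & (c == 0))     # number failing at t_i
        S *= (n_i - d_i) / n_i
        varsum += d_i / (n_i * (n_i - d_i))
        out.append((ti, S, S * S * varsum))    # time, S-hat, Greenwood var
    return out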

3.9.10.2 Comparing two sets of survival times


Figure 3.36: Analyzing two sets of survival times

As an example of how to compare two data sets, consider the pairwise comparison of the survival times in survive.tf3 and survive.tf4, leading to the results of figure 3.36. Note that you can plot the hazards and the other usual transforms, and do graphical tests for the proportional hazards model. For instance, the transformed Kaplan-Meier nonparametric survivor functions in figure 3.36 should be approximately linear and parallel if the proportional hazards assumption and also the Weibull survival model are justified. To prepare your own data you must first browse the test files survive.tf? and understand the format (column 1 is time, column 2 is 0 for failure and 1 for right censoring, column 3 is frequency), then use program makmat. To understand the graphical and statistical tests used to compare two samples, and to appreciate the results displayed in table 3.92, consider the relationship

Results for the Mantel-Haenszel (log-rank) test
H0: h_A(t) = h_B(t) (equal hazards)
H1: h_A(t) = theta*h_B(t) (proportional hazards)
QMH test statistic = 1.679E+01
P(chi-sq. >= QMH)  = 0.0000    Reject H0 at 1% s-level
Estimate for theta = 1.915E-01
95% conf. range    = 8.280E-02, 4.429E-01

Table 3.92: Survival analysis: two samples


between the cumulative hazard function H(t) and the hazard function h(t) defined as follows

$$h(t) = f(t)/S(t)$$
$$H(t) = \int_0^t h(u)\,du = -\log(S(t)).$$

So various graphs can be plotted to explore the form of the cumulative survivor functions for the commonly used models based on the identities

Exponential:   H(t) = At
Weibull:       log(H(t)) = log A^B + B log t
Gompertz:      log(h(t)) = log B + At
Extreme value: log(H(t)) = alpha(t - beta).

For instance, for the Weibull distribution, a plot of log(-log(S-hat(t))) against log t, i.e. of the type plotted in figure 3.36, should be linear, and the proportional hazards assumption would merely alter the constant term since, for h(t) = theta AB(At)^{B-1},

$$\log(-\log(S(t))) = \log\theta + \log A^B + B\log t.$$

Testing for the presence of a constant of proportionality in the proportional hazards assumption amounts to testing the value of theta with respect to unity. If the confidence limits in table 3.92 enclose 1, this can be taken as suggesting equality of the two hazard functions, and hence equality of the two distributions, since equal hazards implies equal distributions. The QMH statistic given in table 3.92 can be used in a chi-square test with one degree of freedom for equality of distributions, and it arises by considering the 2 by 2 contingency tables at each distinct time point t_j of the following type.

            Group A          Group B          Total
Died        d_jA             d_jB             d_j
Survived    n_jA - d_jA      n_jB - d_jB      n_j - d_j
Total       n_jA             n_jB             n_j

Here the total number at risk n_j at time t_j also includes subjects subsequently censored, while the numbers d_jA and d_jB actually dying can be used to estimate expectations and variances such as

$$E(d_{jA}) = n_{jA}d_j/n_j$$
$$V(d_{jA}) = \frac{d_j(n_j-d_j)n_{jA}n_{jB}}{n_j^2(n_j-1)}.$$

Now, using the sums

$$O_A = \sum d_{jA},\quad E_A = \sum E(d_{jA}),\quad V_A = \sum V(d_{jA})$$

as in the Mantel-Haenszel test, the log rank statistic can be calculated as

$$QMH = \frac{(O_A-E_A)^2}{V_A}.$$

Clearly, the graphical test, the value of theta, the 95% confidence range, and the chi-square test with one degree of freedom support the assumption of a Weibull distribution with proportional hazards in this case. The advanced technique for plotting survival analysis data is described on page 255.
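The QMH calculation above can be sketched as follows (numpy and scipy assumed; the function name and the 0/1 group coding are illustrative conventions, not SIMFIT code):

import numpy as np
from scipy.stats import chi2

def logrank_qmh(times, censored, group):
    # Mantel-Haenszel (log-rank) QMH statistic for two groups A and B.
    t = np.asarray(times, float)
    c = np.asarray(censored, int)          # 1 = right censored, 0 = failed
    A = np.asarray(group) == 0             # True for group A members
    OA = EA = VA = 0.0
    for tj in np.unique(t[c == 0]):
        at_risk = t >= tj
        nj, njA = at_risk.sum(), (at_risk & A).sum()
        njB = nj - njA
        died = (t == tj) & (c == 0)
        dj, djA = died.sum(), (died & A).sum()
        OA += djA
        EA += njA * dj / nj
        if nj > 1:
            VA += dj * (nj - dj) * njA * njB / (nj**2 * (nj - 1))
    QMH = (OA - EA)**2 / VA
    return QMH, chi2.sf(QMH, 1)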


3.9.10.3 Survival analysis using generalized linear models

Many survival models can be fitted to n uncensored and m right censored survival times with associated explanatory variables using the GLM technique from linfit, gcfit in mode 4, or simstat. For instance, the simplified interface allows you to read in data for the covariates x, the variable y which can be either 1 for right-censoring or 0 for failure, together with the times t in order to fit survival models. With a density f(t), survivor function S(t) = 1 - F(t) and hazard function h(t) = f(t)/S(t) a proportional hazards model is assumed for t >= 0 with

$$h(t_i) = \lambda(t_i)\exp\left(\sum_j\beta_j x_{ij}\right) = \lambda(t_i)\exp(\beta^T x_i)$$
$$\Lambda(t) = \int_0^t\lambda(u)\,du$$
$$S(t) = \exp(-\Lambda(t)\exp(\beta^T x))$$
$$f(t) = \lambda(t)\exp(\beta^T x - \Lambda(t)\exp(\beta^T x)).$$

3.9.10.4 The exponential survival model

The exponential model has constant hazard and is particularly easy to fit, since, with eta = beta^T x,

$$f(t) = \exp(\eta - t\exp(\eta))$$
$$F(t) = 1 - \exp(-t\exp(\eta))$$
$$\lambda(t) = 1$$
$$\Lambda(t) = t$$
$$h(t) = \exp(\eta)$$
$$E(t) = \exp(-\eta),$$

so this simply involves fitting a GLM model with Poisson error type, a log link, and a calculated offset of log(t). The selection of a Poisson error type, the log link and the calculation of offsets are all done automatically by the simplified interface from the data provided, as will be appreciated on fitting the test file cox.tf1. It should be emphasized that the values for y in the simplified GLM procedure for survival analysis must be either y = 0 for failure or y = 1 for right censoring, and the actual time for failure t must be supplied paired with the y values. Internally, the SIMFIT simplified GLM interface reverses the y values to define the Poisson variables and uses the t values to calculate offsets automatically. Users who wish to use the advanced GLM interface for survival analysis must be careful to declare the Poisson variables correctly and provide the appropriate offsets as offset vectors. Results from the analysis of cox.tf1 are shown in table 3.93.
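The same Poisson-with-offset construction can be sketched with statsmodels (an assumption on my part, not the SIMFIT implementation; the small data set here is invented purely for illustration):

import numpy as np
import statsmodels.api as sm

# Exponential survival as a Poisson GLM with log link and offset log(t):
# the Poisson response is the failure indicator (1 = failed, 0 = censored),
# i.e. the y column reversed, as described above.
t = np.array([1.0, 3.0, 5.0, 7.0, 2.0, 9.0])                 # times
y = np.array([0, 0, 1, 0, 1, 0])                             # 1 = censored
X = sm.add_constant(np.array([[0.], [0.], [0.], [1.], [1.], [1.]]))
fit = sm.GLM(1 - y, X, family=sm.families.Poisson(),
             offset=np.log(t)).fit()
print(fit.params, fit.deviance)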
3.9.10.5 The Weibull survival model

Weibull survival is similarly easy to fit, but is much more versatile than the exponential model on account of the extra shape parameter alpha, as in the following equations:

\[ f(t) = \alpha t^{\alpha-1} \exp(\eta - t^{\alpha}\exp(\eta)) \]
\[ F(t) = 1 - \exp(-t^{\alpha}\exp(\eta)) \]
\[ \lambda(t) = \alpha t^{\alpha-1}, \quad \Lambda(t) = t^{\alpha} \]
\[ h(t) = \alpha t^{\alpha-1}\exp(\eta) \]
\[ E(t) = \Gamma(1 + 1/\alpha)\exp(-\eta/\alpha). \]

However, this time the offset is alpha*log(t), where alpha has to be estimated iteratively and the covariance matrix subsequently adjusted to allow for the extra parameter that has been estimated. The iteration to estimate alpha and the covariance matrix adjustments are done automatically by the SIMFIT simplified GLM interface, and the deviance is also adjusted by a term -2n log(alpha-hat).


Model: exponential survival
No. parameters = 4, Rank = 4, No. points = 33, Deg. freedom = 29
Parameter   Value        95% conf. limits             Std.error   p
Constant   -5.150E+00   -6.201E+00   -4.098E+00      5.142E-01   0.0000
B( 1)       4.818E-01    1.146E-01    8.490E-01      1.795E-01   0.0119
B( 2)       1.870E+00    3.740E-01    3.367E+00      7.317E-01   0.0161
B( 3)      -3.278E-01   -8.310E-01    1.754E-01      2.460E-01   0.1931 **
Deviance = 3.855E+01, A = 1.000E+00

Model: Weibull survival
No. parameters = 4, Rank = 4, No. points = 33, Deg. freedom = 29
Parameter   Value        95% conf. limits             Std.error   p
Constant   -5.041E+00   -6.182E+00   -3.899E+00      5.580E-01   0.0000
B( 1)       4.761E-01    1.079E-01    8.443E-01      1.800E-01   0.0131
B( 2)       1.841E+00    3.382E-01    3.344E+00      7.349E-01   0.0181
B( 3)      -3.244E-01   -8.286E-01    1.798E-01      2.465E-01   0.1985 **
Alpha       9.777E-01    8.890E-01    1.066E+00      4.336E-02   0.0000
Deviance = 3.706E+01
Deviance - 2n*log[alpha] = 3.855E+01

Model: Cox proportional hazards
No. parameters = 3, No. points = 33, Deg. freedom = 30
Parameter   Value        95% conf. limits             Std.error   p
B( 1)       7.325E-01    2.483E-01    1.217E+00      2.371E-01   0.0043
B( 2)       2.756E+00    7.313E-01    4.780E+00      9.913E-01   0.0093
B( 3)      -5.792E-01   -1.188E+00    2.962E-02      2.981E-01   0.0615
Deviance = 1.315E+02

Table 3.93: GLM survival analysis
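The iteration for alpha can be sketched by profiling the adjusted deviance: for each fixed alpha, fit the Poisson GLM with offset alpha*log(t), then choose alpha to minimize D(alpha) - 2n log(alpha), the adjusted quantity quoted in table 3.93. This continues the previous sketch (reusing its y, X and t) and mirrors, but does not reproduce, SIMFIT's internal procedure.

import numpy as np
import statsmodels.api as sm
from scipy.optimize import minimize_scalar

def adjusted_deviance(alpha, y, X, t):
    # Poisson GLM with offset alpha*log(t); profile objective for alpha
    fit = sm.GLM(y, X, family=sm.families.Poisson(),
                 offset=alpha * np.log(t)).fit()
    return fit.deviance - 2.0 * len(y) * np.log(alpha)

res = minimize_scalar(adjusted_deviance, bounds=(0.05, 20.0),
                      args=(y, X, t), method="bounded")
print("alpha =", res.x)   # for cox.tf1 this corresponds to Alpha in table 3.93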
3.9.10.6 The extreme value survival model

Extreme value survival is defined by

\[ f(t) = \alpha \exp(\alpha t + \beta) \exp(-\exp(\alpha t + \beta)), \]

which is easily fitted, as it is transformed by u = exp(alpha*t) into Weibull form, and so can be fitted as a Weibull model using t instead of log(t) as offset. However, it is not so useful as a model since the hazard increases exponentially and the density is skewed to the left.
3.9.10.7 The Cox proportional hazards model

This model assumes an arbitrary baseline hazard function lambda_0(t) so that the hazard function is

\[ h(t) = \lambda_0(t)\exp(\beta^T x). \]

It should first be noted that Cox regression techniques will often yield slightly different parameter estimates, as these will often depend on the starting estimates, and also since there are alternative procedures for allowing for ties in the data. In order to allow for Cox's exact treatment of ties in the data, i.e., more than one failure or censoring at each time point, this model is fitted by the SIMFIT GLM techniques after first calculating the risk sets at failure times t_i, that is, the sets of subjects that fail or are censored at time t_i plus those who survive beyond time t_i. Then the model is fitted using the technique for conditional logistic analysis of stratified data (section 3.6.4). The model does not involve calculating an explicit constant as that is subsumed into the arbitrary baseline function. However, the model can accommodate strata in two ways. With just a few strata, dummy indicator variables can be defined as in test files cox.tf2 and cox.tf3 but, with large numbers of strata, data should be prepared as for cox.tf4.

As an example, consider the results shown in table 3.93 from fitting an exponential, Weibull, then Cox model to data in the test file cox.tf1. In this case there is little improvement from fitting a Weibull model after an exponential model, as shown by the deviances and half normal residuals plots. The deviances from the full models (exponential, Weibull, extreme value) can be compared for goodness of fit, but they cannot be compared directly to the Cox deviance.
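For comparison, Cox regression is also available in the independent Python package lifelines (not SIMFIT); note that lifelines uses Efron's approximation for ties by default, so its estimates can differ slightly from both the Breslow and exact treatments discussed here. The data frame layout below is illustrative.

import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "T":  [2.0, 3.0, 3.0, 5.0, 7.0, 8.0, 9.0],   # survival times
    "E":  [1,   1,   0,   1,   1,   0,   1],     # 1 = failure, 0 = right-censored
    "x1": [0.1, -0.4, 1.2, 0.7, -0.9, 0.3, 1.5],
    "x2": [1,    0,   0,   1,    1,   0,   1],
})
cph = CoxPHFitter()
cph.fit(df, duration_col="T", event_col="E")
cph.print_summary()   # parameter estimates, standard errors, confidence limits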
3.9.10.8 Comprehensive Cox regression

Note that the Cox model can be completed by assuming a baseline hazard function, such as a piecewise exponential function, and the advantage in doing this is that the survivor functions for the strata can be computed and the residuals can be used for goodness of fit analysis. Figure 3.37 illustrates the analysis of cox.tf4 using the comprehensive Cox regression procedure to calculate parameter scores, residuals, and survivor functions, in addition to parameter estimates. This data set has three covariates and three strata, hence there are three survivor functions, one for each stratum. It is frequently beneficial to plot the survivor functions in order to visualize the differences in survival between different subgroups, i.e., strata, and in this case the differences are clear.

[Figure 3.37: Cox regression survivor functions. S(t) = 1 - F(t) plotted against time for strata 1, 2, 3.]

It should be pointed out that parameter estimates using the comprehensive procedure will be slightly different from parameter estimates obtained by the GLM procedure if there are ties in the data, as the Breslow approximation for ties is used by the comprehensive procedure, unlike the Cox exact method which is employed by the GLM procedures. Another advantage of the comprehensive procedure is that experienced users can input a vector of offsets, as the assumed model is actually

\[ \lambda(t, z) = \lambda_0(t)\exp(\beta^T x + \omega) \]

for parameters beta, covariates x and offset omega. Then the maximum likelihood estimates for beta are obtained by maximizing the Kalbfleisch and Prentice approximate marginal likelihood

\[ L = \prod_{i=1}^{n_d} \frac{\exp(\beta^T s_i + \omega_i)}{\big[\sum_{l \in R(t_{(i)})} \exp(\beta^T x_l + \omega_l)\big]^{d_i}}, \]

where n_d is the number of distinct failure times, s_i is the sum of the covariates of individuals observed to fail at t_{(i)}, and R(t_{(i)}) is the set of individuals at risk just prior to t_{(i)}. In the case of multiple strata, the objective function is taken to be the sum of such expressions, one for each stratum. The survivor function exp(-H-hat(t_{(i)})) and residuals r(t_l) are calculated using

\[ \hat{H}(t_{(i)}) = \sum_{t_j \le t_i} \frac{d_i}{\sum_{l \in R(t_{(i)})} \exp(\hat{\beta}^T x_l + \omega_l)} \]
\[ r(t_l) = \hat{H}(t_l)\exp(\hat{\beta}^T x_l + \omega_l), \]

where there are d_i failures at t_{(i)}.


3.9.11 Statistical calculations


In data analysis it is frequently necessary to perform calculations rather than tests, e.g., examining confidence limits for parameters estimated from data, or plotting power as a function of sample size when designing an experiment. A brief description of such procedures follows.
3.9.11.1 Statistical power and sample size

Experiments often generate random samples from a population so that parameters estimated from the samples can be used to test hypotheses about the population parameters. So it is natural to investigate the relationship between sample size and the absolute precision of the estimates, given the expectation E(X) and variance sigma^2(X) of the random variable. For a single observation, i.e., n = 1, the Chebyshev inequality

\[ P(|X - E(X)| < \epsilon) \ge 1 - \frac{\sigma^2(X)}{\epsilon^2}, \]

with epsilon > 0, indicates that, for an unspecified distribution,

\[ P(|X - E(X)| < 4.5\,\sigma(X)) \ge 0.95, \quad\text{and}\quad P(|X - E(X)| < 10\,\sigma(X)) \ge 0.99, \]

but, for an assumed normal distribution,

\[ P(|X - E(X)| < 1.96\,\sigma(X)) \approx 0.95, \quad\text{and}\quad P(|X - E(X)| < 2.58\,\sigma(X)) \approx 0.99. \]

However, provided that E(X) is nonzero, it is more useful to formulate the Chebyshev inequality in terms of the relative precision, that is, for delta > 0,

\[ P\left(\left|\frac{X - E(X)}{E(X)}\right| < \delta\right) \ge 1 - \frac{1}{\delta^2}\frac{\sigma^2(X)}{E(X)^2}. \]

Now, for an unspecified distribution,

\[ P\left(\left|\frac{X - E(X)}{E(X)}\right| < 4.5\,\frac{\sigma(X)}{|E(X)|}\right) \ge 0.95, \quad\text{and}\quad P\left(\left|\frac{X - E(X)}{E(X)}\right| < 10\,\frac{\sigma(X)}{|E(X)|}\right) \ge 0.99, \]

but, for an assumed normal distribution,

\[ P\left(\left|\frac{X - E(X)}{E(X)}\right| < 1.96\,\frac{\sigma(X)}{|E(X)|}\right) \approx 0.95, \quad\text{and}\quad P\left(\left|\frac{X - E(X)}{E(X)}\right| < 2.58\,\frac{\sigma(X)}{|E(X)|}\right) \approx 0.99. \]

So, for high precision, the coefficient of variation cv%

\[ \text{cv\%} = 100\,\frac{\sigma(X)}{|E(X)|} \]


must be as small as possible, while the signal-to-noise ratio SN(X)

\[ SN(X) = \frac{|E(X)|}{\sigma(X)} \]

must be as large as possible. For instance, for the single measurement to be within 10% of the mean 95% of the time requires SN >= 45 for an arbitrary distribution, or SN >= 20 for a normal distribution. A particularly valuable application of these results concerns the way that the signal-to-noise ratio of sample means depends on the sample size n. From

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i, \]
\[ \text{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n}\text{Var}(X) = \frac{1}{n}\sigma^2(X), \]

it follows that, for arbitrary distributions, the signal-to-noise ratio of the sample mean SN(X-bar) is given by SN(X-bar) = sqrt(n) SN(X), that is,

\[ SN(\bar{X}) = \sqrt{n}\,\frac{E(X)}{\sigma(X)}. \]

This result, known as the law of sqrt(n), implies that the signal-to-noise ratio of the sample mean as an estimate of the population mean increases as sqrt(n), so that the relative error in estimating the mean decreases like 1/sqrt(n).

If f(x) is the density function for a random variable X, then the null and alternative hypotheses can sometimes be expressed as

\[ H_0: f(x) = f_0(x) \]
\[ H_1: f(x) = f_1(x), \]

while the error sizes, given a critical region C, are

\[ \alpha = P_{H_0}(\text{reject } H_0) \text{ (i.e., the Type I error)} = \int_C f_0(x)\,dx \]
\[ \beta = P_{H_1}(\text{accept } H_0) \text{ (i.e., the Type II error)} = 1 - \int_C f_1(x)\,dx. \]

Usually alpha is referred to as the significance level, beta is the operating characteristic, while 1 - beta is the power, frequently expressed as a percentage, i.e., 100(1 - beta)%, and these will both alter as the critical region is changed. Figure 3.38 illustrates the concepts of signal-to-noise ratio, significance level, and power. The family of curves on the left are the probability density functions for the distribution of the sample mean x-bar from a normal distribution with mean mu = 0 and variance sigma^2 = 1. The curves on the right illustrate the significance level alpha, and operating characteristic beta for the null and alternative hypotheses

\[ H_0: \mu = 0, \sigma^2 = 4 \]
\[ H_1: \mu = 1, \sigma^2 = 4 \]

for a test using the sample mean from a sample of size n = 25 from a normal distribution, with a critical point C = 0.4. The significance level alpha is the area under the curve for H_0 to the right of the critical point, while the operating characteristic beta is the area under the curve for H_1 to the left of the critical point. Clearly, increasing the critical value C will decrease alpha and increase beta, while increasing the sample size n will decrease both alpha and beta.

[Figure 3.38: Significance level and power. Left panel: distribution of the sample mean as a function of sample size (n = 2, 4, 8, 16, 32). Right panel: significance level and power for H_0 and H_1 with critical point C = 0.4.]

Often it is wished to predict power as a function of sample size, which can sometimes be done if distributions f_0(x) and f_1(x) are assumed, necessary parameters are provided, the critical level is specified, and the test procedure is defined. Essentially, given an implicit expression in k unknowns, this option solves for one given the other k - 1, using iterative techniques. For instance, you might set alpha and beta, then calculate the sample size n required, or you could input alpha and n and estimate the power. Note that 1-tail tests can sometimes be selected instead of 2-tail tests (e.g., by replacing Z_{alpha/2} by Z_{alpha} in the appropriate formula) and also be very careful to make the correct choice for supplying proportions, half-widths, absolute differences, theoretical parameters or sample estimates, etc. A word of warning is required on the subject of calculating n required for a given power. The values of n will usually prove to be very large, probably much larger than can be used. So, for pilot studies and typical probing investigations, the sample sizes should be chosen according to cost, time, availability of materials, past experience, and so on. Sample size calculations are only called for when Type II errors may have serious consequences, as in clinical trials, so that large samples are justified. Of course, the temptation to choose 1-tail instead of 2-tail tests, or use variance estimates that are too small, in order to decrease the n values should be avoided.
3.9.11.2 Power calculations for 1 binomial sample

The calculations are based on the binomial test (page 103), the binomial distribution (page 283), and the normal approximation to it for large samples and p not close to 0 or 1, using the normal distribution (page 286). If the theoretical binomial parameters p_0 and q_0 = 1 - p_0 are not too close to 0 or 1 and it is wished to estimate this with an error of at most delta, then the sample size required is

\[ n = \frac{Z_{\alpha/2}^2\, p_0 q_0}{\delta^2}, \]

where P(Z > Z_{alpha/2}) = alpha/2, or Phi(Z_{alpha/2}) = 1 - alpha/2, which, for many purposes, can be approximated by n = 1/delta^2. The power in a binomial or sign test can be approximated, again if the sample estimates p_1 and q_1 = 1 - p_1 are not too close to 0 or 1, by

\[ 1 - \beta = P\left(Z < \frac{p_1 - p_0 - Z_{\alpha/2}\sqrt{p_0 q_0/n}}{\sqrt{p_1 q_1/n}}\right) + P\left(Z > \frac{p_1 - p_0 + Z_{\alpha/2}\sqrt{p_0 q_0/n}}{\sqrt{p_1 q_1/n}}\right). \]
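These formulas are easy to evaluate directly; the following minimal sketch uses scipy for the normal quantiles and tail probabilities, with illustrative values for p_0, p_1 and delta.

import numpy as np
from scipy.stats import norm

alpha, p0, delta = 0.05, 0.5, 0.05
z = norm.ppf(1 - alpha / 2)                       # Z_{alpha/2}
n = int(np.ceil(z**2 * p0 * (1 - p0) / delta**2))
print("n to estimate p0 within", delta, ":", n)   # close to 1/delta^2 here

# Approximate power of the binomial test of p = p0 against a true p1:
p1, n = 0.6, 100
q0, q1 = 1 - p0, 1 - p1
power = (norm.cdf((p1 - p0 - z * np.sqrt(p0 * q0 / n)) / np.sqrt(p1 * q1 / n))
         + norm.sf((p1 - p0 + z * np.sqrt(p0 * q0 / n)) / np.sqrt(p1 * q1 / n)))
print("power =", power)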

3.9.11.3 Power calculations for 2 binomial samples

For two sample proportions p_1 and p_2 that are similar and not too close to 0 or 1, the sample size n and power 1 - beta associated with a binomial test for H_0: p_01 = p_02 can be estimated using one of numerous methods


based upon normal approximations. For example,

\[ n = \frac{(p_1 q_1 + p_2 q_2)(Z_{\alpha/2} + Z_{\beta})^2}{(p_1 - p_2)^2}, \]
\[ Z_{\beta} = \sqrt{\frac{n(p_1 - p_2)^2}{p_1 q_1 + p_2 q_2}} - Z_{\alpha/2}, \]
\[ \beta = P(Z \ge Z_{\beta}), \]
\[ 1 - \beta = \Phi(Z_{\beta}). \]
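As a quick check of these two-sample approximations, the following sketch (illustrative values, not SIMFIT output) solves for n at given power, then inverts the same expression to recover the power at a given n.

import numpy as np
from scipy.stats import norm

alpha, beta = 0.05, 0.20
p1, p2 = 0.6, 0.4
q1, q2 = 1 - p1, 1 - p2
za, zb = norm.ppf(1 - alpha / 2), norm.ppf(1 - beta)

n = (p1 * q1 + p2 * q2) * (za + zb)**2 / (p1 - p2)**2
print("n per group =", int(np.ceil(n)))

n = 100                                    # power for a given n
z_beta = np.sqrt(n * (p1 - p2)**2 / (p1 * q1 + p2 * q2)) - za
print("power =", norm.cdf(z_beta))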

Power for the Fisher exact test (page 99) with sample size n used to estimate both p_1 and p_2, as for the binomial test, can be calculated using

\[ 1 - \beta = 1 - \sum_{r=0}^{2n}\,\sum_{x \notin C_r} \binom{n}{x}\binom{n}{r-x} p_1^{x} q_1^{n-x} p_2^{r-x} q_2^{n-r+x}, \]

where r = total successes, x = number of successes in the group, and C_r = the critical region. This can be inverted by SIMFIT to estimate n, but unfortunately the sample sizes required may be too large to implement by the normal procedure of enumerating probabilities for all 2 by 2 contingency tables with consistent marginals.
3.9.11.4 Power calculations for 1 normal sample

The calculations are based upon the confidence limit formula for the population mean mu from a sample of size n, using the sample mean x-bar, sample variance s^2 and the t distribution (page 288), as follows:

\[ P\left(\bar{x} - t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}\right) = 1 - \alpha, \]

where x-bar = sum(x_i)/n, s^2 = sum((x_i - x-bar)^2)/(n - 1), P(t <= t_{alpha/2,nu}) = 1 - alpha/2, and nu = n - 1. You input the sample variance, which should be calculated using a sample size comparable to those predicted above. Power calculations can be done using the half width h = t_{alpha/2,n-1} s/sqrt(n), or using the absolute difference delta between the population mean and the null hypothesis mean as argument. The following options are available:

- To calculate the sample size necessary to estimate the true mean within a half width h:

\[ n = \frac{s^2\, t_{\alpha/2,n-1}^2}{h^2}; \]

- To calculate the sample size necessary for an absolute difference delta:

\[ n = \frac{s^2}{\delta^2}\left(t_{\alpha/2,n-1} + t_{\beta,n-1}\right)^2; \text{ or} \]


- To estimate the power:

\[ t_{\beta,n-1} = \frac{\delta}{\sqrt{s^2/n}} - t_{\alpha/2,n-1}. \]

It should be noted that the sample size occurs in the degrees of freedom for the t distribution, necessitating an iterative solution to estimate n.
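One way to carry out that iteration is a simple fixed-point scheme: update n from the formula, recompute the t quantiles with the new degrees of freedom, and repeat until n stabilizes. The sketch below uses scipy, with illustrative values for s^2 and delta.

import numpy as np
from scipy.stats import t

alpha, beta, s2, delta = 0.05, 0.20, 4.0, 1.0
n = 10                                       # starting estimate
for _ in range(100):
    df = n - 1                               # n appears in the degrees of freedom
    n_new = s2 * (t.ppf(1 - alpha / 2, df) + t.ppf(1 - beta, df))**2 / delta**2
    n_new = int(np.ceil(n_new))
    if n_new == n:
        break
    n = n_new
print("required n =", n)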
3.9.11.5 Power calculations for 2 normal samples

These calculations are based upon the same type of t test approach (page 91) as just described for 1 normal sample, except that the pooled variance s_p^2 should be input as the estimate for the common variance sigma^2, i.e.,

\[ s_p^2 = \frac{\sum_{i=1}^{n_x}(x_i - \bar{x})^2 + \sum_{j=1}^{n_y}(y_j - \bar{y})^2}{n_x + n_y - 2}, \]

where X has sample size n_x and Y has sample size n_y. The following options are available:

- To calculate the sample size necessary to estimate the difference between the two population means within a half width h:

\[ n = \frac{2 s_p^2\, t_{\alpha/2,2n-2}^2}{h^2}; \]

- To calculate the sample size necessary to detect an absolute difference delta between population means:

\[ n = \frac{2 s_p^2}{\delta^2}\left(t_{\alpha/2,2n-2} + t_{\beta,2n-2}\right)^2; \text{ or} \]

- To estimate the power:

\[ t_{\beta,2n-2} = \frac{\delta}{\sqrt{2 s_p^2/n}} - t_{\alpha/2,2n-2}. \]

The t test has maximum power when n_x = n_y but, if the two sample sizes are unequal, calculations based on the harmonic mean n_h should be used, i.e.,

\[ n_h = \frac{2 n_x n_y}{n_x + n_y}, \quad\text{so that}\quad n_y = \frac{n_h n_x}{2 n_x - n_h}. \]
3.9.11.6 Power calculations for k normal samples

The calculations are based on the 1-way analysis of variance technique (page 112). Note that the SIMFIT power as a function of sample size procedure also allows you to plot power as a function of sample size (page 259), which is particularly useful with ANOVA designs where the number of columns k can be of interest, in addition to the number per sample n. The power calculation involves the F and non-central F distributions (page 289) and you calculate the required n values by using graphical estimation to obtain starting estimates for the iteration. If you choose an n value that is sufficient to make the power as a function of n plot cross the critical power, the program then calculates the power for sample sizes adjacent to the intersection, which is of use when studying k and n for ANOVA.


3.9.11.7 Power calculations for 1 and 2 variances

The calculations depend on the fact that, for a sample of size n from a normal distribution with true variance sigma_0^2, the function chi^2 defined as

\[ \chi^2 = \frac{(n-1)s^2}{\sigma_0^2} \]

is distributed as a chi-square variable (page 289) with n - 1 degrees of freedom. Also, given variance estimates s_x^2 and s_y^2 obtained with sample sizes n_x and n_y from the same normal distribution, the variance ratio F (page 91) defined as

\[ F = \max\left(\frac{s_x^2}{s_y^2}, \frac{s_y^2}{s_x^2}\right) \]

is distributed as an F variable (page 289) with either n_x, n_y or n_y, n_x degrees of freedom. If possible n_x should equal n_y, of course. The 1-tailed options available are:

- H_0: sigma^2 <= sigma_0^2 against H_1: sigma^2 > sigma_0^2

\[ 1 - \beta = P(\chi^2 \ge \chi^2_{\alpha,n-1}\,\sigma_0^2/s^2); \]

- H_0: sigma^2 >= sigma_0^2 against H_1: sigma^2 < sigma_0^2

\[ 1 - \beta = P(\chi^2 \le \chi^2_{1-\alpha,n-1}\,\sigma_0^2/s^2); \text{ or} \]

- Rearranging the samples, if necessary, so that s_x^2 > s_y^2, then H_0: sigma_x^2 = sigma_y^2 against H_1: sigma_x^2 and sigma_y^2 differ,

\[ Z_{\beta} = \sqrt{\frac{2m(n_y - 2)}{m + 1}}\,\log\frac{s_x^2}{s_y^2}, \quad\text{where } m = \frac{n_x - 1}{n_y - 1}. \]

3.9.11.8 Power calculations for 1 and 2 correlations

The correlation coefficient r (page 132) calculated from a normally distributed sample of size n has a standard error

\[ s_r = \sqrt{\frac{1 - r^2}{n - 2}} \]

and is an estimator of the population correlation rho. A test for zero correlation, i.e., H_0: rho = 0, can be based on the statistics

\[ t = \frac{r}{s_r}, \quad\text{or}\quad F = \frac{1 + |r|}{1 - |r|}, \]

where t has a t distribution with n - 2 degrees of freedom, and F has an F distribution with n - 2 and n - 2 degrees of freedom. The Fisher z transform and standard error s_z, defined as

\[ z = \tanh^{-1} r = \frac{1}{2}\log\frac{1+r}{1-r}, \quad s_z = \sqrt{\frac{1}{n-3}}, \]


are also used to test H_0: rho = rho_0, by calculating the unit normal deviate

\[ Z = \frac{z - \zeta_0}{s_z}, \]

where zeta_0 = tanh^{-1}(rho_0). The power is calculated using the critical value

\[ r_c = \sqrt{\frac{t_{\alpha/2,n-2}^2}{t_{\alpha/2,n-2}^2 + n - 2}}, \]

which leads to the transform z_c = tanh^{-1}(r_c) and

\[ Z_{\beta} = (z - z_c)\sqrt{n - 3}; \]

then the sample size required to reject H_0: rho = 0, when actually rho is nonzero, can be calculated using

\[ n = \left(\frac{Z_{\beta} + Z_{\alpha/2}}{\zeta_0}\right)^2 + 3. \]

For two samples, X of size n_x and Y of size n_y, where it is desired to test H_0: rho_x = rho_y, the appropriate Z statistic is

\[ Z = \frac{z_x - z_y}{s_{xy}}, \quad\text{where } s_{xy} = \sqrt{\frac{1}{n_x - 3} + \frac{1}{n_y - 3}}, \]

and the power and sample size are calculated from

\[ Z_{\beta} = \frac{|z_x - z_y|}{s_{xy}} - Z_{\alpha/2}, \]
\[ n = 2\left(\frac{Z_{\alpha/2} + Z_{\beta}}{z_x - z_y}\right)^2 + 3. \]
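The Fisher z calculations above translate directly into a few lines of Python; the values of rho_0, r_x, r_y and the sample sizes below are illustrative.

import numpy as np
from scipy.stats import norm

alpha, beta, rho0 = 0.05, 0.20, 0.5
zeta0 = np.arctanh(rho0)                      # tanh^{-1}(rho0)
za, zb = norm.ppf(1 - alpha / 2), norm.ppf(1 - beta)
n = ((zb + za) / zeta0)**2 + 3
print("n to reject rho = 0 when rho =", rho0, ":", int(np.ceil(n)))

# Two samples: Z statistic for H0: rho_x = rho_y
rx, ry, nx, ny = 0.6, 0.3, 40, 50
sxy = np.sqrt(1 / (nx - 3) + 1 / (ny - 3))
Z = (np.arctanh(rx) - np.arctanh(ry)) / sxy
print("Z =", Z, "two-sided p =", 2 * norm.sf(abs(Z)))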

3.9.11.9 Power calculations for a chi-square test

The calculations are based on the chi-square test (page 98) for either a contingency table, or sets of observed and expected frequencies. However, irrespective of whether the test is to be performed on a contingency table or on samples of observed and expected frequencies, the null and alternative hypotheses can be stated in terms of k probabilities as

H_0: the probabilities are p_0(i), for i = 1, 2, ..., k,
H_1: the probabilities are p_1(i), for i = 1, 2, ..., k.

The power can then be estimated using the non-central chi-square distribution with non-centrality parameter lambda and nu degrees of freedom given by

\[ \lambda = nQ, \quad\text{where } Q = \sum_{i=1}^{k} \frac{(p_0(i) - p_1(i))^2}{p_0(i)}, \]

n = total sample size, and nu = k - 1 - number of parameters estimated. You can either input the Q values directly, or read in vectors of observed and expected frequencies. If you do input frequencies f_i >= 0 they will be transformed internally into probabilities, i.e., the frequencies only have to be positive integers as they are normalized to sum to unity using

\[ p_i = f_i \Big/ \sum_{i=1}^{k} f_i. \]


In the case of contingency table data with r rows and c columns, the probabilities are calculated from the marginals p_{ij} = p(i)p(j) in the usual way, so you must input k = rc, and the number of parameters estimated as r + c - 2, so that nu = (r - 1)(c - 1).
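Since scipy provides the non-central chi-square distribution directly, the power calculation above can be sketched as follows, with illustrative probability vectors.

import numpy as np
from scipy.stats import chi2, ncx2

p0 = np.array([0.25, 0.25, 0.25, 0.25])
p1 = np.array([0.30, 0.30, 0.20, 0.20])
n = 200
Q = np.sum((p0 - p1)**2 / p0)
lam = n * Q                                  # non-centrality parameter
nu = len(p0) - 1                             # no parameters estimated here
crit = chi2.ppf(0.95, nu)                    # 5% critical value under H0
power = ncx2.sf(crit, nu, lam)               # P(statistic exceeds crit under H1)
print("lambda =", lam, "power =", power)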
3.9.11.10 Parameter confidence limits

You choose the distribution required and the significance level of interest, then input the estimates and sample sizes required. Note that the confidence intervals may be asymmetric for those distributions (Poisson, binomial) where exact methods are used, not calculations based on the normal approximation.
3.9.11.11 Confidence limits for a Poisson parameter

Given a sample x_1, x_2, ..., x_n of n non-negative integers from a Poisson distribution with parameter lambda (page 285), the parameter estimate lambda-hat, i.e., the sample mean, and confidence limits lambda_1, lambda_2 are calculated as follows:

\[ K = \sum_{i=1}^{n} x_i, \quad \hat{\lambda} = K/n, \]
\[ \lambda_1 = \frac{1}{2n}\chi^2_{2K,\alpha/2}, \quad \lambda_2 = \frac{1}{2n}\chi^2_{2K+2,1-\alpha/2}, \]

so that

\[ \sum_{x=K}^{\infty} \frac{\exp(-n\lambda_1)(n\lambda_1)^x}{x!} = \frac{\alpha}{2}, \quad \sum_{x=0}^{K} \frac{\exp(-n\lambda_2)(n\lambda_2)^x}{x!} = \frac{\alpha}{2}, \]

and P(lambda_1 <= lambda <= lambda_2) = 1 - alpha, using the lower tail critical points of the chi-square distribution (page 289). The following very approximate rule-of-thumb can be used to get a quick idea of the range of a Poisson mean lambda given a single count x, exploiting the fact that the Poisson variance equals the mean:

\[ P(x - 2\sqrt{x} \le \lambda \le x + 2\sqrt{x}) \approx 0.95. \]
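The chi-square based limits above can be evaluated with scipy in a few lines; the counts used here are illustrative.

import numpy as np
from scipy.stats import chi2

x = np.array([3, 5, 2, 4, 6, 1, 3])                   # n Poisson counts
n, K, alpha = len(x), x.sum(), 0.05
lam_hat = K / n
lam1 = chi2.ppf(alpha / 2, 2 * K) / (2 * n)           # lower limit
lam2 = chi2.ppf(1 - alpha / 2, 2 * K + 2) / (2 * n)   # upper limit
print(lam1, lam_hat, lam2)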
3.9.11.12 Confidence limits for a binomial parameter

For k successes in n trials, the binomial parameter estimate (page 283) p-hat is k/n, and three methods are used to calculate confidence limits p_1 and p_2 so that

\[ \sum_{x=k}^{n} \binom{n}{x} p_1^x (1 - p_1)^{n-x} = \alpha/2, \]
\[ \text{and } \sum_{x=0}^{k} \binom{n}{x} p_2^x (1 - p_2)^{n-x} = \alpha/2. \]

- If max(k, n - k) < 10^6, the lower tail probabilities of the beta distribution are used (page 290) as follows:

\[ p_1 = \beta_{k,n-k+1,\alpha/2}, \quad\text{and } p_2 = \beta_{k+1,n-k,1-\alpha/2}. \]


- If max(k, n - k) >= 10^6 and min(k, n - k) <= 1000, the Poisson approximation (page 285) with lambda = np and the chi-square distribution (page 289) are used, leading to

\[ p_1 = \frac{1}{2n}\chi^2_{2k,\alpha/2}, \quad\text{and } p_2 = \frac{1}{2n}\chi^2_{2k+2,1-\alpha/2}. \]

- If max(k, n - k) > 10^6 and min(k, n - k) > 1000, the normal approximation (page 286) with mean np and variance np(1 - p) is used, along with the lower tail normal deviates Z_{1-alpha/2} and Z_{alpha/2}, to obtain approximate confidence limits by solving

\[ \frac{k - np_1}{\sqrt{np_1(1 - p_1)}} = Z_{1-\alpha/2}, \quad\text{and } \frac{k - np_2}{\sqrt{np_2(1 - p_2)}} = Z_{\alpha/2}. \]

The following very approximate rule-of-thumb can be used to get a quick idea of the range of a binomial mean np given x, exploiting the fact that the binomial variance equals np(1 - p):

\[ P(x - 2\sqrt{x} \le np \le x + 2\sqrt{x}) \approx 0.95. \]
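The beta-distribution method above (the exact Clopper-Pearson limits) can be checked with scipy; k and n here are illustrative.

from scipy.stats import beta

k, n, alpha = 23, 40, 0.05
p1 = beta.ppf(alpha / 2, k, n - k + 1)        # lower limit
p2 = beta.ppf(1 - alpha / 2, k + 1, n - k)    # upper limit
print(k / n, (p1, p2))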

3.9.11.13 Confidence limits for a normal mean and variance

If the sample mean is x-bar, and the sample variance is s^2, with a sample of size n from a normal distribution (page 286) having mean mu and variance sigma^2, the confidence limits are defined by

\[ P(\bar{x} - t_{\alpha/2,n-1}\,s/\sqrt{n} \le \mu \le \bar{x} + t_{\alpha/2,n-1}\,s/\sqrt{n}) = 1 - \alpha, \]
\[ \text{and } P((n-1)s^2/\chi^2_{\alpha/2,n-1} \le \sigma^2 \le (n-1)s^2/\chi^2_{1-\alpha/2,n-1}) = 1 - \alpha, \]

where the upper tail probabilities of the t (page 288) and chi-square (page 289) distributions are used.
3.9.11.14 Confidence limits for a correlation coefficient

If a Pearson product-moment correlation coefficient r (page 132) is calculated from two samples of size n that are jointly distributed as a bivariate normal distribution (page 287), the confidence limits for the population parameter rho are given by

\[ P\left(\frac{r - r_c}{1 - r r_c} \le \rho \le \frac{r + r_c}{1 + r r_c}\right) = 1 - \alpha, \]

where

\[ r_c = \sqrt{\frac{t_{\alpha/2,n-2}^2}{t_{\alpha/2,n-2}^2 + n - 2}}. \]

3.9.11.15 Confidence limits for trinomial parameters

If, in a trinomial distribution (page 284), the probability of category i is p_i for i = 1, 2, 3, then the probability P of observing n_i in category i in a sample of size N = n_1 + n_2 + n_3 from a homogeneous population is given by

\[ P = \frac{N!}{n_1! n_2! n_3!}\, p_1^{n_1} p_2^{n_2} p_3^{n_3}, \]


and the maximum likelihood estimates, of which only two are independent, are

\[ \hat{p}_1 = n_1/N, \quad \hat{p}_2 = n_2/N, \quad \hat{p}_3 = 1 - \hat{p}_1 - \hat{p}_2. \]

The bivariate estimator is approximately normally distributed, when N is large, so that

\[ \begin{pmatrix} \hat{p}_1 \\ \hat{p}_2 \end{pmatrix} \sim MN_2\left( \begin{pmatrix} p_1 \\ p_2 \end{pmatrix}, \begin{pmatrix} p_1(1-p_1)/N & -p_1 p_2/N \\ -p_1 p_2/N & p_2(1-p_2)/N \end{pmatrix} \right), \]

where MN_2 signifies the bivariate normal distribution (page 287). Consequently

\[ \big((\hat{p}_1 - p_1), (\hat{p}_2 - p_2)\big) \begin{pmatrix} p_1(1-p_1)/N & -p_1 p_2/N \\ -p_1 p_2/N & p_2(1-p_2)/N \end{pmatrix}^{-1} \begin{pmatrix} \hat{p}_1 - p_1 \\ \hat{p}_2 - p_2 \end{pmatrix} \sim \chi^2_2, \]

and hence, with probability 95%,

\[ \frac{(\hat{p}_1 - p_1)^2}{p_1(1-p_1)} + \frac{(\hat{p}_2 - p_2)^2}{p_2(1-p_2)} + \frac{2(\hat{p}_1 - p_1)(\hat{p}_2 - p_2)}{(1-p_1)(1-p_2)} \le \frac{(1 - p_1 - p_2)}{N(1-p_1)(1-p_2)}\,\chi^2_{2;0.05}. \]

Such inequalities define regions in the (p_1, p_2) parameter space which can be examined for statistically significant differences between p_i(j) in samples from populations subjected to treatment j. Where regions are clearly disjoint, parameters have been significantly affected by the treatments. This plotting technique is illustrated on page 257.
3.9.11.16 Robust analysis of one sample

Robust techniques are required when samples are contaminated by the presence of outliers, that is, observations that are not typical of the underlying distribution. Such observations can be caused by experimental accidents, such as pipetting enzyme aliquots twice into an assay instead of once, or by data recording mistakes, such as entering a value with a misplaced decimal point into a data table, but they can also occur because of additional stochastic components such as contaminated petri dishes or sample tubes. Proponents of robust techniques argue that extreme observations should always be down-weighted, as observations in the tails of distributions can seriously bias parameter estimates; detractors argue that it is scientifically dishonest to discard experimental observations, unless the experimentalists have independent grounds for suspecting particular observations. Table 3.94 illustrates the analysis of robust.tf1. These data are for normal.tf1 but with five outliers, analyzed first by the exhaustive analysis of a vector procedure (page 79), then by the robust parameter estimates procedure. It should be noted that the Shapiro-Wilks test rejects normality and the robust estimators give much better parameter estimates in this case. If the sample vector is x_1, x_2, ..., x_n the following calculations are done.

- Using the whole sample and the inverse normal function Phi^{-1}(.), the median M, median absolute deviation D and a robust estimate of the standard deviation S are calculated as

\[ M = \text{median}(x_i) \]
\[ D = \text{median}(|x_i - M|) \]
\[ S = D/\Phi^{-1}(0.75). \]

- The percentage of the sample chosen by users to be eliminated from each of the tails is 100*alpha%; then the trimmed mean TM, and Winsorized mean WM, together with variance estimates VT and VW, are


Procedure 1: Exhaustive analysis of vector
Data: 50 N(0,1) random numbers with 5 outliers
Sample mean               = 5.124E-01
Sample standard deviation = 1.853E+00: CV% = 361.736%
Shapiro-Wilks W statistic = 8.506E-01
Significance level for W  = 0.0000  Reject normality at 1% sig.level

Procedure 2: Robust 1-sample analysis
Total sample size            = 50
Median value                 = 2.0189E-01
Median absolute deviation    = 1.0311E+00
Robust standard deviation    = 1.5288E+00
Trimmed mean (TM)            = 2.2267E-01
Variance estimate for TM     = 1.9178E-02
Winsorized mean (WM)         = 2.3260E-01
Variance estimate for WM     = 1.9176E-02
Number of discarded values   = 10
Number of included values    = 40
Percentage of sample used    = 80.00% (for TM and WM)
Hodges-Lehmann estimate (HL) = 2.5856E-01

Table 3.94: Robust analysis of one sample

calculated as follows, using k = [alpha*n] as the integer part of alpha*n:

\[ TM = \frac{1}{n - 2k}\sum_{i=k+1}^{n-k} x_i \]
\[ WM = \frac{1}{n}\left[\sum_{i=k+1}^{n-k} x_i + k x_{k+1} + k x_{n-k}\right] \]
\[ VT = \frac{1}{n^2}\left[\sum_{i=k+1}^{n-k} (x_i - TM)^2 + k(x_{k+1} - TM)^2 + k(x_{n-k} - TM)^2\right] \]
\[ VW = \frac{1}{n^2}\left[\sum_{i=k+1}^{n-k} (x_i - WM)^2 + k(x_{k+1} - WM)^2 + k(x_{n-k} - WM)^2\right]. \]

- If the assumed sample density is symmetrical, the Hodges-Lehmann location estimator HL can be used to estimate the center of symmetry. This is

\[ HL = \text{median}\left\{\frac{x_i + x_j}{2},\; 1 \le i \le j \le n\right\}, \]

and it is calculated along with 95% confidence limits. This would be useful if the sample was a vector of differences between two samples X and Y for a Wilcoxon signed rank test (page 97) that X is distributed as F(x) and Y is distributed as F(x - theta).
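The estimators defined in the list above are straightforward to compute; here is a minimal Python sketch (the sample is simulated with five outliers, in the spirit of robust.tf1 but not reproducing it).

import numpy as np
from scipy.stats import norm

def robust_summary(x, alpha=0.10):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    M = np.median(x)
    D = np.median(np.abs(x - M))
    S = D / norm.ppf(0.75)                              # robust standard deviation
    k = int(alpha * n)                                  # integer part of alpha*n
    core = x[k:n - k]
    TM = core.mean()                                    # trimmed mean
    WM = (core.sum() + k * x[k] + k * x[n - k - 1]) / n # Winsorized mean
    pairs = [(x[i] + x[j]) / 2 for i in range(n) for j in range(i, n)]
    HL = np.median(pairs)                               # Hodges-Lehmann estimator
    return M, D, S, TM, WM, HL

rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(0, 1, 45), [8, 9, 10, 11, 12]])
print(robust_summary(sample))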
3.9.11.17 Robust analysis of two samples

Table 3.95 illustrates the analysis of ttest.tf4 and ttest.tf5, used earlier for a Mann-Whitney U test (page 96). The procedure is based on the assumption that X of size n_x is distributed as F(x) and Y of size n_y as F(x - theta), so an estimate theta-hat for the difference in location is calculated as

\[ \hat{\theta} = \text{median}(y_j - x_i,\; i = 1, 2, \ldots, n_x,\; j = 1, 2, \ldots, n_y). \]


X-sample size               = 12
Y-sample size               = 7
Difference in location      = -1.8501E+01
Lower confidence limit      = -4.0009E+01
Upper confidence limit      = 2.9970E+00
Percentage confidence limit = 95.30%
Lower Mann-Whitney U-value  = 1.9000E+01
Upper Mann-Whitney U-value  = 6.6000E+01

Table 3.95: Robust analysis of two samples

100(1 - alpha)% confidence limits U_L and U_H are then estimated by inverting the Mann-Whitney U statistic so that

\[ P(U \le U_L) \le \alpha/2, \quad P(U \le U_L + 1) > \alpha/2, \]
\[ P(U \ge U_H) \le \alpha/2, \quad P(U \ge U_H - 1) > \alpha/2. \]

3.9.11.18 Indices of diversity

It is often required to estimate the entropy or degree of randomness in the distribution of observations into categories. For instance, in ecology several indices of diversity are used, as illustrated in table 3.96 for two extreme cases.

Data: 5,5,5,5
Number of groups        = 4
Total sample size       = 20
Pielou J-prime evenness = 1.0000 [complement = 0.0000]
Brillouin J evenness    = 1.0000 [complement = 0.0000]
Shannon H-prime         = 6.021E-01(log10)  1.386E+00(ln)  2.000E+00(log2)
Brillouin H             = 5.035E-01(log10)  1.159E+00(ln)  1.672E+00(log2)
Simpson lambda          = 0.2500 [complement = 0.7500]
Simpson lambda-prime    = 0.2105 [complement = 0.7895]

Data: 1,1,1,17
Number of groups        = 4
Total sample size       = 20
Pielou J-prime evenness = 0.4238 [complement = 0.5762]
Brillouin J evenness    = 0.3809 [complement = 0.6191]
Shannon H-prime         = 2.551E-01(log10)  5.875E-01(ln)  8.476E-01(log2)
Brillouin H             = 1.918E-01(log10)  4.415E-01(ln)  6.370E-01(log2)
Simpson lambda          = 0.7300 [complement = 0.2700]
Simpson lambda-prime    = 0.7158 [complement = 0.2842]

Table 3.96: Indices of diversity

Given positive integer frequencies f_i > 0 in k > 1 groups with n observations in total, proportions p_i = f_i/n can be defined, leading to the Shannon H', Brillouin H, and Simpson lambda and lambda' indices,


and the evenness parameters J and J' defined as follows:

\[ \text{Shannon diversity } H' = -\sum_{i=1}^{k} p_i \log p_i = \Big[n \log n - \sum_{i=1}^{k} f_i \log f_i\Big]/n \]
\[ \text{Pielou evenness } J' = H'/\log k \]
\[ \text{Brillouin diversity } H = \Big[\log n! - \sum_{i=1}^{k} \log f_i!\Big]/n \]
\[ \text{Brillouin evenness } J = nH/[\log n! - (k - d)\log c! - d\log(c+1)!] \]
\[ \text{Simpson lambda } \lambda = \sum_{i=1}^{k} p_i^2 \]
\[ \text{Simpson lambda prime } \lambda' = \sum_{i=1}^{k} f_i(f_i - 1)/[n(n-1)], \]

where c = [n/k] and d = n - ck. Note that H' and H are given using logarithms to bases ten, e, and two, while the forms J' and J have been normalized by dividing by the corresponding maximum diversity and so are independent of the base. The complements 1 - J', 1 - J, 1 - lambda, and 1 - lambda' are also tabulated within the square brackets. In table 3.96 we see that evenness is maximized when all categories are equally occupied, so that p_i = 1/k and H' = log k, and is minimized when one category dominates.
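The definitions above are easy to verify numerically; this sketch reproduces the natural-log entries of table 3.96 for the two extreme data sets.

import numpy as np
from math import lgamma, log

def diversity(f):
    f = np.asarray(f, dtype=float)
    n, k = f.sum(), len(f)
    p = f / n
    H_prime = -(p * np.log(p)).sum()                           # Shannon H'
    H = (lgamma(n + 1) - sum(lgamma(fi + 1) for fi in f)) / n  # Brillouin H
    c, d = int(n // k), int(n - k * (n // k))
    Jmax = (lgamma(n + 1) - (k - d) * lgamma(c + 1) - d * lgamma(c + 2)) / n
    lam = (p**2).sum()                                         # Simpson lambda
    lam_prime = (f * (f - 1)).sum() / (n * (n - 1))            # Simpson lambda'
    return H_prime, H_prime / log(k), H, H / Jmax, lam, lam_prime

print(diversity([5, 5, 5, 5]))    # H' = 1.386, J' = 1.0000, H = 1.159, J = 1.0000, ...
print(diversity([1, 1, 1, 17]))   # H' = 0.5875, J' = 0.4238, H = 0.4415, J = 0.3809, ...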

3.9.11.19 Standard and non-central distributions

SIMFIT uses discrete (page 283) and continuous (page 285) distributions for modelling and hypothesis tests, and the idea behind this procedure is to provide the option to plot and obtain percentage points for the standard statistical distributions to replace table look-up. However, you can also obtain values for the distribution functions, given the arguments, for the non-central t, beta, chi-square or F distributions (page 291), or you can plot graphs, which are very useful for advanced studies in the calculation of power as a function of sample size. Figure 3.39 illustrates the chi-square distribution with 10 degrees of freedom for non-centrality parameter lambda at values of 0, 5, 10, 15, and 20.

3.9.11.20 Cooperativity analysis

The binding of ligands to receptors can be defined in terms of a binding polynomial p(x) in the free ligand activity x, as follows:

\[ p(x) = 1 + K_1 x + K_2 x^2 + \cdots + K_n x^n \]
\[ = 1 + A_1 x + A_1 A_2 x^2 + \cdots + \Big(\prod_{i=1}^{n} A_i\Big) x^n \]
\[ = 1 + \binom{n}{1} B_1 x + \binom{n}{2} B_1 B_2 x^2 + \cdots + \Big(\prod_{i=1}^{n} B_i\Big) x^n, \]

where the only difference between these alternative expressions concerns the meaning and interpretation of the binding constants. The fractional saturation is just the scaled derivative of the log of the polynomial with respect to log(x). If the binding polynomial has all real factors, then the fractional saturation y as a function of free ligand is indistinguishable from independent high/low affinity sites or uniformly negative cooperativity with Hill slope H everywhere less than or equal to unity. To see this, observe that for a set of m groups of


[Figure 3.39: Noncentral chi-square distribution. Distribution functions for 10 degrees of freedom with lambda = 0, 5, 10, 15, 20.]

receptors, each with n_i independent binding sites and binding constant k_i, then

\[ p(x) = \prod_{i=1}^{m} (1 + k_i x)^{n_i}, \quad\text{and}\quad y = \frac{1}{\sum_{i=1}^{m} n_i} \sum_{i=1}^{m} \frac{n_i k_i x}{1 + k_i x}, \]

so y is just the sum of simple binding curves, giving concave down double reciprocal plots, etc. However, if the binding polynomial has complex conjugate zeros, the Hill slope may exceed unity and there may be evidence of positive cooperativity. The way to quantify the sign of cooperativity is to fit the appropriate order n saturation function f(x) to the binding data, i.e.,

\[ f(x) = Zy + C, \quad\text{where } y = \frac{1}{n}\frac{d\log(p(x))}{d\log(x)}, \]

to determine the binding constants, where Z accounts for proportionality between site occupation and response, and C is a background constant. Note that the Hill slope cannot exceed the Hill slope of any of the factors of the binding polynomial, so further calculations are required to see if the binding data show evidence of positive or negative cooperativity. Program sffit outputs the binding constant estimates in all the conventions and, when n > 2, it also outputs the zeros of the best fit binding polynomial and those of the Hessian of the binding polynomial h(x), defined as

\[ h(x) = n p(x) p''(x) - (n - 1) p'(x)^2, \]

since it is at positive zeros of the Hessian that cooperativity changes take place. This is because the Hill slope


H is the derivative of the log odds with respect to chemical potential, i.e.,

\[ H = \frac{d\log[y/(1-y)]}{d\log(x)} = 1 + \frac{x\,h(x)}{p'(x)\,(n p(x) + x p'(x))}, \]

and positive zeros of h(x) indicate points where the theoretical one-site binding curve coinciding with the actual saturation curve at that x value has the same slope as the higher order saturation curve, which are therefore points of cooperativity change. The SIMFIT cooperativity procedure allows users to input binding constant estimates retrospectively to calculate zeros of the binding polynomial and Hessian, and also to plot species fractions (page 278). The species fractions s_i, which are defined for i = 0, 1, ..., n as

\[ s_i = \frac{K_i x^i}{K_0 + K_1 x + K_2 x^2 + \cdots + K_n x^n} \]

with K_0 = 1, are interpreted as the proportions of the receptor in the various states of ligation as a function of ligand activity. The species fractions can also be used in a probability model to interpret ligand binding in several interesting ways. For this purpose, consider a random variable U representing the number of ligands bound, so that s_i is the probability of a receptor existing in a state with i ligands bound. Then the probability mass function, expected values and variance are

\[ P(U = i) = s_i \quad (i = 0, 1, 2, \ldots, n), \]
\[ E(U) = \sum_{i=0}^{n} i\,s_i, \quad E(U^2) = \sum_{i=0}^{n} i^2 s_i, \]
\[ V(U) = E(U^2) - [E(U)]^2 = x\left[\frac{p'(x) + x p''(x)}{p(x)}\right] - \left[\frac{x p'(x)}{p(x)}\right]^2 = n\frac{dy}{d\log x}, \]

since the fractional saturation y is simply E(U)/n. In other words, the slope of a semi-logarithmic plot of fractional saturation data indicates the variance of the number of occupied sites, namely: all unoccupied when x = 0, a distribution with variance increasing as a function of x up to the maximum semi-log plot slope, then finally approaching all sites occupied as x tends to infinity. To practise with this procedure, input some binding constants, say 1, 2, 4, 16, and observe how the binding constants are mapped into all spaces, cooperativity coefficients are calculated, zeros of the binding polynomial and Hessian are estimated where appropriate, the Hill slope is reported, and species fractions and transformed binding isotherms are displayed. As mentioned, this is done automatically after every high degree fit by program sffit.
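The species fractions and E(U)/n are simple to evaluate; the sketch below uses the suggested constants 1, 2, 4, 16, interpreting them as the overall association constants K_1, ..., K_4 with K_0 = 1 (one of the conventions above).

import numpy as np

K = np.array([1.0, 1.0, 2.0, 4.0, 16.0])       # K_0, K_1, ..., K_4

def species_fractions(K, x):
    powers = np.array([K[i] * x**i for i in range(len(K))])
    return powers / powers.sum(axis=0)         # s_i(x), summing to one over i

x = np.logspace(-2, 2, 5)                      # ligand activities
s = species_fractions(K, x)
E_U = (np.arange(len(K))[:, None] * s).sum(axis=0)
print("fractional saturation y =", E_U / (len(K) - 1))   # y = E(U)/n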
3.9.11.21 Generating random numbers, permutations and Latin squares

In the design of experiments it is frequently necessary to generate sequences of pseudo-random numbers, or random permutations. For instance, assigning patients randomly to groups requires that a consecutive list of integers, names or letters be scrambled, while any ANOVA based on Latin squares should employ randomly generated Latin squares. SIMFIT will generate sequences of random numbers and permutations for these purposes. For example, all possible 4 by 4 Latin squares can be generated by random permutation of the rows and columns of the four basic designs shown in Table 3.97. Higher order designs that are sufficiently random for most purposes can be generated by random permutations of the rows and columns of a default n by n matrix with sequentially shifted entries of the type shown in Table 3.98, for a possible 7 by 7 starting matrix, although this will not generate all possible examples for n > 4. Note that program rannum provides many more options for generating random numbers.


A B C D     A B C D     A B C D     A B C D
B A D C     B C D A     B D A C     B A D C
C D B A     C D A B     C A D B     C D A B
D C A B     D A B C     D C B A     D C B A

Table 3.97: Latin squares: 4 by 4 random designs

A B C D E F G
B C D E F G A
C D E F G A B
D E F G A B C
E F G A B C D
F G A B C D E
G A B C D E F

Table 3.98: Latin squares: higher order random designs
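The permutation construction described above takes only a few lines of Python: build the cyclic design of Table 3.98, then permute rows and columns randomly (sufficiently random for most purposes, though not uniform over all Latin squares for n > 4).

import numpy as np

def random_latin_square(n, rng=None):
    rng = rng or np.random.default_rng()
    base = (np.arange(n)[:, None] + np.arange(n)) % n   # cyclic starting design
    return base[rng.permutation(n)][:, rng.permutation(n)]

letters = np.array(list("ABCDEFG"))
print(letters[random_latin_square(7)])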

3.9.11.22 Kernel density estimation

This technique is used to create a numerical approximation to the density function given a random sample of observations for which there is no known density. Figure 3.40 illustrates the result when this was done with

[Figure 3.40: Kernel density estimation. Histograms with 5 bins (top row) and 10 bins (bottom row), each shown with the corresponding density estimate and cumulative distribution.]

data in the test file normal.tf1, using 5 bins for the histogram in the top row of figures, but using 10 bins for the histogram in the bottom row. Changing the number of bins k alters the density estimate since, given a sample of n observations x_1, x_2, ..., x_n with A <= x_i <= B, the Gaussian kernel density estimate f-hat(x) is defined


as

\[ \hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right), \]
\[ \text{where } K(t) = \frac{1}{\sqrt{2\pi}}\exp(-t^2/2), \text{ and } h = (B - A)/(k - 2). \]

Also, note the following details.

- The calculation involves a fast Fourier transform (FFT) using m equally spaced theoretical points A - 3h <= t_i <= B + 3h, and m can be increased interactively from the default value of 100, if necessary, for better representation of multi-modal profiles.

- The histograms shown on the left use k bins to contain the sample, and the height of each bin is the number of sample values in the bin interval divided by nh. The value of k can be changed interactively, and the dotted curves are the density estimates for the m values of t. The program generates additional empty bins outside the range set by the data to allow for tails. Hence the total area under the histogram is one, and the density estimate integrates to one between minus infinity and infinity.

- The sample cumulative distributions shown on the right have a vertical step of 1/n at each sample value, and so they increase stepwise from zero to one. The density estimates are integrated numerically to generate the theoretical cdf functions, which are shown as dashed curves. They will only attain an asymptote of one if the number of points m is sufficiently large to allow accurate integration, say >= 100.

- The density estimates are unique given the data, k and m, but they will only be meaningful if the sample size is fairly large, say >= 50, and the bins have a reasonable content, say n/k >= 10.

- The histogram, sample distribution, pdf estimate and cdf estimate can be saved to file by selecting the [Advanced] option then creating ASCII text coordinate files.

Clearly, a sensible window width h, as in the top row, generates a realistic density estimate, while using too many bins, as in the second row, leads to obvious over-fitting.
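The estimate itself does not require the FFT; a direct-sum Python sketch with the same window width convention gives the identical curve for small samples (the data here are simulated, not normal.tf1).

import numpy as np

def gaussian_kde(x, k=5, m=100):
    x = np.asarray(x, dtype=float)
    A, B = x.min(), x.max()
    h = (B - A) / (k - 2)                           # window width tied to bin count k
    t = np.linspace(A - 3 * h, B + 3 * h, m)        # theoretical points
    u = (t[:, None] - x[None, :]) / h
    f_hat = np.exp(-0.5 * u**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))
    return t, f_hat

rng = np.random.default_rng(2)
t, f_hat = gaussian_kde(rng.normal(size=50), k=5)
print(np.trapz(f_hat, t))   # approximately 1: the estimate integrates to one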


3.9.12 Numerical analysis


In data analysis it is frequently necessary to perform calculations rather than tests, e.g., calculating the determinant, eigenvalues, or singular values of a matrix to check for singularity.
3.9.12.1 Zeros of a polynomial of degree n - 1

Every real polynomial f(x) of degree n - 1 with n coefficients A_i can be represented in either coefficient or factored form, that is

\[ f(x) = A_1 x^{n-1} + A_2 x^{n-2} + \cdots + A_{n-1} x + A_n \]
\[ = A_1 (x - \alpha_1)(x - \alpha_2)\cdots(x - \alpha_{n-1}), \]

where alpha_i for i = 1, 2, ..., n - 1 are the n - 1 zeros, i.e., roots of the equation f(x) = 0, and these may be non-real and repeated. In data analysis it is frequently useful to calculate the n - 1 zeros alpha_i of a polynomial given the n coefficients A_i, and this SIMFIT polynomial solving procedure performs the necessary iterative calculations. However, note that such calculations are notoriously difficult for repeated zeros and high degree polynomials. Table 3.99 illustrates a calculation for the roots of the fourth degree polynomial

Zeros of f(x) = A(1)x^(n-1) + A(2)x^(n-2) + ... + A(n)

Coefficients:                        Zeros (Real Part, Imaginary Part):
A(1) =  1.0000E+00                    0.0000E+00   -1.0000E+00i
A(2) =  0.0000E+00                    0.0000E+00    1.0000E+00i
A(3) =  0.0000E+00                   -1.0000E+00
A(4) =  0.0000E+00                    1.0000E+00
A(5) = -1.0000E+00 (constant term)

Table 3.99: Zeros of a polynomial

\[ f(x) = x^4 - 1 = (x - i)(x + i)(x - 1)(x + 1), \]

which are i, -i, 1, and -1. Be careful to note when using this procedure that the sequential elements of the input vector must be the polynomial coefficients in order of decreasing degree. Zeros of nonlinear functions of one or several variables are sometimes required, and these can be estimated using usermod.
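The same convention (coefficients in order of decreasing degree) is used by numpy, so table 3.99 can be reproduced in one line.

import numpy as np

A = [1.0, 0.0, 0.0, 0.0, -1.0]      # A(1)x^4 + ... + A(5), i.e. x^4 - 1
print(np.roots(A))                  # approximately i, -i, 1, -1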
3.9.12.2 Determinants, inverses, eigenvalues, and eigenvectors

Table 3.100 illustrates an analysis of data in the test file matrix.tf1 to calculate parameters such as the determinant, inverse, eigenvalues, and eigenvectors, that are frequently needed when studying design matrices. Note that the columns of eigenvectors correspond in sequence to the eigenvalues, but with non-real eigenvalues the corresponding adjacent columns correspond to real and imaginary parts of the complex conjugate eigenvectors. Thus, in the case of eigenvalue 1, i.e. 38.861, column 1 is the eigenvector, while for eigenvalue 2, i.e. -2.7508 + 7.2564i, eigenvector 2 has column 2 as real part and column 3 as imaginary part. Similarly, for eigenvalue 3, i.e. -2.7508 - 7.2564i, eigenvector 3 has column 2 as real part and minus column 3 as imaginary part. Note that with SIMFIT matrix calculations the matrices will usually be written just once to the results file for relatively small matrices if the option to display is selected, but options are also provided to save matrices to file.

Value of the determinant = 4.4834E+04

Values for the current square matrix are as follows:
 1.2000E+00  4.5000E+00  6.1000E+00  7.2000E+00  8.0000E+00
 3.0000E+00  5.6000E+00  3.7000E+00  9.1000E+00  1.2500E+01
 1.7100E+01  2.3400E+01  5.5000E+00  9.2000E+00  3.3000E+00
 7.1500E+00  5.8700E+00  9.9400E+00  8.8200E+00  1.0800E+01
 1.2400E+01  4.3000E+00  7.7000E+00  8.9500E+00  1.6000E+00

Values for the current inverse are as follows:
-2.4110E-01  6.2912E-02  4.4392E-04  1.0123E-01  2.9774E-02
 8.5853E-02 -4.4069E-02  5.2548E-02 -1.9963E-02 -5.8600E-02
 1.1818E-01 -1.7354E-01 -5.5370E-03  1.1957E-01 -3.0760E-02
 2.2291E-01  6.7828E-02 -1.9731E-02 -2.5804E-01  1.3802E-01
-1.7786E-01  8.6634E-02 -7.6447E-03  1.3711E-01 -7.2265E-02

Eigenvalues:   Real Part     Imaginary Part
               3.8861E+01    0.0000E+00
              -8.3436E+00    0.0000E+00
              -2.7508E+00    7.2564E+00
              -2.7508E+00   -7.2564E+00
              -2.2960E+00    0.0000E+00

Eigenvector columns (real parts only)
 3.1942E-01 -3.4409E-01 -1.3613E-01 -1.3613E-01 -3.5398E-01
 3.7703E-01 -7.1958E-02 -5.0496E-02 -5.0496E-02  6.2282E-02
 6.0200E-01  7.8212E-01  8.0288E-01  8.0288E-01 -1.3074E-01
 4.8976E-01 -4.4619E-01 -2.6270E-01 -2.6270E-01  7.8507E-01
 3.9185E-01  2.5617E-01 -2.1156E-01 -2.1156E-01 -4.8722E-01

Eigenvector columns (imaginary parts only)
 0.0000E+00  0.0000E+00 -7.5605E-02  7.5605E-02  0.0000E+00
 0.0000E+00  0.0000E+00  3.9888E-01 -3.9888E-01  0.0000E+00
 0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
 0.0000E+00  0.0000E+00 -1.9106E-01  1.9106E-01  0.0000E+00
 0.0000E+00  0.0000E+00 -1.3856E-01  1.3856E-01  0.0000E+00

Table 3.100: Matrix example 1: Determinant, inverse, eigenvalues, eigenvectors
3.9.12.3 Singular value decomposition

Table 3.101 shows results from a singular value decomposition of data in f08kff.tf2. Analysis of your own design matrix should be carried out in this way if there are singularity problems due to badly designed experiments, e.g., with independent variables equal to 0 or 1 for binary variables such as female and male.


If your design matrix has m rows and n columns, with m > n, there should be n nonzero singular values. Otherwise only linear combinations of parameters will be estimable. Actually, many statistical techniques, such as multilinear regression, or principal components analysis, are applications of singular value decompositions. Now, given any m by n matrix A, the SVD procedure calculates a singular value decomposition in the form

\[ A = U\Sigma V^T, \]

where U is an m by m orthonormal matrix of left singular vectors, V is an n by n orthonormal matrix of right singular vectors, and Sigma is an m by n diagonal matrix, Sigma = diag(sigma_1, ..., sigma_n) with sigma_i >= 0. However, note that SIMFIT can display the matrices U, Sigma, or V^T, or write them to file, but superfluous rows and columns of zeros are suppressed in the output. Another point to note is that, whereas the singular values are uniquely determined, the left and right singular vectors are only pairwise determined up to sign, i.e., corresponding pairs can be multiplied by -1.
3.9.12.4 LU factorization of a matrix, norms and condition numbers

Table 3.102 illustrates the LU factorization of the matrix A in matrix.tf1, displayed previously in table 3.100, along with the vector of row pivot indices corresponding to the pivot matrix P in the factorization A = PLU . As the LU representation is of interest in the solution of linear equations, this procedure also calculates the matrix norms and condition numbers needed to assess the sensitivity of the solutions to perturbations


Current matrix:
-5.70000E-01 -1.28000E+00 -3.90000E-01  2.50000E-01
-1.93000E+00  1.08000E+00 -3.10000E-01 -2.14000E+00
 2.30000E+00  2.40000E-01  4.00000E-01 -3.50000E-01
-1.93000E+00  6.40000E-01 -6.60000E-01  8.00000E-02
 1.50000E-01  3.00000E-01  1.50000E-01 -2.13000E+00
-2.00000E-02  1.03000E+00 -1.43000E+00  5.00000E-01

Index  Sigma(i)     Fraction  Cumulative  Sigma(i)^2   Fraction  Cumulative: rank = 4
1      3.99872E+00  0.4000    0.4000      1.59898E+01  0.5334    0.5334
2      3.00052E+00  0.3002    0.7002      9.00310E+00  0.3003    0.8337
3      1.99671E+00  0.1998    0.9000      3.98686E+00  0.1330    0.9666
4      9.99941E-01  0.1000    1.0000      9.99882E-01  0.0334    1.0000

Right singular vectors by row (V-transpose)
 8.25146E-01 -2.79359E-01  2.04799E-01  4.46263E-01
-4.53045E-01 -2.12129E-01 -2.62209E-01  8.25226E-01
-2.82853E-01 -7.96096E-01  4.95159E-01 -2.02593E-01
 1.84064E-01 -4.93145E-01 -8.02572E-01 -2.80726E-01

Left singular vectors by column (U)
-2.02714E-02  2.79395E-01  4.69005E-01  7.69176E-01
-7.28415E-01 -3.46414E-01 -1.69416E-02 -3.82903E-02
 4.39270E-01 -4.95457E-01 -2.86798E-01  8.22225E-02
-4.67847E-01  3.25841E-01 -1.53556E-01 -1.63626E-01
-2.20035E-01 -6.42775E-01  1.12455E-01  3.57248E-01
-9.35234E-02  1.92680E-01 -8.13184E-01  4.95724E-01

Table 3.101: Matrix example 2: Singular value decomposition

Matrix 1-norm = 4.3670E+01, Condition no. = 3.6940E+01
Matrix I-norm = 5.8500E+01, Condition no. = 2.6184E+01

Lower triangular/trapezoidal L where A = PLU
1.0000E+00
7.2515E-01  1.0000E+00
7.0175E-02 -2.2559E-01  1.0000E+00
1.7544E-01 -1.1799E-01  4.8433E-01  1.0000E+00
4.1813E-01  3.0897E-01  9.9116E-01 -6.3186E-01  1.0000E+00

Upper triangular/trapezoidal U where A = PLU
1.7100E+01  2.3400E+01  5.5000E+00  9.2000E+00  3.3000E+00
           -1.2668E+01  3.7117E+00  2.2787E+00 -7.9298E-01
                         6.5514E+00  7.0684E+00  7.5895E+00
                                     4.3314E+00  8.1516E+00
                                                 7.2934E+00

Row pivot indices equivalent to P where A = PLU
3 5 3 5 5

Table 3.102: Matrix example 3: LU factorization and condition number

when the matrix is square. Given a vector norm ||.||, a matrix A, and the set of vectors x where ||x|| = 1, the


matrix norm subordinate to the vector norm is

\[ \|A\| = \max_{\|x\|=1} \|Ax\|. \]

For an m by n matrix A, the three most important norms are

\[ \|A\|_1 = \max_{1 \le j \le n}\Big(\sum_{i=1}^{m} |a_{ij}|\Big), \]
\[ \|A\|_2 = \big(\lambda_{\max}|A^T A|\big)^{1/2}, \]
\[ \|A\|_\infty = \max_{1 \le i \le m}\Big(\sum_{j=1}^{n} |a_{ij}|\Big), \]

so that the 1-norm is the maximum absolute column sum, the 2-norm is the square root of the largest eigenvalue of A^T A, and the infinity norm is the maximum absolute row sum. The condition numbers estimated are

\[ \kappa_1(A) = \|A\|_1 \|A^{-1}\|_1 \]
\[ \kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty = \kappa_1(A^T), \]

which satisfy kappa_1 >= 1 and kappa_inf >= 1, and they are included in the tabulated output unless A is singular, when they are infinite. For a perturbation delta-b to the right hand side of a linear system with m = n we have

\[ Ax = b, \quad A(x + \delta x) = b + \delta b, \quad \frac{\|\delta x\|}{\|x\|} \le \kappa(A)\frac{\|\delta b\|}{\|b\|}, \]

while a perturbation delta-A to the matrix A leads to

\[ (A + \delta A)(x + \delta x) = b, \quad \frac{\|\delta x\|}{\|x + \delta x\|} \le \kappa(A)\frac{\|\delta A\|}{\|A\|}, \]

and, for complete generality,

\[ (A + \delta A)(x + \delta x) = b + \delta b, \quad \frac{\|\delta x\|}{\|x\|} \le \frac{\kappa(A)}{1 - \kappa(A)\|\delta A\|/\|A\|}\left(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|}\right), \]

provided kappa(A)||delta-A||/||A|| < 1. These inequalities estimate bounds for the relative error in computed solutions of linear equations, so that a small condition number indicates a well-conditioned problem, a large condition number indicates an ill-conditioned problem, while an infinite condition number indicates a singular matrix and no solution. To a rough approximation: if the condition number is 10^k and computation involves n-digit precision, then the computed solution will have about (n - k)-digit precision.
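These norms and condition numbers are available directly in numpy, so the rough precision rule is easy to explore; the 2 by 2 matrix below is purely illustrative.

import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])
for p in (1, 2, np.inf):
    print(p, np.linalg.norm(A, p), np.linalg.cond(A, p))
# If kappa(A) is about 10^k, roughly k digits are lost: a perturbation of b
# changes x = inv(A) b by a relative amount bounded by kappa(A)*||db||/||b||.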
3.9.12.5 QR factorization of a matrix

Table 3.103 illustrates the QR factorization of data in matrix.tf2. This involves factorizing an n by m matrix A as in

\[ A = QR \quad\text{when } n = m, \]
\[ A = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}\begin{pmatrix} R \\ 0 \end{pmatrix} \quad\text{when } n > m, \]
\[ A = Q\,(R_1\; R_2) \quad\text{when } n < m, \]

where Q is an n by n orthogonal matrix and R is either upper triangular or upper trapezoidal. You can display or write to file the matrices Q, Q_1, R, or R_1.


The orthogonal matrix Q1
-1.0195E-01 -2.5041E-01  7.4980E-02  7.3028E-01  5.5734E-01
-3.9929E-01 -2.1649E-01  7.2954E-01 -2.9596E-01  2.7394E-01
-5.3861E-01  2.6380E-01 -3.2945E-01 -1.4892E-01  1.8269E-01
-3.1008E-01 -3.0018E-01 -1.5220E-01 -4.2044E-01  6.5038E-02
-2.8205E-01 -5.2829E-01  7.5783E-02  2.7828E-01 -6.9913E-01
-3.0669E-01 -3.1512E-01 -5.5196E-01  3.2818E-02  1.4906E-01
-5.1992E-01  5.9358E-01  1.4157E-01  3.1879E-01 -2.5637E-01

The upper triangular/trapezoidal matrix R
-1.1771E+01 -1.8440E+01 -1.3989E+01 -1.0803E+01 -1.5319E+01
            -6.8692E+00 -1.2917E-01 -4.5510E+00 -4.2543E+00
                         5.8895E+00 -3.4487E-01 -5.7542E-03
                                     8.6062E+00 -1.1373E+00
                                                 2.5191E+00

Table 3.103: Matrix example 4: QR factorization

3.9.12.6 Cholesky factorization of a positive-definite symmetric matrix

Table 3.104 shows how factorization of data in matrix.tf3 into a lower and upper triangular matrix can be achieved. Note that factorization as in A = RR^T will only succeed when the matrix A supplied is symmetric and positive-definite, as when A is a covariance matrix. In all other cases, error messages will be issued.

Current positive-definite symmetric matrix
 4.1600E+00 -3.1200E+00  5.6000E-01 -1.0000E-01
-3.1200E+00  5.0300E+00 -8.3000E-01  1.0900E+00
 5.6000E-01 -8.3000E-01  7.6000E-01  3.4000E-01
-1.0000E-01  1.0900E+00  3.4000E-01  1.1800E+00

Lower triangular R where A = R(RT)
 2.0396E+00
-1.5297E+00  1.6401E+00
 2.7456E-01 -2.4998E-01  7.8875E-01
-4.9029E-02  6.1886E-01  6.4427E-01  6.1606E-01

Upper triangular RT where A = R(RT)
 2.0396E+00 -1.5297E+00  2.7456E-01 -4.9029E-02
             1.6401E+00 -2.4998E-01  6.1886E-01
                         7.8875E-01  6.4427E-01
                                     6.1606E-01

Table 3.104: Matrix example 5: Cholesky factorization


3.9.12.7 Matrix multiplication

Given two matrices A and B, it is frequently necessary to form the product, or the product of the transposes, as an m by n matrix C, where m >= 1 and n >= 1. The options are

C = AB,       where A is m by k, and B is k by n,
C = A^T B,    where A is k by m, and B is k by n,
C = AB^T,     where A is m by k, and B is n by k,
C = A^T B^T,  where A is k by m, and B is n by k,

as long as k >= 1 and the dimensions of A and B are appropriate to form the product, as indicated. For instance, using the singular value decomposition routine just described, followed by multiplying the U, Sigma, and V^T matrices for the simple 4 by 3 matrix indicated shows that

\[ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1/\sqrt{6} & 0 & 1/\sqrt{2} \\ 0 & 1 & 0 \\ 1/\sqrt{6} & 0 & -1/\sqrt{2} \\ 2/\sqrt{6} & 0 & 0 \end{pmatrix} \begin{pmatrix} \sqrt{3} & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 0 & 1/\sqrt{2} \\ 0 & 1 & 0 \\ 1/\sqrt{2} & 0 & -1/\sqrt{2} \end{pmatrix}. \]
3.9.12.8 Evaluation of quadratic forms

Table 3.105 illustrates a special type of matrix multiplication that is frequently required, namely

\[ Q_1 = x^T A x \]
\[ Q_2 = x^T A^{-1} x \]

for a square n by n matrix A and a vector x of length n. In this case the data analyzed are in the test files matrix.tf3 and vector.tf3. The form Q_1 can always be calculated but the form Q_2 requires that the matrix A is positive-definite and symmetric, which is the case when A is a covariance matrix and a Mahalanobis distance is required.

Title of matrix A: 4 by 4 positive-definite symmetric matrix
Title of vector x: Vector with 4 components 1, 2, 3, 4
(xT)*A*x = 5.5720E+01

Title of matrix A: 4 by 4 positive-definite symmetric matrix
Title of vector x: Vector with 4 components 1, 2, 3, 4
(xT)*(A{-1})*x = 2.0635E+01

Table 3.105: Matrix example 6: Evaluation of quadratic forms
3.9.12.9 Solving Ax = b (full rank)

Table 3.106 shows the computed solution for

\[ Ax = b, \quad x = A^{-1}b, \]


Solution to Ax = b where the square matrix A is:
Matrix of dimension 5 by 5 (i.e. matrix.tf1)
and the vector b is:
Vector with 5 components 1, 2, 3, 4, 5 (i.e. vector.tf1)

rhs vector (b)    Solution (x)
1.0000E+00        4.3985E-01
2.0000E+00       -2.1750E-01
3.0000E+00        7.8960E-02
4.0000E+00       -4.2704E-02
5.0000E+00        1.5959E-01

Table 3.106: Solving Ax = b: square where the inverse of A exists

where A is the square matrix of table 3.100 and b is the vector 1, 2, 3, 4, 5. When the n by m matrix A is not square or is singular, an m by n pseudo inverse A^+ can be defined in terms of the QR factorization or singular value decomposition as

\[ A^+ = R^{-1} Q_1^T \quad\text{if } A \text{ has full column rank, or} \]
\[ A^+ = V \Sigma^+ U^T \quad\text{if } A \text{ is rank deficient,} \]

where the diagonal elements of Sigma^+ are reciprocals of the singular values.
3.9.12.10 Solving Ax = b (L1, L2, L-infinity norms)

L1-norm solution to Ax = b
 1.9514E+00
 4.2111E-01
-5.6336E-01
 4.3038E-02
-6.7286E-01
L1-norm objective function = 4.9252E+00

L2-norm solution to Ax = b
 1.2955E+00
 7.7603E-01
-3.3657E-01
 8.2384E-02
-9.8542E-01
The rank of A (from SVD) = 5
L2-norm objective function = 1.0962E+01

L_infinity norm solution to Ax = b
 1.0530E+00
 7.4896E-01
-2.7683E-01
 2.6139E-01
-9.7905E-01
L_infinity norm objective function = 1.5227E+00

Table 3.107: Solving Ax = b: overdetermined in 1, 2 and infinity norms


Table 3.107 illustrates the solutions of the overdetermined linear system Ax = b, where A is the 7 by 5 matrix of table 3.101 and b is the vector (1, 2, 3, 4, 5, 6, 7), i.e. the test files matrix.tf2 and vector.tf2. The solutions illustrated list the parameters that minimize the residual vector r = Ax - b corresponding to the three usual vector norms as follows.

- The 1-norm ||r||_1: this finds a possible solution such that the sum of the absolute values of the residuals is minimized. The solution is achieved by iteration from starting estimates provided, which can be all -1, all 0 (the usual first choice), all 1, all user-supplied, or chosen randomly from a uniform distribution on [-100, 100]. It may be necessary to scale the input data and experiment with starting estimates to locate a global best-fit minimum with difficult cases.

- The 2-norm ||r||_2: this finds the unique least squares solution that minimizes the Euclidean distance, i.e. the sum of squares of residuals.

- The infinity-norm ||r||_inf: this finds the solution that minimizes the largest absolute residual.
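All three problems can be solved in Python: least squares directly, and the 1- and infinity-norm problems rewritten as linear programs, an alternative to the iterative scheme SIMFIT uses for the 1-norm. The data here are simulated, not matrix.tf2.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
A, b = rng.normal(size=(7, 5)), np.arange(1.0, 8.0)
m, n = A.shape

x2 = np.linalg.lstsq(A, b, rcond=None)[0]          # minimizes ||Ax - b||_2

# L1: minimize sum(s) subject to -s <= Ax - b <= s, variables (x, s)
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([b, -b])
x1 = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + m)).x[:n]

# L-infinity: minimize s subject to -s <= Ax - b <= s, with a single s
c = np.concatenate([np.zeros(n), [1.0]])
A_ub = np.block([[A, -np.ones((m, 1))], [-A, -np.ones((m, 1))]])
xi = linprog(c, A_ub=A_ub, b_ub=np.concatenate([b, -b]),
             bounds=[(None, None)] * n + [(0, None)]).x[:n]
print(x1, x2, xi)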
3.9.12.11 The symmetric eigenvalue problem

Table 3.108 illustrates the solution for a symmetric eigenvalue problem, that is, nding the eigenvectors and Matrix A: 2.400E-01 3.900E-01 4.200E-01 -1.600E-01 3.900E-01 -1.100E-01 7.900E-01 6.300E-01 4.200E-01 7.900E-01 -2.500E-01 4.800E-01 -1.600E-01 6.300E-01 4.800E-01 -3.000E-02 Matrix B: 4.160E+00 -3.120E+00 5.600E-01 -1.000E-01 -3.120E+00 5.030E+00 -8.300E-01 1.090E+00 5.600E-01 -8.300E-01 7.600E-01 3.400E-01 -1.000E-01 1.090E+00 3.400E-01 1.180E+00 Eigenvalues...Case: Ax = lambda*Bx -2.2254E+00 -4.5476E-01 1.0008E-01 1.1270E+00 Eigenvectors by column...Case Ax = lambda*Bx -6.9006E-02 3.0795E-01 -4.4694E-01 -5.5279E-01 -5.7401E-01 5.3286E-01 -3.7084E-02 -6.7660E-01 -1.5428E+00 -3.4964E-01 5.0477E-02 -9.2759E-01 1.4004E+00 -6.2111E-01 4.7425E-01 2.5095E-01 Table 3.108: The symmetric eigenvalue problem eigenvalues for the system Ax = Bx,

where A and B are symmetric matrices of the same dimensions and, in addition, B is positive definite. In the case of table 3.108, the data for A are contained in test file matrix.tf4, while B is the matrix in matrix.tf3. It should be noted that the alternative problems ABx = λx and BAx = λx can also be solved and, in each case, the eigenvectors are available as the columns of a matrix X that is normalized so that

X^T B X = I, for Ax = λBx and ABx = λx,
X^T B^{-1} X = I, for BAx = λx.

Matrix A:
 2.400E-01  3.900E-01  4.200E-01 -1.600E-01
 3.900E-01 -1.100E-01  7.900E-01  6.300E-01
 4.200E-01  7.900E-01 -2.500E-01  4.800E-01
-1.600E-01  6.300E-01  4.800E-01 -3.000E-02

Matrix B:
 4.160E+00 -3.120E+00  5.600E-01 -1.000E-01
-3.120E+00  5.030E+00 -8.300E-01  1.090E+00
 5.600E-01 -8.300E-01  7.600E-01  3.400E-01
-1.000E-01  1.090E+00  3.400E-01  1.180E+00

Eigenvalues...Case: Ax = lambda*Bx
-2.2254E+00 -4.5476E-01  1.0008E-01  1.1270E+00

Eigenvectors by column...Case Ax = lambda*Bx
-6.9006E-02  3.0795E-01 -4.4694E-01 -5.5279E-01
-5.7401E-01  5.3286E-01 -3.7084E-02 -6.7660E-01
-1.5428E+00 -3.4964E-01  5.0477E-02 -9.2759E-01
 1.4004E+00 -6.2111E-01  4.7425E-01  2.5095E-01

Table 3.108: The symmetric eigenvalue problem
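This generalized symmetric-definite problem can be reproduced with scipy, whose eigh function accepts a second positive-definite matrix B and returns eigenvectors normalized so that X^T B X = I. The matrices below are matrix.tf4 and matrix.tf3 as displayed in table 3.108.

import numpy as np
from scipy.linalg import eigh

A = np.array([[ 0.24,  0.39,  0.42, -0.16],
              [ 0.39, -0.11,  0.79,  0.63],
              [ 0.42,  0.79, -0.25,  0.48],
              [-0.16,  0.63,  0.48, -0.03]])
B = np.array([[ 4.16, -3.12,  0.56, -0.10],
              [-3.12,  5.03, -0.83,  1.09],
              [ 0.56, -0.83,  0.76,  0.34],
              [-0.10,  1.09,  0.34,  1.18]])

lam, X = eigh(A, B)      # solves Ax = lambda*Bx
print(lam)               # -2.2254, -0.45476, 0.10008, 1.1270 as in table 3.108
print(X.T @ B @ X)       # approximately the identity matrix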


3.10 Areas, slopes, lag times and asymptotes


It frequently happens that measurements of a response y as a function of time t are made in order to measure an initial rate, a lag time, an asymptotic steady state rate, a horizontal asymptote or an area under the curve (AUC). Examples could be the initial rate of an enzyme catalyzed reaction or the transport of labelled solute out of loaded erythrocytes. Stated in equations we have the responses

y_i = f(t_i) + ε_i, i = 1, 2, . . . , n

given by a deterministic component plus a random error, and it is wished to measure the following limiting values:

the initial rate = df/dt at t = 0
the asymptotic slope = df/dt as t → ∞
the final asymptote = f as t → ∞
the AUC = ∫ f(t) dt over the range of interest.

There are numerous ways to make such estimates in SIMFIT and the method adopted depends critically on the type of experiment. Choosing the wrong technique can lead to biased estimates, so you should be quite clear which is the correct method for your particular requirements.

3.10.1 Models used by program inrate


The models used in this program are

f1 = Bt + C
f2 = At^2 + Bt + C
f3 = α[1 − exp(−βt)] + C
f4 = V t^n/(K^n + t^n) + C
f5 = Pt + Q[1 − exp(−Rt)] + C

and there are test files to illustrate each of these. It is usual to assume that f(t) is an increasing function of t with f(0) = 0, which is easily arranged by suitably transforming any initial rate data. For instance, if you have measured efflux of an isotope from vesicles you would analyze the rate of appearance in the external solute, that is, express your results as

f(t) = initial counts − counts at time t

so that f(t) increases from zero at time t = 0. All you need to remember is that, for any constant K,

d/dt {K − f(t)} = −df/dt.

However it is sometimes difficult to know exactly when t = 0, e.g., if the experiment involves quenching, so there exists an option to force the best fit curve to pass through the origin with some of the models if this is essential. The models available will now be summarized.

1. f1: This is used when the data are very close to a straight line and it can only measure initial rates.

2. f2: This adds a quadratic correction and is used when the data suggest only a slight curvature. Like the previous model it can only estimate initial rates.


3. f3: This model is used when the data rapidly bend to a horizontal asymptote in an exponential manner. It can be used to estimate initial rates and final horizontal asymptotes.

4. f4: This model can be used with n fixed (e.g., n = 1) for the Michaelis-Menten equation or with n varied (the Hill equation). It is not used for initial rates but is sometimes better for estimating final horizontal asymptotes than the previous model.

5. f5: This is the progress curve equation used in transient enzyme kinetics. It is used when the data have an initial lag phase followed by an asymptotic final steady state. It is not used to estimate initial rates, final horizontal asymptotes or AUC. However, it is very useful for experiments with cells or vesicles which require a certain time before attaining a steady state, and where it is wished to estimate both the length of the lag phase and the final steady state rate.

To understand these issues, see what happens with the test files. These are: models f1 and f2 with inrate.tf1, model f3 with inrate.tf2, model f4 with inrate.tf3, and model f5 with inrate.tf4.
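To see what fitting model f5 involves, here is a minimal sketch in Python with scipy, using simulated data in place of inrate.tf4; the initial rate and final steady state rate follow directly from the fitted parameters.

import numpy as np
from scipy.optimize import curve_fit

def f5(t, P, Q, R, C):
    # Progress curve with a lag phase and asymptote P*t + Q + C
    return P * t + Q * (1.0 - np.exp(-R * t)) + C

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 50)
y = f5(t, 1.0, -1.2, 0.8, 0.0) + rng.normal(0.0, 0.05, t.size)

(P, Q, R, C), cov = curve_fit(f5, t, y, p0=(1.0, -1.0, 1.0, 0.0))
print("initial rate df/dt at t = 0:", P + Q * R)
print("final steady state rate    :", P)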

3.10.2 Estimating initial rates using inrate


A useful method to estimate initial rates when the true deterministic equation is unknown is to fit the quadratic At^2 + Bt + C, in order to avoid the bias that would inevitably result from fitting a line to nonlinear data. Use inrate to fit the test file inrate.tf1, and note that, when the model has been fitted, it also estimates the slope at the origin. The reason for displaying the tangent in this way, as in figure 3.41, is to give you some idea of what is involved in extrapolating the best fit curve to the origin, so that you will not accept the estimated initial rate uncritically.

[Graph: Using INRATE to Determine Initial Rates, showing the data, the best fit quadratic, and the tangent at x = 0]
Figure 3.41: Fitting initial rates

3.10.3 Lag times and steady states using inrate


Use inrate to fit Pt + Q[1 − exp(−Rt)] + C to inrate.tf4 and observe that the asymptotic line is displayed in addition to the tangent at the origin, as in figure 3.42. However, sometimes a burst phase is appropriate, rather than a lag phase, as in figure 3.43.


[Graph: Using INRATE to Fit Lag Kinetics, showing the data, the best fit, the asymptote, and the tangent at x = 0]
Figure 3.42: Fitting lag times

[Graph: Using INRATE to Fit Burst Kinetics, showing the data, the best fit, the asymptote, and the tangent]
Figure 3.43: Fitting burst kinetics


3.10.4 Model-free fitting using compare


SIMFIT can fit arbitrary models, where the main interest is data smoothing by model-free or nonparametric techniques, rather than fitting mathematical models. For instance, polnom fits polynomials while calcurve fits splines for calibration. For now we shall use compare to fit the test files compare.tf1 and compare.tf2 as in figure 3.44, where the aim is to compare two data sets by model-free curve fitting using the automatic spline knot placement technique described on page 214.

[Graph: Data Smoothing by Cubic Splines, Y-values against X-values]
Figure 3.44: Model free curve fitting

Table 3.109 then summarizes the differences reported for the two curves shown in figure 3.44.

Area under curve 1 ( 2.50E-01 < x < 5.00E+00) (A1)    = 2.69E+00
Area under curve 2 ( 3.00E-01 < x < 5.50E+00) (A2)    = 2.84E+00
For window number 1: 3.00E-01 < x < 5.00E+00, y_min   = 0.00E+00
Area under curve 1 inside window 1 (B1)               = 2.69E+00
Area under curve 2 inside window 1 (B2)               = 2.63E+00
Integral of |curve1 - curve2| for the x_overlap (AA)  = 2.62E-01
For window number 2: 3.00E-01 < x < 5.00E+00, y_min   = 2.81E-02
Area under curve 1 inside window 2 (C1)               = 2.56E+00
Area under curve 2 inside window 2 (C2)               = 2.50E+00
Estimated percentage differences between the curves:
Over total range of x values: 100|A1 - A2|/(A1 + A2)  = 2.63 %
In window 1 (with a zero baseline): 100*AA/(B1 + B2)  = 4.92 %
In window 2 (with y_min baseline): 100*AA/(C1 + C2)   = 5.18 %

Table 3.109: Comparing two data sets

Program compare reads in one or two data sets, calculates means and standard errors of means from replicates, fits constrained splines, and then compares the two fits. You can change a smoothing factor until the fit is acceptable and you can use the spline coefficients for calculations, or store them for re-use by program spline. Using spline coefficients you can plot curve sections, estimate derivatives and areas, calculate the arc length and total absolute curvature of a curve, or characterize and compare data sets which do not conform to known mathematical models.


Comparing raw data sets with profiles as in figure 3.44 is complicated by the fact that there may be different numbers of observations, and observations may not have been made at the same x values. Program compare replaces a comparison of two data sets by a comparison of two best-fit curves, chosen by data smoothing. Two windows are defined by the data sets as well as a window of overlap, and these would be identical if both data sets had the same x-range. Perhaps the absolute area between the two curves over the range where the data sets overlap (AA) is the most useful parameter, which may be easier to interpret as a percentage. Note that, where data points or fitted curves have negative y values, areas are replaced by areas with respect to a baseline, in order to remove ambiguity and make areas positive over any window within the range set by the data extremes. The program also reports the areas calculated by the trapezoidal method, but the calculations reported in table 3.109 are based on numerical integration of the best-fit spline curves.

3.10.5 Estimating averages and AUC using deterministic equations


Observations y_i are often made at settings of a variable x_i as for a regression, but where the main aim is to determine the area under a best fit theoretical curve (AUC) rather than any best fit parameters. Frequently also y_i > 0, which is the case we now consider, so that there can be no ambiguity concerning the definition of the area under the curve. One example would be to determine the average value f_average of a function f(x) for α ≤ x ≤ β, defined as

f_average = (1/(β − α)) ∫_α^β f(u) du.

Another example is motivated by the practice of fitting an exponential curve in order to determine an elimination constant k by extrapolation, since

∫_0^∞ exp(−kt) dt = 1/k.

Yet again, given any arbitrary function g(x), where g(x) ≥ 0 for α ≤ x ≤ β, a probability density function f_T can always be constructed for a random variable T using

f_T(t) = g(t) / ∫_α^β g(u) du,

which can then be used to model residence times, etc. If the data do have a known form, then fitting an appropriate equation is probably the best way to estimate slopes and areas. For instance, in pharmacokinetics you can use program exfit to fit sums of exponentials and also estimate areas over the data range and AUC by extrapolation from zero to infinity, since

∫_0^∞ Σ_{i=1}^n A_i exp(−k_i t) dt = Σ_{i=1}^n A_i/k_i,

which is calculated as a derived parameter with associated standard error and confidence limits. Other deterministic equations can be fitted using program qnfit since, after this program has fitted the requested equation from the library or your own user-supplied model, you have the option to estimate slopes and areas using the current best-fit curve.
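The extrapolation formula is easy to verify numerically; a small sketch in Python with scipy quadrature, using arbitrary amplitudes and rate constants.

import numpy as np
from scipy.integrate import quad

A = np.array([2.0, 1.0])       # arbitrary amplitudes A_i
k = np.array([0.5, 2.0])       # arbitrary rate constants k_i

auc_exact = np.sum(A / k)      # sum of A_i/k_i = 4.5
auc_quad, err = quad(lambda t: np.sum(A * np.exp(-k * t)), 0.0, np.inf)
print(auc_exact, auc_quad)     # both give 4.5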

3.10.6 Estimating AUC using average


The main objection to using a deterministic equation to estimate the AUC stems from the fact that, if a badly fitting model is fitted, biased estimates for the areas will result. For this reason, it is frequently better to consider the observations y_i, or the average value of the observations if there are replicates, as knots with coordinates (x_i, y_i) defining a linear piecewise spline function. This can then be used to calculate the area for any sub-range (a, b), where A ≤ a ≤ b ≤ B.


[Graph: Trapezoidal Area Estimation, y-values against x-values, with a horizontal threshold line]
Figure 3.45: Trapezoidal method for areas/thresholds

To practise, read average.tf1 into program average and create a plot like figure 3.45. Another use for the trapezoidal technique is to calculate areas above or below a baseline, or fractions of the x range above and below a threshold, for example, to record the fraction of a certain time interval that a patient's blood pressure was above a baseline value. Note that, in figure 3.45, the baseline was set at y = 3.5, and program average calculates the points of intersection of the horizontal threshold with the linear spline in order to work out fractions of the x range above and below the baseline threshold. For further versatility, you can select the end points of interest, but of course it is not possible to extrapolate beyond the data range to estimate AUC from zero to infinity.
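The trapezoidal calculations performed by program average are straightforward to emulate; a minimal sketch in Python with numpy, using hypothetical data and the threshold y = 3.5 of figure 3.45.

import numpy as np

x = np.array([0.0, 2.0, 5.0, 8.0, 11.0, 15.0])   # hypothetical knots
y = np.array([1.0, 4.0, 6.0, 3.0, 5.0, 2.0])
t = 3.5                                          # baseline/threshold

auc = np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0   # trapezoidal area

above = 0.0        # length of the x range where the linear spline exceeds t
for i in range(len(x) - 1):
    y0, y1, dx = y[i] - t, y[i + 1] - t, x[i + 1] - x[i]
    if y0 >= 0.0 and y1 >= 0.0:
        above += dx
    elif y0 * y1 < 0.0:    # the spline crosses the threshold once in here
        cross = dx * abs(y0) / (abs(y0) + abs(y1))
        above += (dx - cross) if y0 < 0.0 else cross
print(auc, above / (x[-1] - x[0]))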


3.11 Spline smoothing


It often happens that a mathematical model is not available for a given data set because of one of these reasons.

- The data are too sparse or noisy to justify deterministic model fitting.
- The fundamental processes involved in generating the data set are unknown.
- The mathematical description of the data is too complex to warrant model fitting.
- The error structure of the data is unknown, so maximum likelihood cannot be invoked.
- The users merely want a smooth representation of the data to display trends, calculate derivatives or areas, or to use as a standard curve to predict x given y.

The traditional method to model such situations was to fit a polynomial

f(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n,

by weighted least squares, where the degree n was adjusted to obtain optimum fit. However, such a procedure is seldom adopted nowadays because of the realization that polynomials are too flexible, allowing over-fit, and because polynomials cannot fit the horizontal asymptotes that are often encountered in experimental data, e.g. growth curves, or dose response curves. Because of these restrictions, polynomials are only used in situations where data sets are monotonic and without horizontal asymptotes, and only local modelling is anticipated, with no expectation of meaningful extrapolation beyond the limits of the data. Where such model free fitting is required, then simple deterministic models, such as the exponential or logistic models, can often be useful. However, here the problem of systematic bias can be encountered, where the fixed curve shape of the simple model can lead to meaningless extrapolation or prediction. To circumvent this problem, piecewise cubic splines can be used, where a certain number of knot positions are prescribed and, between the knot positions, cubic polynomials are fitted, one between each knot, with the desirable property of identical function and derivative values at the knots. Here again it is necessary to impose additional constraints on the splines and knot placements, otherwise under- or over-fitting can easily result, particularly when the splines attempt to fit outliers, leading to undulating best fit curves. SIMFIT allows users to fit one of three types of spline curve.

1. Splines with user-defined fixed knots.
2. Splines with automatically calculated knots.
3. Splines chosen using cross validation.

Given n data values x, y, s and m knots, then each type of spline curve fitting technique minimizes an objective function involving the weighted sum of squares WSSQ given by

WSSQ = Σ_{i=1}^n [(y_i − f(x_i))/s_i]^2

where f(x) is the spline curve defined piecewise between the m knots, but each type of spline curve has advantages and limitations, which will be discussed after dealing with the subject of replicates. All x, y values must be supplied, and the s values should either be all equal to 1 for unweighted fitting, or equal to the standard deviation of y otherwise. It frequently happens that data sets contain replicates and, to avoid confusion, SIMFIT automatically compresses data sets with replicates before fitting the splines, but then reports residuals and other goodness of fit criteria in terms of the full data set.


If there are groups of replicates, then the sample standard deviations within groups of replicates are calculated interactively for weighting, and the s values supplied are used for single observations. Suppose that there are N distinct x values x_j and at each of these there are k_j replicates, where all of the replicates have the same s value s_j at x = x_j for weighting. Then we would have

n = Σ_{j=1}^N k_j

ȳ_j = (1/k_j) Σ_{i=l}^{k_j+l−1} y_i

WSSQ = Σ_{j=1}^N [(ȳ_j − f(x_j))/(s_j/√k_j)]^2

so, whether users input all n replicates with s = 1 or the standard deviation of y, or just N mean values with s_j equal to the standard errors of the means ȳ_j, the same spline will result. However, incorrect goodness of fit statistics, such as the runs and signs tests or half normal residuals plots, will result if means are supplied instead of all replicates.

3.11.1 Fixed knots


Here the user must specify the number of interior knots and their spacing in such a way that genuine dips, spikes or asymptotes in the data can be modelled by clustering knots appropriately. Four knots are added automatically to correspond to the smallest x value, and four more are also added to equal the largest x value. If the data are monotonic and have no such spike features, then equal spacing can be resorted to, so users only need to specify the actual number of interior knots. The programs calcurve and csafit offer users both of these techniques, as knot values can be provided after the termination of the data values in the data file, while program spline provides the best interface for interactive spline fitting. Fixed knot splines have the advantage that the effect of the number of knots on the best fit curve is fully intuitive; too few knots lead to under-fit, while too many knots cause over-fit. Figure 3.46 illustrates the effect of changing the number of equally spaced knots when fitting the data in compare.tf1 by this technique.
[Graph: two panels, One Interior Knot and Four Interior Knots, showing spline fits to the same data]
Figure 3.46: Splines: equally spaced interior knots

The vertical bars at the knot positions were generated by replacing the default symbols (dots) by narrow (size 0.05) solid bar-chart type bars. It is clear that the fit with one interior knot is quite sufficient to account for the shape of the data, while using four gives a better fit at the expense of excessive undulation. To overcome this limitation of fixed knots, SIMFIT provides the facility to provide knots that can be placed in specified patterns and, to illustrate this, figure 3.47 illustrates several aspects of the fit to e02baf.tf1. The left hand figure shows the result when spline knots were input from the spline file e02baf.tf2, while the right hand figure shows how program spline can be used to predict X given values of Y.


[Graph: two panels, User Defined Interior Knots (with knots 1 and 2 marked) and Calculating X Given Y]
Figure 3.47: Splines: user spaced interior knots


Users simply specify a range of X within the range set by the data, and a value of Y, whereupon the intersection of the dashed horizontal line at the specified value of Y is calculated numerically, and projected down to the X value predicted by the vertical dashed line. Note that, after fitting e02baf.tf1 using knots defined in e02baf.tf2, the best fit spline curve was saved to the file spline.tf1, which can then always be input again into program spline to use as a deterministic equation between the limits set by the data in e02baf.tf1.

3.11.2 Automatic knots


Here the knots are generated automatically and the spline is calculated to minimize

η = Σ_{i=5}^{m−5} δ_i^2,

where δ_i is the discontinuity jump in the third derivative of the spline at the interior knot i, subject to the constraint

0 ≤ WSSQ ≤ F,

where F is user-specified. If F is too large there will be under-fit and the best fit curve will be unsatisfactory, but if F is too small there will be over-fit. For example, setting F = 0 will lead to an interpolating spline passing through every point, while choosing a large F value will produce a best-fit cubic polynomial with η = 0 and no internal knots. In weighted least squares fitting, WSSQ will often be approximately a chi-square variable with degrees of freedom equal to the number of experimental points minus the number of parameters fitted, so choosing a value F ≈ n will often be a good place to start. The programs compare and spline provide extensive options for fitting splines of this type. Figure 3.48, for example, illustrates the effect of fitting e02bef.tf1 using smoothing factors of 1.0, 0.5, and 0.1.
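SIMFIT uses NAG library routines for this, but the same automatic knot placement subject to a smoothing condition is available in the FITPACK routines wrapped by scipy, where the parameter s plays the role of F above; a sketch with simulated data follows.

import numpy as np
from scipy.interpolate import splrep

rng = np.random.default_rng(2)
x = np.linspace(0.0, 8.0, 40)
sdev = 0.2
y = np.sin(x) + rng.normal(0.0, sdev, x.size)
w = np.full(x.size, 1.0 / sdev)    # weights 1/s_i

# Knots are chosen automatically subject to sum((w*(y - f))**2) <= s,
# and s = n (the number of points) is the analogous starting choice
for s in (float(x.size), x.size / 2.0, x.size / 10.0):
    t, c, k = splrep(x, y, w=w, s=s)
    print(f"s = {s:5.1f}: {len(t) - 8} interior knots")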

3.11.3 Cross validation


Here there is one knot for each distinct x value and the spline f(x) is calculated as that which minimizes

WSSQ + λ ∫ (f″(x))^2 dx.

As with the automatically generated knots, a large value of the smoothing parameter λ gives under-fit while λ = 0 generates an interpolating spline, so assigning λ controls the overall fit and smoothness. As splines are linear in parameters then a matrix H can be found such that

ŷ = Hy,

and the degrees of freedom ν can be defined in terms of the leverages h_ii in the usual way as

ν = Trace(I − H) = Σ_{i=1}^N (1 − h_ii).


[Graph: three panels showing spline fits with smoothing factors F = 1.0, F = 0.5, and F = 0.1]
Figure 3.48: Splines: automatically spaced interior knots

This leads to three ways to specify the spline coefficients by finding λ.

1. The degrees of freedom can be specified as ν = ν_0, and λ can be estimated such that Trace H = ν_0.

2. The cross validation CV can be minimized by varying λ, where the r_i are residuals, and

CV = (1/N) Σ_{i=1}^N [r_i/(1 − h_ii)]^2

3. The generalized cross validation GCV can be minimized by varying λ, where

GCV = N Σ_{i=1}^N r_i^2 / [Σ_{i=1}^N (1 − h_ii)]^2

3.11.4 Using splines


Splines are defined by knots and coefficients rather than equations, so special techniques are required to re-use best fit functions. Input a spline file such as spline.tf1 into program spline to appreciate how to re-use a best fit spline stored from spline, calcurve, or compare, to estimate derivatives, areas, curvatures and arc lengths. SIMFIT spline files of length k ≥ 12, such as spline.tf1, have (k + 4)/2 knots, then (k − 4)/2 coefficients, as follows.

- There must be at least 8 nondecreasing knots.
- The first 4 of these knots must all be equal to the lowest x value.
- The next (k − 12)/2 must be the non-decreasing interior knots.
- The next 4 of these knots must all be equal to the highest x value.


- Then there must be (k − 4)/2 spline coefficients c_i.

With n̄ spline intervals (i.e. one greater than the number of interior knots), λ_1, λ_2, λ_3, λ_4 are knots corresponding to the lowest x value, λ_5, λ_6, . . . , λ_{n̄+3} are interior knots, while λ_{n̄+4}, λ_{n̄+5}, λ_{n̄+6}, λ_{n̄+7} correspond to the largest x value. Then the best-fit spline f(x) is

f(x) = Σ_{i=1}^{n̄+3} c_i N_i(x),

where the c_i are the spline coefficients, and the N_i(x) are normalized B-splines of degree 3 defined on the knots λ_i, λ_{i+1}, . . . , λ_{i+4}. When the knots and coefficients are defined in this way, the function y = f(x) can be used as a model-free best fit curve to obtain point estimates for the derivatives y′, y″, y‴, as well as the area A, arc length L, or total absolute curvature K over a range α ≤ x ≤ β, defined as

A = ∫_α^β y dx
L = ∫_α^β √(1 + y′^2) dx
K = ∫_0^L |y″|/(1 + y′^2)^{3/2} dl
  = ∫_α^β |y″|/(1 + y′^2) dx

which are valuable parameters to use when comparing data sets. For instance, the arc length s provides a valuable measure of the length of the fitted curve, while the total absolute curvature indicates the total angle turned by the tangent to the curve and indicates the amount of oscillatory behaviour. Table 3.110 presents typical results from the fitting illustrated in figure 3.48.

From spline fit with 1 automatic knot,  WSSQ = 1.000E+00
From spline fit with 5 automatic knots, WSSQ = 5.001E-01
From spline fit with 8 automatic knots, WSSQ = 1.000E-01

Spline knots and coefficients from fitting the file: C:\simfit5\temp\e02bef.tf1

X-value     spline      1st.deriv.  2nd.deriv.  3rd.deriv.
2.000E+00   2.125E+00   1.462E+00   2.896E+00   1.243E+01
4.000E+00   4.474E+00   5.562E-01   1.905E+00   5.211E+00
6.000E+00   5.224E+00   1.058E+00   2.932E-01  -1.912E+00

A           B           Area        s=Arc-length  Integral|K|ds  (In degrees)
0.000E+00   8.000E+00   3.092E+01   1.280E+01     4.569E+00      2.618E+02
2.000E+00   6.000E+00   1.687E+01   5.574E+00     3.715E+00      2.129E+02
3.000E+00   5.000E+00   8.970E+00   2.235E+00     2.316E+00      1.327E+02

Table 3.110: Spline calculations
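The same point estimates can be obtained from any cubic spline held as knots and coefficients; a minimal sketch using scipy's (knots, coefficients, degree) representation, with simple trapezoidal quadrature for L and K.

import numpy as np
from scipy.interpolate import splrep, splev, splint

x = np.linspace(0.0, 8.0, 41)
tck = splrep(x, np.sin(x), s=0.0)      # knots, coefficients, degree 3

a, b = 2.0, 6.0
area = splint(a, b, tck)               # A = integral of y dx over (a, b)

xx = np.linspace(a, b, 2001)
d1 = splev(xx, tck, der=1)             # y'
d2 = splev(xx, tck, der=2)             # y''
trap = lambda f: np.sum((f[1:] + f[:-1]) * np.diff(xx)) / 2.0
L = trap(np.sqrt(1.0 + d1**2))         # arc length
K = trap(np.abs(d2) / (1.0 + d1**2))   # total absolute curvature, in radians
print(area, L, K, np.degrees(K))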

Part 4

Graph plotting techniques


4.1 Graphical objects
4.1.1 Symbols

[Graph: four panels, Plotting Symbols; Size and Line Thickness; Bar Fill Styles; Bar Width and Thickness]

Figure 4.1: Symbols, fill styles, sizes and widths.


Figure 4.1 shows how individual sizes and line thicknesses of plotting symbols can be varied independently. Also, bars can be used as plotting symbols, with the convention that the bars extend from a baseline up to the x, y coordinate of the current point. Such bars can have widths, fill-styles and line thicknesses varied.

4.1.2 Lines: standard types


There are four standard SIMFIT line types, normal, dashed, dotted and dot-dashed, and error bars can terminate with or without end caps if required, as shown in figure 4.2. Special effects can be created using stair step lines, which can be used to plot cdfs for statistical distributions, or survival curves from survivor functions, and vector type lines, which can be used to plot orbits of differential equations. Note that steps can be first y then x, or first x then y, while vector arrows can point in the direction of increasing or decreasing t, and lines can have variable thickness.

[Graph: four panels, Line Types; Error Bars; Line Thickness; Steps and Vectors]

Figure 4.2: Lines: standard types

Program simplot reads in default options for the sequence of line types, symbol types, colors, barchart styles, piechart styles and labels, which will then correspond to the sequence of data files. Changes can be made interactively and stored as graphics configuration templates if required. However, to make permanent changes to the defaults, you configure the defaults from the main SIMFIT configuration option, or from program simplot.


4.1.3 Lines: extending to boundaries


Figure 4.3 illustrates the alternative techniques available in SIMFIT when the data to be plotted are clipped to boundaries, so as to eliminate points that are identified by symbols and also joined by lines.
[Graph: three panels showing the same data plotted with lines clipped at the boundaries in different ways]
Figure 4.3: Lines: extending to boundaries

The first graph shows what happens when the test file zigzag.tf1 was plotted with dots for symbols and lines connecting the points, but with all the data within the boundaries. The second graph illustrates how the lines can be extended to the clipping boundary to indicate the direction in which the next undisplayed symbol is located, while the third figure shows what happens when the facility to extend lines to boundaries is suppressed. Note that these plots were first generated as .ps files using the flat-shape plotting option, then a PostScript x stretch factor of 2 (page 248) was selected, followed by the use of GSview to transform to .eps and so recalculate the BoundingBox parameters.

222

S IMFIT reference manual: Part 4

4.1.4 Text
Figure 4.4 shows how fonts can be used in any size or rotation and with many nonstandard accents.

[Graph: four panels, Fonts (the Times, Helvetica, Courier and Symbol families); Size and Rotation Angle; Maths and Accents; the IsoLatin1Encoding vector with octal codes 220-377]
Figure 4.4: Text, maths and accents.

Special effects can be created using graphics fonts such as ZapfDingbats, or user-supplied dedicated special effect functions, as described elsewhere (page 338). Scientific symbols and simple mathematical equations can be generated, but the best way to get complicated equations, chemical formulas, photographs or other bitmaps into SIMFIT graphs is to use PSfrag or editps. Figure 4.4 demonstrates several possibilities for displaying mathematical formulae directly in SIMFIT graphs, and it also lists the octal codes for some commonly required characters from the IsoLatin1 encoding. Actually, octal codes can be typed in directly (e.g., \361 instead of ñ), but note that text strings in SIMFIT plots can be edited at two levels: at the simple level only standard characters can be typed in, but at the advanced level nonstandard symbols and maths characters can be selected from a font table. Note that, while accents can be added individually to any standard character, they will not be placed so accurately as when using the corresponding hard-wired characters, e.g., from the IsoLatin1 encoding.


4.1.5 Fonts, character sizes and line thicknesses


The fonts, letter sizes, and line thicknesses used in SIMFIT graphics are those chosen from the PostScript menu, so, whenever a font or line thickness is changed, the new details are written to the PostScript configuration file w_ps.cfg. If the size or thickness selected is not too extreme, it will then be stored as the default to be used next time. However, it should be noted that, when the default sizes are changed, the titles, legends, labels, etc. may not be positioned correctly. You can, of course, always make a title, legend, or label fit correctly by moving it about, but, if this is necessary, you may find that the defaults are restored next time you use SIMFIT graphics. If you insist on using an extremely small or extremely large font size or line thickness and SIMFIT keeps restoring the defaults, then you can overcome this by editing the PostScript configuration file w_ps.cfg and making it read-only. Users who know PostScript will prefer to use the advanced PostScript option, whereby the user's own header file can be automatically added to the PostScript file after the SIMFIT dictionary has been defined, in order to re-define the fonts and line thicknesses, or introduce new definitions, logos, plotting symbols, etc.

4.1.6 Arrows
Figure 4.5 shows that arrows can be of three types: line, hollow or solid, and these can be of any size.

[Graph: two panels, Arrow Types (line, outline and solid arrows) and Arrows and Boxes (transparent and opaque boxes)]
Figure 4.5: Arrows and boxes

However, use can be made of headless arrows to create special effects. From this point of view a headless line arrow is simply a line, which can be solid, dashed, dotted or dash-dotted. These are useful for adding arbitrary lines. A headless outline arrow is essentially a box, which can be of two types: transparent or opaque. Note that the order of priority in plotting is

Extra Text > Graphical Objects > Data plotted, titles and legends

and this allows boxes to be used to simply obliterate plotted data or to surround extra text allowing the background to show through. Transparent boxes are useful for surrounding information panels, opaque boxes are required for chemical formulae or mathematical equations, while background colored solid boxes can be used to blank out features as shown in figure 4.5. To surround a text string by a rectangular box for emphasis, position the string, generate a transparent rectangular box, then drag the opposing corners to the required coordinates.


4.1.7 Example of plotting without data: Venn diagram


It is possible to use program simplot as a generalized diagram drawing program without any data points, as illustrated in figure 4.6.

[Diagram: Venn Diagram for the Addition Rule, showing overlapping sets A and B and the formula P{A∪B} = P{A} + P{B} - P{A∩B}]

Figure 4.6: Venn diagrams

The procedure used to create such graphs using any of the SIMFIT graphical objects will be clear from the details now given for this particular Venn diagram.

- Program simplot was opened using an arbitrary dummy graphics coordinate file.
- The [Data] option was selected and it was made sure that the dummy data would not be plotted as lines or symbols, by suppressing the lines and symbols for this dummy data set.
- This transformed program simplot into an arbitrary diagram creation mode and the display became completely blank, with no title, legends, or axes.
- The circles were chosen (as objects) to be outline circle symbols, the box was selected (as an arrow-line-box) to be a horizontal transparent box, the text strings were composed (as text objects), and finally the arrow was chosen (as an arrow-line-box) to be a solid script arrow.
- The diagram was then completed by editing the text strings (in the expert mode) to introduce the mathematical symbols.


4.1.8 Polygons
Program simplot allows filled polygons as an optional linetype, so any set of n coordinates (x_i, y_i) can be joined up sequentially to form a polygon, which can be empty if a normal line is selected, or filled with a chosen color if the filled polygon option is selected. If the last (x_n, y_n) coordinate pair is not the same as the first (x_1, y_1), the polygon will be closed automatically. This technique allows the creation of arbitrary plotting objects of any shape, as will be evident from the sawtooth plot and stars in figure 4.7.

[Graph: Plotting Polygons, showing a filled sawtooth, an open star and a filled star]
Figure 4.7: Polygons

The sawtooth graph above was generated from a set of (x, y) points in the usual way, by suppressing the plotting symbol but then requesting a filled polygon linetype, colored light gray. The open star was generated from coordinates that formed a closed set, but then suppressing the plotting symbol and requesting a normal, i.e. solid, linetype. The filled star was created from a similar set, but selecting a filled polygon linetype, colored black. If you create a set of ASCII text plotting coordinate files containing arbitrary polygons, such as logos or special plotting symbols, these can be added to any graph. However, since the files will simply be sets of coordinates, the position and aspect ratio of the resulting objects plotted on your graph will be determined by the ranges you have chosen for the x and y axes, and the aspect ratio chosen for the plot. Clearly, objects created in this way cannot be dragged and dropped or re-scaled interactively. The general rule is that the axes, title, plot legends, and displayed data exist in a space that is determined by the range of data selected for the coordinate axes. However, extra text, symbols, arrows, information panels, etc. occupy a fixed space that does not depend on the magnitude of data plotted. So, selecting an interactive data transformation will alter the position of data dependent structures, but will not move any extra text, lines, or symbols.


4.2 Sizes and shapes


4.2.1 Alternative axes and labels
It is useful to move axes to make plots more meaningful, and it is sometimes necessary to hide labels, as with the plot of y = x^3 in figure 4.8, where the second x and third y labels are suppressed. The figure also illustrates moving an axis in barcharts with bars above and below a baseline.
[Graph: two panels, a plot of y = x^3 with moved axes and suppressed labels, and a barchart of Value Recorded against Time in Days with bars above and below a baseline]
Figure 4.8: Axes and labels

4.2.2 Transformed data


Data should not be transformed before analysis as this distorts the error structure, but there is much to be said for viewing data with error bars and best fit curves in alternative spaces, and program simplot transforms automatically as in figure 4.9.


[Graph: ten panels showing the same data in different spaces: Original x,y Coordinates; Dixon Plot; Single Reciprocal Plot; Eadie-Hofstee Plot; Hanes Plot; x-Semilog Plot; Hill Plot; Log-Log Plot; Lineweaver-Burk Plot (1:1 fit extrapolated, 2:2 fit); Scatchard Plot]
Figure 4.9: Plotting transformed data


4.2.3 Alternative sizes, shapes and clipping


Plots can have horizontal, square or vertical format as in figure 4.10, and user-defined clipping schemes can be used. After clipping, SIMFIT adds a standard BoundingBox so all plots with the same clipping scheme will have the same absolute size but, when GSview/Ghostscript transforms ps into eps, it clips individual files to the boundary of white space and the desirable property of equal dimensions will be lost.
[Graph: the circle x^2 + y^2 = 1 plotted in Horizontal Format, Square Format and Vertical Format]
Figure 4.10: Sizes, shapes and clipping.

4.2.4 Rotated and re-scaled graphs


PostScript files can be read into editps, which has options for re-sizing, re-scaling, editing, rotating, making collages, etc. In figure 4.11 the box and whisker plot was turned on its side to generate a side-on barchart. To do this sort of thing you should learn how to browse a SIMFIT PostScript file in the SIMFIT viewer to read BoundingBox coordinates, in PostScript units of 72 to one inch, and calculate how much to translate, scale, rotate, etc.
[Graph: a box and whisker plot of Percentage Improvement In Overall Output for January to May, shown upright and rotated on its side]
Figure 4.11: Rotating and re-scaling

PostScript users should be warned that the special structure of SIMFIT PostScript files that allows extensive retrospective editing using editps, or more easily, if you know how, using a simple text editor like notepad, is lost if you read such graphs into a graphics editor program like Adobe Illustrator. Such programs start off by redrawing vector graphics files into their own conventions, which are only machine readable.


4.2.5 Changed aspect ratios and shear transformations


The barchart in figure 4.12 below was scaled to make the X-axis longer than the Y-axis and vice-versa, but note how this type of differential scaling changes the aspect ratio as illustrated. Since rotation and scaling do not commute, the effect created depends on the order of concatenation of the transformation matrices. For instance, scaling then rotation causes shearing, which can be used to generate 3-dimensional perspective effects as in the last sub-figure.
[Graph: the barchart Bar Chart Overlaps, Groups and Stacks shown with different aspect ratios and with shear transformations giving a 3-dimensional perspective effect]
Figure 4.12: Aspect ratios and shearing effects
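The non-commutativity is easy to demonstrate by multiplying 2 by 2 transformation matrices, as in this short illustration in Python with numpy.

import numpy as np

theta = np.radians(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation by 30 degrees
S = np.diag([2.0, 1.0])                           # differential scaling in x

print(R @ S)   # scale first, then rotate
print(S @ R)   # rotate first, then scale: a different, sheared, result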


4.2.6 Reduced or enlarged graphs


[Graph: the barchart Response Against Time shown at several reductions, with and without compensating increases in line thickness and font size]
Figure 4.13: Resizing fonts

It is always valuable to be able to edit a graph retrospectively, to change line or symbol types, eliminate unwanted data, suppress error bars, change the title, and so on. SIMFIT PostScript files are designed for just this sort of thing, and a typical example would be altering line widths and font sizes as a figure is re-sized. In figure 4.13 the upper sub-figures are derived from the large figure by reduction, so the text becomes progressively more difficult to read as the figures scale down. In the lower sub-figures, however, line thicknesses and font sizes have been increased as the figure is reduced, maintaining legibility. Such editing can be done interactively, but SIMFIT PostScript files are designed to make such retrospective editing easy, as described in the w_readme.* files and now summarized.

Line thickness: changing 11.00 setlinewidth to 22 setlinewidth doubles, while, e.g., 5.5 setlinewidth halves all line thicknesses, etc. Relative thicknesses are set by simplot.

Fonts: Times-Roman, Times-Bold, Helvetica, Helvetica-Bold (set by simplot), or, in fact, any of the fonts installed on your printer.

Texts: ti (title), xl (x legend), yl (y legend), tc (centered for x axis numbers), tl (left to right), tr (right to left), td (rotated down), ty (centered for y axis numbers).

Lines: pl (polyline), li (line), da (dashed line), do (dotted line), dd (dashed dotted).

Symbols: ce (circle-empty), ch (circle-half-filled), cf (circle-filled), and similarly for triangles (te, th, tf), squares (se, sh, sf) and diamonds (de, dh, df). Coordinates and sizes are next to the abbreviations to move, enlarge, etc.

If files do not print after editing you have probably added text to a string without padding out the key. Find the fault using the GSview/Ghostscript package then try again.
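Such global edits can also be scripted; the following hypothetical sketch in Python (the file names are illustrative) doubles every setlinewidth value in a SIMFIT PostScript file, exactly as described above.

import re

with open("graph.ps") as f:        # hypothetical input file
    text = f.read()

# Double each "<width> setlinewidth" occurrence
text = re.sub(r"([0-9]+(?:\.[0-9]+)?)(\s+setlinewidth)",
              lambda m: f"{2.0 * float(m.group(1)):.2f}{m.group(2)}",
              text)

with open("graph_thick.ps", "w") as f:
    f.write(text)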


4.2.7 Split axes


Sometimes split axes can show data in a more illuminating manner, as in figure 4.14. The options are to delete the zero time point and use a log scale to compress the sparse asymptotic section, or to cut out the uninformative part of the best fit curve between 10 and 30 days.
[Graph: Fraction of Final Size against Time (Days), plotted with a full linear axis and with a split axis omitting the region between 10 and 30 days]
Figure 4.14: Split axes

Windows users can do such things with enhanced metafiles (*.emf), but there is a particularly powerful way for PostScript users to split SIMFIT graphs in this way. When the SIMFIT PostScript file is being created there is a menu selectable shape option that allows users to chop out, re-scale, and clip arbitrary pieces of graphs, but in such a way that the absolute position, aspect ratio, and size of text strings does not change. In this way a master graph can be decomposed into any number of appropriately scaled slave sub-graphs. Then editps can be used to compose a graph consisting of the sub-graphs rearranged, repositioned, and resized in any configuration. Figure 4.14 was created in this way after first adding the extra lines shown at the splitting point.


4.3 Equations
4.3.1 Maths
You can add equations to graphs directly, but this will be a compromise, as specialized typesetting techniques are required to display maths correctly. The LaTeX system is pre-eminent in the field of maths typesetting, and the PSfrag system, as revised by David Carlisle and others, provides a simple way to add equations to SIMFIT graphs. For figure 4.15, makdat generated a Normal cdf with μ = 0 and σ = 1, then simplot created cdf.eps with the key phi(x), which was then used by this stand-alone code to generate the figure, where the LaTeX equation substitutes for the key. PostScript users should be aware that the SIMFIT PostScript file format has been specially designed to be consistent with the PSfrag package but, if you want to then use GhostScript to create a graphics file, say .png from .eps, the next section should be consulted.

\documentclass[dvips,12pt]{article}
\usepackage{graphicx}
\usepackage{psfrag}
\pagestyle{empty}
\begin{document}
\large
\psfrag{phi(x)}{$\displaystyle
  \frac{1}{\sigma \sqrt{2\pi}}
  \int_{-\infty}^x \exp\left\{
  -\frac{1}{2} \left( \frac{t-\mu}{\sigma} \right)^2 \right\}\,dt$}
\mbox{\includegraphics[width=6.0in]{cdf.eps}}
\end{document}

[Graph: The Cumulative Normal Distribution Function against x, with the equation (1/(σ√(2π))) ∫_{-∞}^x exp{-(1/2)((t-μ)/σ)^2} dt substituted for the key]
Figure 4.15: Plotting mathematical equations


4.3.2 Chemical Formulæ


LaTeX code, as below, is intended for document preparation and adds white space to the final .ps file. The easiest way round this complication is to add an outline box to the plot, as in figure 4.16. Then, after the .png file has been created, it can be input into, e.g., GIMP, for auto clipping to remove extraneous white space, followed by deletion of the outline box if required.

\documentclass[dvips,12pt]{article}
\usepackage{graphicx}
\usepackage{psfrag}
\usepackage{carom}
\pagestyle{empty}
\begin{document}
\psfrag{formula}
{\begin{picture}(3000,600)(0,0)
\thicklines
\put(0,0){\bzdrv{1==CH$_{2}$NH$_{2}$;4==CH$_{2}$N(Me)$_{2}$}}
\put(700,450){\vector(1,0){400}}
\put(820,550){[O]}
\put(1000,0){\bzdrv{1==CH=0;4==CH$_{2}$N(Me)$_{2}$}}
\put(1650,400){+}
\put(1750,400){NH$_{3}$}
\put(2000,450){\vector(1,0){400}}
\put(2120,550){[O]}
\put(2300,0){\bzdrv{1==C0$_{2}$H;4==CH$_{2}$N(Me)$_{2}$}}
\end{picture}}
\mbox{\includegraphics{chemistry.eps}}
\end{document}

[Graph: Oxidation of p-Dimethylaminomethylbenzylamine, Concentration (mM) against Time (min), with the three benzene-ring structures and reaction arrows displayed via PSfrag]
Figure 4.16: Plotting chemical structures


4.3.3 Composite graphs


The technique used to combine sub-graphs into a composite graph is easy. First use your drawing or painting program to save the figures of interest in the form of eps files. Then the SIMFIT graphs and any component eps files are read into editps to move them and scale them until the desired effect is achieved. In figure 4.17, data were generated using deqsol, error was added using adderr, the simulated experimental data were fitted using deqsol, the plot was made using simplot, the chemical formulae and mathematical equations were generated using LaTeX, and the final graph was composed using editps.

[Composite graph: A kinetic study of the oxidation of p-Dimethylaminomethylbenzylamine, combining the chemical structures, the kinetic differential equations, and the fitted curves x(t), y(t) and z(t) against t (min)]
Figure 4.17: Chemical formulas


4.4 Bar charts and pie charts


4.4.1 Perspective effects
Perspective can be useful in presentation graphics but it must be realized that, when pie chart segments are moved centrifugally, spaces adjacent to large segments open more than those adjacent to small sections. This creates an impression of asymmetry to the casual observer, but it is geometrically correct. Again, diminished curvature compared to the perimeter as segments move out becomes more noticeable where adjacent segments have greatly differing sizes so, to minimize this effect, displacements can be adjusted individually. A PostScript special (page 353) has been used to display the logo in figure 4.18.

[Graph: three panels, Pie Chart Fill Styles (segments displaced centrifugally, with pie keys 1 to 10), Perspective Effects In Bar Charts (ranges, quartiles, medians for January to May), and a SIMFIT logo displayed with a PostScript special]
Figure 4.18: Perspective in barcharts, box and whisker plots and piecharts


4.4.2 Advanced barcharts


SIMFIT can plot barcharts directly from data matrices, using the exhaustive analysis of a matrix procedure in simstat, but there is also an advanced barchart file format which gives users complete control over every individual bar, etc., as now summarized and illustrated in figure 4.19.

- Individual bars can have arbitrary position, width, fill style and color.
- Bars can be overlapped, grouped, formed into hanging groups or stacked vertically.
- Error bars can be capped or straight, and above or below symbols.
- Extra features such as curves, arrows, panel identifiers or extra text can be added.

[Graph: Bar Chart Features, showing normal bars, overlaps, groups, stacks, hanging groups and a box/whisker bar, with a baseline at 0% and values from -35% to 55%]
Figure 4.19: Advanced bar chart features

Of course the price that must be paid for such versatility is that the file format is rather complicated, and the best way to understand it is to consult the w_readme files for details, browse the test files barchart.tf?, then read them into simplot to observe the effect produced before trying to make your own files. Labels can be added automatically, appended to the data file, edited interactively, or input as simple text files.


4.4.3 Three dimensional barcharts


The SIMFIT surface plotting function can be used to plot three dimensional bars as, for example, using the test file barcht3d.tf1 to generate figure 4.20. Blank rows and shading can also be added to enhance the three dimensional effect.

[Graph: two three dimensional bar charts, percentages by month (January to June) over Year 1 to Year 5, the second with blank rows to enhance the three dimensional effect]
Figure 4.20: Three dimensional barcharts

Such plots can be created from n by m matrix files, or special vector files, e.g. with n values for x and m values for y an nm + 6 vector is required containing n, then m, then the range of x and the range of y, say (0, 1) and (0, 1) if arbitrary, followed by the values of f(x, y) in order of increasing x at consecutive increasing values of y.
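The ordering of such a vector is simple to generate programmatically; a hedged sketch in Python (the file name is hypothetical, and only the nm + 6 values are shown, so consult the w_readme files for the exact SIMFIT file layout, including title and count lines).

import numpy as np

n, m = 5, 6                                 # n values of x, m values of y
rng = np.random.default_rng(3)
f = rng.uniform(0.0, 100.0, size=(n, m))    # f[i, j] = f(x_i, y_j)

# n, m, the x and y ranges, then f(x, y) in order of increasing x
# at consecutive increasing values of y
vec = np.concatenate(([n, m, 0.0, 1.0, 0.0, 1.0], f.flatten(order="F")))
np.savetxt("barcht3d_demo.dat", vec)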


4.5 Error bars


4.5.1 Error bars with barcharts
Barcharts can be created interactively from a table of values. For example, figure 4.21 was generated by the exhaustive analysis of a matrix procedure in simstat from matrix.tf1.
[Graph: two barcharts from the same matrix, one with the original axes and labels, one relabelled as Number Infected for April to August]
Figure 4.21: Error bars 1: barcharts

If the elements are measurements, the bars would be means, while error bars should be calculated as 95% confidence limits, i.e. assuming a normal distribution. Often one standard error of the mean is used instead of confidence limits to make the data look better, which is dishonest. If the elements are counts, approximate error bars should be added to the matrix file in simplot from a separate file, using twice the square root of the counts, i.e. assuming a Poisson distribution. After creating barcharts from matrices, the temporary advanced barchart files can be saved.


4.5.2 Error bars with skyscraper and cylinder plots


Barcharts can be created for tables, z(i, j) say, where cells are values for plotting as a function of x (rows) and y (columns). The x, y values are not required, as such plots usually require labels, not numbers. Figure 4.22 shows the plot generated by simplot from the test file matrix.tf2.

[Graph: Simfit Skyscraper Plot with Error Bars and Simfit Cylinder Plot with Error Bars, values for Case 1 to Case 5 over Month 1 to Month 7]
Figure 4.22: Error bars 2: skyscraper and cylinder plots

Errors are added from a file, and are calculated according to the distribution assumed. They could be twice square roots for Poisson counts, binomial errors for proportions or percentages, or they could be calculated from sample standard deviations using the t distribution for means. As skyscraper plots with errors are dominated by vertical lines, error bars are plotted with thickened lines, but a better solution is to plot cylinders instead of skyscrapers, as illustrated.


4.5.3 Slanting and multiple error bars


Error bar files can be created by program edit after editing curve fitting files with all replicates, and such error bars will be symmetrical, representing central confidence limits in the original (x, y) space. But note that these error bars can become unsymmetrical or slanting as a result of a transformation, e.g. log(y) or Scatchard, using program simplot. Program binomial will, on the other hand, always generate noncentral confidence limits, i.e. unsymmetrical error bars, for binomial parameter confidence limits and Log-Odds plots. However, sometimes it is necessary to plot asymmetrical error bars, slanting error bars or even multiple error bars. To understand this, note that the standard error bar test file errorbar.tf1 contains four columns with the x coordinate for the plotting symbol, then the y-coordinates for the lower bar, middle bar and upper bar. However, the advanced error bar test file errorbar.tf2 has six columns, so that the (x1, y1), (x2, y2), (x3, y3) coordinates specified can create any type of error bar, even multiple error bars, as will be seen in figure 4.23.

[Graph: Slanting and Multiple Error Bars, y against x]
Figure 4.23: Error bars 3: slanting and multiple

Note that the normal error bar files created interactively from replicates by edit, qnfit, simplot, or compare will only have four columns, like errorbar.tf1, with x, y1, y2, y3, in that order. The six-column files like errorbar.tf2 required for multiple, slanting, or unsymmetrical error bars must be created as matrix files with the columns containing x1, x2, x3, y1, y2, y3, in that order.


4.5.4 Calculating error bars interactively


Figure 4.24 shows the best fit curve estimated by qnfit when fitting a sum of three Gaussians to the test file gauss3.tf1 using the expert mode. Note that all the data must be used for fitting, not means. edit can generate error bar plotting files from such data files with replicates, but error bars can also be calculated interactively after fitting, as illustrated for 95% confidence limits.

[Graph: two panels, Data and Best Fit Curve (all replicates) and Means and Best Fit Curve (with error bars calculated interactively)]
Figure 4.24: Error bars 4: calculated interactively


4.5.5 Binomial parameter error bars


Figure 4.25 shows binomial parameter estimates for y successes in N trials. The error bars represent exact, unsymmetrical confidence limits (see page 190), not those calculated using the normal approximation.
[Graph: Binomial Parameter Estimates, p̂ = y/N with confidence limits against Control Variable x]
Figure 4.25: Error bars 5: binomial parameters

4.5.6 Log-Odds error bars


Figure 4.25 can also be manipulated by transforming the estimates p̂ = y/N and confidence limits. For instance, the ratio of success to failure (i.e. Odds, y/(N − y)) or its logarithm (i.e. Log Odds) can be used, as in figure 4.26, to emphasize deviation from a fixed p value, e.g. p = 0.5 with a log-odds of 0. Figure 4.26 was created from a simple log-odds plot by using the [Advanced] option to transfer the x, p̂/(1 − p̂) data into simplot, then selecting a reverse y-semilog transform.
[Graph: Log Odds Plot, Control Variable x against log10[p̂/(1 − p̂)]]

Figure 4.26: Error bars 6: log odds


4.5.7 Log-Odds-Ratios error bars


It is often useful to plot Log-Odds-Ratios, so the creation of figure 4.27 will be outlined.

(1) The data. Test files meta.tf1, meta.tf2, and meta.tf3 were analyzed in sequence using the SIMFIT Meta Analysis procedure (page 127). Note that, in these files, column 3 contains spacing coordinates so that data will be plotted consecutively.

(2) The ASCII coordinate files. During Meta Analysis, 100(1 − α)% confidence limits on the Log-Odds-Ratio resulting from a 2 by 2 contingency table with cell frequencies n_ij can be constructed from the approximation log-odds-ratio ± e, where

e = Z_{α/2} √(1/n11 + 1/n12 + 1/n21 + 1/n22)

(a numerical sketch of this calculation follows this list). When Log-Odds-Ratios with error bars are displayed, the overall values (shown as filled symbols) with error bars are also plotted with an x coordinate one less than the smallest x value on the input file. For this figure, error bar coordinates were transferred into the project archive using the [Advanced] option to save ASCII coordinate files.

(3) Creating the composite plot. Program simplot was opened and the six error bar coordinate files were retrieved from the project archive. Experienced users would do this more easily using a library file of course. A reverse y-semilog transformation was selected, symbols were chosen, axes, title, and legends were edited, then half bracket hooks identifying the data were added as arrows and extra text.

(4) Creating the PostScript file. Vertical format was chosen then, using the option to stretch PostScript files (page 248), the y coordinate was stretched by a factor of two.

(5) Editing the PostScript file. To create the final PostScript file for LaTeX a tighter bounding box was calculated using gsview then, using notepad, clipping coordinates at the top of the file were set equal to the BoundingBox coordinates, to suppress excess white space. This can also be done using the [Style] option to omit painting a white background, so that PostScript files are created with transparent backgrounds, i.e. no white space, and clipping is irrelevant.
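The half-width e from step (2) is simple to compute; a small sketch in Python with scipy, using a hypothetical 2 by 2 table.

import numpy as np
from scipy.stats import norm

n11, n12, n21, n22 = 12, 28, 23, 17     # hypothetical cell frequencies
alpha = 0.05

log_or = np.log((n11 * n22) / (n12 * n21))
e = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
print(log_or - e, log_or, log_or + e)   # 100(1 - alpha)% limits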

[Figure: Log-Odds-Ratio error bars for meta.tf1, meta.tf2, and meta.tf3, plotted against log10[Odds Ratios].]
Figure 4.27: Error bars 6: log odds ratios
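To clarify the approximation used in step (2), a minimal Python sketch follows (scipy assumed; the cell frequencies are hypothetical, and this is not SIMFIT's own code).

import math
from scipy.stats import norm

def log10_odds_ratio_ci(n11, n12, n21, n22, alpha=0.05):
    log_or = math.log10((n11 * n22) / (n12 * n21))
    # large-sample standard error of ln(OR), rescaled to base 10
    se = math.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22) / math.log(10)
    z = norm.ppf(1 - alpha / 2)
    return log_or - z * se, log_or, log_or + z * se

print(log10_odds_ratio_ci(10, 20, 15, 30))   # roughly (-0.43, 0.0, 0.43)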

4.6 Statistical graphs


4.6.1 Clusters, connections, correlations, and scattergrams
Clusters are best plotted as sideways displaced and reduced symbols, while connections need individual data files for distinct lines and symbols, as in figure 4.28.

[Figure: "Plotting Clusters and Connections"; scores and averages for Smith, Jones, Brown, and Bell, plotted by month from January to June.]

Figure 4.28: Clusters and connections

Since the correlation coefficient r and $m_1$, $m_2$, the regression slopes for y(x) and x(y), are related by $|r| = \sqrt{m_1 m_2}$, SIMFIT plots both regression lines, as in figure 4.29. Other alternatives are plotting the major, or reduced major axes, single best fit lines, which allow for variation in both x and y, or indicating the inclination of the best fit bivariate normal distribution by confidence ellipses, as discussed next.

[Figure: "Correlations and scattergrams"; Tail Length (cm) against Wing Length (cm), with both best fit regression lines.]

Figure 4.29: Correlations and scattergrams
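The relation between r and the two slopes is easily verified numerically; a minimal Python sketch with simulated data (numpy assumed, not SIMFIT output) is:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)
m1 = np.polyfit(x, y, 1)[0]          # regression slope of y(x)
m2 = np.polyfit(y, x, 1)[0]          # regression slope of x(y)
r = np.corrcoef(x, y)[0, 1]
print(abs(r), np.sqrt(m1 * m2))      # the two values agree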

4.6.2 Bivariate confidence ellipses 1: basic theory


For a p-variate normal sample of size n with mean $\bar{x}$ and variance matrix estimate S, the region
\[ P\left( (\bar{x} - \mu)^T S^{-1} (\bar{x} - \mu) \le \frac{p(n-1)}{n(n-p)} F^{\alpha}_{p,n-p} \right) = 1 - \alpha \]
can be regarded as a 100(1 − α)% confidence region for μ. Figure 4.30 illustrates this for columns 1 and 2 of cluster.tf1 discussed previously (page 134). Alternatively, the region satisfying
\[ P\left( (x - \bar{x})^T S^{-1} (x - \bar{x}) \le \frac{p(n^2-1)}{n(n-p)} F^{\alpha}_{p,n-p} \right) = 1 - \alpha \]
can be interpreted as a region that with probability 1 − α would contain another independent observation x, as shown for the swarm of points in figure 4.30. The confidence region contracts with increasing n, limiting application to small samples, but the new observation ellipse does not, making it useful for visualizing if data do represent a bivariate normal distribution, while inclination of the principal axes away from parallel with the plot axes demonstrates linear correlation. This technique is only justified if the data are from a bivariate normal distribution and are independent of the variables in the other columns, as indicated by the correlation matrix.
[Figure: two panels, "99% Confidence Region for the Mean" (Column 2 against Column 1) and "95% Confidence Region for New Observation".]
Figure 4.30: Confidence ellipses for a bivariate normal distribution
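For readers who want the ellipse coordinates outside SIMFIT, a minimal Python sketch follows (numpy and scipy assumed; the data are simulated, and the construction simply traces the boundary of the region defined above).

import numpy as np
from scipy.stats import f

def mean_ellipse(data, alpha=0.05, points=200):
    # boundary of the 100(1 - alpha)% confidence region for the mean
    n, p = data.shape
    xbar = data.mean(axis=0)
    S = np.cov(data, rowvar=False)               # variance matrix estimate
    c = p * (n - 1) / (n * (n - p)) * f.ppf(1 - alpha, p, n - p)
    L = np.linalg.cholesky(S)                    # S = L L'
    t = np.linspace(0.0, 2.0 * np.pi, points)
    circle = np.vstack((np.cos(t), np.sin(t)))   # unit circle directions
    return xbar[:, None] + np.sqrt(c) * (L @ circle)

rng = np.random.default_rng(0)
xy = rng.multivariate_normal([10.0, 15.0], [[4.0, 2.0], [2.0, 3.0]], size=30)
boundary = mean_ellipse(xy)                      # 2 x 200 ellipse coordinates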

4.6.3 Bivariate confidence ellipses 2: regions


Often a two dimensional swarm of points results from projecting data that have been partitioned into groups into a subspace of lower dimension in order to visualize the distances between putative groups, e.g., after principal components analysis or similar. If the projections are approximately bivariate normal then confidence ellipses can be added, as in figure 4.31. The following steps were used to create figure 4.31 and can be easily adapted for any number of sets of two dimensional group coordinates.

• For each group a file of values for x and y coordinates in the projected space was saved.

• Each file was analyzed for correlation using the SIMFIT correlation analysis procedure.

• After each correlation analysis, the option to create a 95% confidence ellipse for the data was selected, and the ellipse coordinates were saved to file.

• A library file was created with the ellipse coordinates as the first three files, and the groups data files as the next three files.

• The library file was read into simplot, then colors and symbols were chosen.

Note that, because the ellipse coordinates are read in as the first coordinates to be plotted, the option to plot lines as closed polygons can be used to represent the confidence ellipses as colored background regions.

[Figure: "95% Confidence Ellipses"; three groups of points with their ellipses, y against x.]
Figure 4.31: 95% confidence regions

4.6.4 Dendrograms 1: standard format


Dendrogram shape is arbitrary in two ways: the x axis order is arbitrary, as clusters can be rotated around any clustering distance leading to $2^{n-1}$ different orders, and the distance matrix depends on the settings used. For instance, a square root transformation, Bray-Curtis similarity, and a group average link generates the second dendrogram in figure 4.32 from the first. The y values plotted are dissimilarities, while labels are 100 − y, which should be remembered when changing the y axis range. Users should not manipulate dendrogram parameters to create a dendrogram supporting some preconceived clustering scheme. You can set a label threshold and translation distance from the [X-axis] menu so that, if the number of labels exceeds the threshold, even numbered labels are translated, and font size is decreased.
[Figure: two dendrograms of the same data. Upper panel: untransformed data, Euclidean distance, unscaled, single link, with a Distance axis. Lower panel: "Bray-Curtis Similarity Dendrogram" with a Percentage Similarity axis from 0% to 100%.]
Figure 4.32: Dendrograms 1: standard format
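For comparison, the settings used for the second dendrogram can be reproduced outside SIMFIT; a minimal Python sketch (scipy and matplotlib assumed; the data file name is hypothetical) is:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

data = np.loadtxt("mydata.txt")                  # hypothetical case matrix
d = pdist(np.sqrt(data), metric="braycurtis")    # square root transform
Z = linkage(d, method="average")                 # group average link
dendrogram(Z)                                    # y axis shows dissimilarity
plt.show()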

4.6.5 Dendrograms 2: stretched format


SIMFIT PostScript graphs have a very useful feature: you can stretch or compress the white space between plotted lines and symbols without changing the line thickness, symbol size, or font size and aspect ratio. For instance, stretching, clipping and sliding procedures are valuable in graphs which are crowded due to overlapping symbols or labels, as in figure 4.32. If such dendrograms are stretched retrospectively using editps, the labels will not separate, as the fonts will also be stretched so letters become ugly due to altered aspect ratios. SIMFIT can increase white space between symbols and labels while maintaining correct aspect ratios for the fonts in PostScript hardcopy and, to explain this, the creation of figure 4.33 will be described. The title, legend and double x labelling were suppressed, and landscape mode with stretching, clipping and sliding was selected from the PostScript control using the [Shape] then [Landscape +] options, with an x stretching factor of two. Stretching increases the space between each symbol, or the start of each character string, arrow or other graphical object, but does not turn circles into ellipses or distort letters. As graphs are often stretched to print on several sheets of paper, sub-sections of the graph can be clipped out, then the clipped sub-sections can be slid to the start of the original coordinate system to facilitate printing. If stretch factors greater than two are used, legends tend to become detached from axes, and empty white space round the graph increases. To remedy the former complication, the default legends should be suppressed or replaced by more closely positioned legends while, to cure the latter effect, GSview can be used to calculate new BoundingBox coordinates (by transforming .ps to .eps). If you select the option to plot an opaque background even when white (by mistake), you may then find it necessary to edit the resulting .eps file in a text editor to adjust the clipping coordinates (identified by %#clip in the .eps file) and background polygon filling coordinates (identified by %#pf in the .ps file) to trim away unwanted white background borders that are ignored by GSview when calculating BoundingBox coordinates. Another example of this technique is on page 243, where it is also pointed out that creating transparent backgrounds by suppressing the painting of a white background obviates the need to clip away extraneous white space.
[Figure: a stretched dendrogram with a Percentage Similarity axis from 100% down to 0% and well separated case labels.]

Figure 4.33: Dendrograms 2: stretched format

4.6.6 Dendrograms 3: plotting subgroups


The procedure described on page 248 can also be used to improve the readability of dendrograms where subgroups have been assigned by partial clustering (page 142). Figure 4.34 shows a graph from iris.tf1 when three subgroups are requested, or a threshold is set corresponding to the horizontal dotted line. Figure 4.35 was created by these steps. First the title was suppressed, the y-axis range was changed to (0, 4.25) with 18 tick marks, the (x, y) offset was cancelled, as this suppresses axis moving, the label font size was increased from 1 to 3, and the x-axis was translated to 0.8. Then the PostScript stretch/slide/clip procedure was used with these parameters: xstretch = 1.5, ystretch = 2.0, xclip = 0.15, 0.95, yclip = 0.10, 0.60. Windows users without PostScript printing facilities must create a *.eps file using this technique, then use the SIMFIT procedures to create a graphics file they can use, e.g. *.jpg. Use of a larger font and increased x-stretching would be required to read the labels, of course.
[Figure: dendrogram of iris.tf1 with a Distance axis from 0.00 to 4.25 and three subgroups marked by a threshold line.]

Figure 4.34: Dendrograms 3: plotting subgroups

[Figure: the same dendrogram stretched and clipped so that the case labels are legible.]

Figure 4.35: Dendrograms 3: plotting subgroups

4.6.7 K-Means clustering 1: UK airports


Stretching and clipping are also valuable when graphs have to be re-sized to achieve geometrically correct aspect ratios, as in the map shown in figure 4.36, which can be generated by the K-means clustering procedure using program simstat (see page 145) as follows.

• Input ukmap.tf1 with coordinates for UK airports.

• Input ukmap.tf2 with coordinates for starting centroids.

• Calculate centroids then transfer the plot to advanced graphics.

• Read in the UK coastal outline coordinates as an extra file from ukmap.tf3.

• Suppress axes, labels, and legends, then clip away extraneous white space.

• Stretch the PS output using the [Shape] then [Portrait +] options, and save the stretched eps file.

[Figure: "K-Means Clusters"; UK airports and centroids on a coastal outline map.]

Figure 4.36: K-means clustering for UK airports


4.6.8 K-Means clustering 2: highlighting centroids


It is frequently useful to be able to highlight groups of data points in a two dimensional swarm, as in figure 4.37.

[Figure: "K-means cluster centroids"; Variable 2 against Variable 1, with labelled points A-T partitioned into three groups and centroids (47.8, 35.8, 16.3, 2.4, 6.7), (64.0, 25.2, 10.7, 2.8, 6.7), and (81.2, 12.7, 7.2, 2.1, 6.6) highlighted.]
Figure 4.37: Highlighting K-means cluster centroids

In this case a partition into three groups has been done by K-means clustering and, to appreciate how to use this technique, note that figure 4.37 can be generated by the K-means clustering procedure using program simstat (see page 145) as follows, with a code sketch after this list.

• Input the K-means clustering test file kmeans.tf1.

• Calculate the centroids, using the starting estimates appended to the test file. View them, which then adds them to the results file, then record the centroid coordinates from the results file.

• Select to plot the groups with associated labels, but then it will prove necessary to move several of the labels by substituting new labels, or shifting the x or y coordinates to clarify the graph, as described on page 252.

• Add the solid background ellipses using the lines/arrows/boxes option, because both head and tail coordinates must be specified using the red arrow, as well as an eccentricity value for the ellipses. Of course, any filled shapes such as circles, squares, or triangles can be chosen, and any size or color can be used.

• Add the centroid coordinates as extra text strings.

Of course, this technique can be used to highlight or draw attention to any subsets of data points, for instance groups in principal component analysis.
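The clustering step itself can be sketched outside SIMFIT; a minimal Python example (scipy assumed; the file names are hypothetical, and this is not the simstat procedure) is:

import numpy as np
from scipy.cluster.vq import kmeans2

data = np.loadtxt("data.txt")       # hypothetical (n x 2) data matrix
seeds = np.loadtxt("seeds.txt")     # hypothetical (3 x 2) starting centroids
centroids, labels = kmeans2(data, seeds, minit="matrix")
print(centroids)                    # coordinates to add as extra text strings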

4.6.9 Principal components


Principal components for multivariate data can be explored by plotting scree diagrams and scattergrams after using the calculations options in program simstat. If labels are appended to the data file, as with cluster.tf2, they can be plotted, as in figure 4.38.
[Figure: two PC 2 against PC 1 scattergrams with case labels A-1 to L-12; in the second, clashing labels have been moved for clarity.]
Figure 4.38: Principal components

The labels that are usually plotted along the x axis are used to label the points, but moved to the side of the plotting symbol. Colors are controlled from the [Colour] options, as these are linked to the color of the symbol plotted, even if the symbol is suppressed. The font is the one that would be used to label the x axis if labels were plotted instead of numbers. Clearly arbitrary labels cannot be plotted at the same time on the x axis. Often it is required to move the labels because of clashes, as above. This is done by using the x axis editing function, setting labels that clash equal to blanks, then using the normal mechanism for adding arbitrary text and arrows to label the coordinates in the principal components scattergram. To facilitate this process, the default text font is the same as the axes numbering font. Alternatively, the plotting menus provide the option to move labels by defining parameters to shift individual labels horizontally or vertically.
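A minimal Python sketch of the underlying calculation (numpy and matplotlib assumed; the file name and labels are hypothetical) shows principal component scores labelled beside the symbols:

import numpy as np
import matplotlib.pyplot as plt

X = np.loadtxt("matrix.txt")                     # hypothetical data matrix
labels = [f"{c}-{i + 1}" for i, c in enumerate("ABCDEFGHIJKL")]
Xc = X - X.mean(axis=0)                          # column centred data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                               # principal component scores
plt.scatter(scores[:, 0], scores[:, 1])
for (px, py), txt in zip(scores[:, :2], labels):
    plt.annotate(txt, (px, py), xytext=(4, 2), textcoords="offset points")
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.show()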

4.6.10 Labelling statistical graphs


Labels are text strings (with associated template strings) that do not have arbitrary positions, but are plotted to identify the data. Some examples would be as follows.

• Labels adjacent to segments in a pie chart.

• Labels on the X axis to indicate groups in bar charts (page 235).

• Labels on the X axis to identify clusters in dendrograms (page 247).

• Labels plotted alongside symbols in 2D plots, such as principal components (page 252).

Test files such as cluster.tf1 illustrate the usual way to supply labels appended to data files in order to over-ride the defaults set from the configuration options, but sometimes it is convenient to supply labels interactively from a file, or from the clipboard, and not all procedures in SIMFIT use the labels supplied appended to data files. Figure 4.39 illustrates this.

[Figure: two Column 2 against Column 1 scatterplots; the first labelled with default integers, the second with bold maths symbols from a template file.]

Figure 4.39: Labelling statistical graphs

Test file cluster.tf1 was input into the procedure for exhaustive analysis of a matrix in simstat, and the option to plot columns as an advanced 2D plot was selected. This created the left hand figure, where default integer labels indicate row coordinates. Then the option to add labels from a file was chosen, and test file labels.txt was input. This is just lines of characters in alphabetical order to overwrite the default integers. Then the option to read in a template was selected, and test file templates.txt was input. This just contains a succession of lines containing 6, indicating that alphabetical characters are to be plotted as bold maths symbols, resulting in the right hand figure. To summarize, the best way to manipulate labels in SIMFIT plots is as follows.

1. Write the column of case labels, or row of variable labels, from your data-base or spread-sheet program into an ASCII text file.

2. This file should just consist of one label per line and nothing else (like labels.txt).

3. Paste this file at the end of your SIMFIT data matrix file, editing the extra line counter (as in cluster.tf1) as required.

4. If there are n lines of data, the extra line counter (after the data but before the labels) must be at least n to use this label providing technique (see page 378).

5. Alternatively use the more versatile begin{labels} ... end{labels} technique (see page 378).

6. Archive the labels file if interactive use is anticipated as in figure 4.39.

7. If special symbols or accents are required, a corresponding templates file with character display codes (page 349) can be prepared.

4.6.11 Probability distributions


Discrete probability distributions are best plotted using vertical lines, as in figure 4.40 for the binomial distribution but, when fitting continuous distributions, the cdf should be used, since histogram shapes (and parameter estimates) depend on the bins chosen. A good compromise to illustrate goodness of fit is to plot the scaled pdf along with the best fit cdf, as with the beta distribution illustrated.

[Figure: two panels. Upper: "Binomial Probability Plot for N = 50, p = 0.6", with Pr(X = x) drawn as vertical lines. Lower: "Using QNFIT to fit Beta Function pdfs and cdfs", a histogram with pdf fit and a step curve with cdf fit against random number values.]
Figure 4.40: Probability distributions
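A minimal Python sketch of the vertical line style for a discrete distribution (scipy and matplotlib assumed; not SIMFIT code) is:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

x = np.arange(0, 51)
plt.vlines(x, 0.0, binom.pmf(x, 50, 0.6))   # vertical lines for Pr(X = x)
plt.xlabel("x"); plt.ylabel("Pr(X = x)")
plt.title("Binomial Probability Plot for N = 50, p = 0.6")
plt.show()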

4.6.12 Survival analysis


It is often necessary to display survival data for two groups in order to assess the relative hazards. For instance, using gcfit in mode 3, or alternatively simstat, to analyze the test files survive.tf5 and survive.tf6, which contain survival data for stage 3 (group A) and stage 4 (group B) tumors, yields the result that the samples differ significantly according to the Mantel-Haenszel log rank test, and that the estimated value of the proportional hazard constant is 0.3786, with confidence limits 0.1795, 0.7982. Figure 4.41 illustrates the Kaplan-Meier product limit survivor functions for these data using the advanced survival analysis plotting mode. In this plot, the survivor functions are extended to the end of the data range for both sets of data, even though no deaths occurred at the extreme end of the time range.

[Figure: "Kaplan-Meier Product-Limit Survivor Estimates"; Ŝ(t) against days from start of trial for the Stage 3 and Stage 4 groups, with + indicating censored times.]
Figure 4.41: Survival analysis

There are two other features of this graph that should be indicated. The first is that the coordinates supplied for plotting were not in the form of a stair step type of survival curve. They were just for the corners, and the step function was created by choosing a survival function step curve line type, not a cumulative probability type. Another point is that censored observations are also plotted, in this case using plus signs as plotting symbols, which results in the censored observations being identified as short vertical lines.
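The product limit calculation itself is simple; a minimal Python sketch (numpy assumed; the times are hypothetical, with right censoring flagged) is:

import numpy as np

def kaplan_meier(times, censored):
    # returns (time, S(t)) at each distinct death time
    order = np.argsort(times)
    t = np.asarray(times, dtype=float)[order]
    c = np.asarray(censored, dtype=bool)[order]
    at_risk, s, steps = len(t), 1.0, []
    for u in np.unique(t):
        deaths = int(np.sum((t == u) & ~c))
        if deaths > 0:
            s *= 1.0 - deaths / at_risk
            steps.append((u, s))
        at_risk -= int(np.sum(t == u))
    return steps

print(kaplan_meier([3, 5, 5, 8, 10], [False, False, True, False, True]))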

4.6.13 Goodness of fit to a Poisson distribution


After using a Kolmogorov-Smirnov 1-sample or Fisher exact test to estimate goodness of fit to a Poisson distribution, the sample cdf can be compared with the theoretical cdf, as in figure 4.42. The theoretical cdf is shown as a filled polygon for clarity. Also, sample and theoretical frequencies can be compared, as shown for a normal graph with bars as plotting symbols, and sample values displaced to avoid overlapping. Observe that the labels at x = 1 and x = 8 were suppressed, and bars were added as graphical objects to identify the bar types.

[Figure: two panels titled "Goodness Of Fit to a Poisson Distribution, μ = 3". Upper: cumulative distributions, the sample (n = 25) against the theoretical cdf. Lower: sample and theoretical frequencies against values.]
Figure 4.42: Goodness of fit to a Poisson distribution
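A minimal Python sketch of the cdf comparison (scipy assumed; the sample is invented for illustration) is:

import numpy as np
from scipy.stats import poisson

sample = np.array([2, 3, 1, 4, 3, 2, 5, 3, 2, 4, 1, 3, 2, 6, 3,
                   2, 4, 3, 1, 5, 2, 3, 4, 2, 3])     # hypothetical, n = 25
x = np.arange(0, sample.max() + 1)
ecdf = np.array([np.mean(sample <= k) for k in x])    # sample cdf
tcdf = poisson.cdf(x, mu=sample.mean())               # theoretical cdf
print(np.max(np.abs(ecdf - tcdf)))                    # KS type statistic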

4.6.14 Trinomial parameter joint confidence regions


A useful rule of thumb to see if parameter estimates differ significantly is to check their approximate central 95% confidence regions. If the regions are disjoint it indicates that the parameters differ significantly and, in fact, parameters can differ significantly even with limited overlap. If two or more parameters are estimated, it is valuable to inspect the joint confidence regions defined by the estimated covariance matrix and appropriate chi-square critical value. Consider, for example, figure 4.43 generated by the contour plotting function of binomial. Data triples x, y, z can be any partitions, such as number of male, female or dead hatchlings from a batch of eggs where it is hoped to determine a shift from equi-probable sexes. The contours are defined by
\[ (\hat{p}_x - p_x, \; \hat{p}_y - p_y) \begin{pmatrix} \hat{p}_x(1-\hat{p}_x)/N & -\hat{p}_x\hat{p}_y/N \\ -\hat{p}_x\hat{p}_y/N & \hat{p}_y(1-\hat{p}_y)/N \end{pmatrix}^{-1} \begin{pmatrix} \hat{p}_x - p_x \\ \hat{p}_y - p_y \end{pmatrix} = \chi^2_{2,0.05} \]
where N = x + y + z, $\hat{p}_x = x/N$ and $\hat{p}_y = y/N$, as discussed on page 191. When N = 20 the triples 9,9,2 and 7,11,2 cannot be distinguished, but when N = 200 the orbits are becoming elliptical and converging to asymptotic values. By the time N = 600 the triples 210,330,60 and 270,270,60 can be seen to differ significantly.

[Figure: "Trinomial Parameter 95% Confidence Regions"; py against px contours for the triples 7,11,2; 70,110,20; 210,330,60; 270,270,60; 90,90,20; and 9,9,2.]
Figure 4.43: Trinomial parameter joint confidence contours
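A minimal Python sketch of the quadratic form above (numpy and scipy assumed; this illustrates the test, it is not the binomial program) checks whether a trial point lies inside the region:

import numpy as np
from scipy.stats import chi2

def inside_region(x, y, z, px, py, alpha=0.05):
    N = x + y + z
    phx, phy = x / N, y / N
    V = np.array([[phx * (1.0 - phx) / N, -phx * phy / N],
                  [-phx * phy / N, phy * (1.0 - phy) / N]])
    d = np.array([phx - px, phy - py])
    return d @ np.linalg.solve(V, d) <= chi2.ppf(1.0 - alpha, df=2)

print(inside_region(210, 330, 60, 0.5, 0.5))   # False: outside the region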

4.6.15 Random walks


Many experimentalists record movements of bacteria or individual cells in an attempt to quantify the effects of attractants, etc., so it is useful to compare experimental data with simulated data before conclusions about persistence or altered motility are reached. Program rannum can generate such random walks starting from arbitrary initial coordinates and using specified distributions. The probability density functions for the axes can be chosen independently and different techniques can be used to visualize the walk depending on the number of dimensions. Figure 4.44 shows a classical unrestricted walk on an integer grid, that is, the steps can be +1 with probability p and −1 with probability q = 1 − p. It also shows 2- and 3-dimensional walks for standard normal distributions.
[Figure: three panels, "1-Dimensional Random Walk" (position against number of steps), "2-Dimensional Random Walk" (y against x), and "3-Dimensional Random Walk".]
Figure 4.44: Random walks
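A minimal Python sketch of two such walks (numpy assumed; this mimics, but is not, program rannum) is:

import numpy as np

rng = np.random.default_rng(42)
n, p = 50, 0.5
# 1-dimensional walk: steps +1 with probability p, -1 with probability 1 - p
walk1d = np.cumsum(rng.choice([1, -1], size=n, p=[p, 1.0 - p]))
# 2-dimensional walk with standard normal increments on each axis
walk2d = np.cumsum(rng.standard_normal((n, 2)), axis=0)
print(walk1d[-1], walk2d[-1])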


4.6.16 Power as a function of sample size


It is important in the design of experiments to be able to estimate the sample size needed to detect a significant effect. For such calculations you must specify all the parameters of interest except one, then calculate the unknown parameter using numerical techniques. For example, the problem of deciding whether one or more samples differ significantly is a problem in the Analysis of Variance, as long as the samples are all normally distributed and with the same variance. You specify the known variance, σ², the minimum detectable difference between means, Δ, the number of groups, k, the significance level, α, and the sample size per group, n. Then, using nonlinear equations involving the F and noncentral F distributions, the power, 100(1 − β), can be calculated. It can be very confusing trying to understand the relationship between all of these parameters so, in order to obtain an impression of how these factors alter the power, a graphical technique is very useful, as in figure 4.45.

[Figure: "ANOVA (k = no. groups, n = no. per group)"; power (%) against sample size (n) for k = 2, 4, 8, 16, 32, with σ² = 1 (variance) and Δ = 1 (difference).]
Figure 4.45: Power as a function of sample size
simstat was used to create this graph. The variance, significance level, minimum detectable difference and number of groups were fixed, then power was plotted as a function of sample size. The ASCII text coordinate files from several such plots were collected together into a library file to compose the joint plot using simplot. Note that, if a power plot reaches the current power level of interest, the critical power level is plotted (80% in the above plot) and the n values either side of the intersection point are displayed.
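A minimal Python sketch of one power curve follows (scipy assumed); the noncentrality parameter λ = nΔ²/(2σ²) used here is the common least favourable case where only two group means differ by Δ, which is an assumption and should be checked against the exact convention SIMFIT uses.

from scipy.stats import f, ncf

def anova_power(n, k, delta=1.0, sigma=1.0, alpha=0.05):
    df1, df2 = k - 1, k * (n - 1)
    fcrit = f.ppf(1.0 - alpha, df1, df2)
    lam = n * delta**2 / (2.0 * sigma**2)        # assumed noncentrality
    return 100.0 * (1.0 - ncf.cdf(fcrit, df1, df2, lam))

for n in (10, 20, 40, 80):
    print(n, round(anova_power(n, k=2), 1))      # power (%) for k = 2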

4.7 Three dimensional plotting


4.7.1 Surfaces and contours
SIMFIT uses isometric projection, but surfaces can be viewed from any corner, and data can be shown as a surface, contours, surface with contours, or 3-D bar chart, as in figure 4.46.
[Figure: six panels showing a probability contour plot, a wavy surface, a fluted surface, a surface with contours for f(x,y) = x² − y², a contour diagram with a key of contour levels, and a three dimensional bar chart by month and year.]
Figure 4.46: Three dimensional plotting


4.7.2 The objective function at solution points


SIMFIT tries to minimize WSSQ/NDOF, which has expectation unity at solution points, and it is useful to view this as a function of the parameters close to a minimum. Figure 4.47 shows the objective function for equation 2.1 after fitting 1 exponential to the test file exfit.tf2.

[Figure: surfaces of WSSQ/NDOF = f(k, A) and WSSQ/NDOF = f(Vmax, Km), and a contour diagram of WSSQ/NDOF = f(Vmax(1), Km(1)) with a key of fifteen contour levels from 1.336 to 4.775.]
Figure 4.47: The objective function at solution points

The figure also shows details for equation 2.4 after fitting to mmfit.tf2, and for equation 2.6 after fitting a model of order 2 to mmfit.tf4. Such plots are created by qnfit after fitting, by selecting any two parameters and ranges of variation. Information about the eccentricity is also available from the parameter covariance matrix, and the eigenvalues and condition number of the Hessian matrix in internal coordinates. Some contour diagrams show long valleys at solution points, sometimes deformed considerably from ellipses, illustrating the increased difficulty encountered with such ill-conditioned problems.

4.7.3 Sequential sections across best fit surfaces


Figure 4.48 shows the best fit surface after using qnfit to fit an inhibition kinetics model. Such surfaces can also be created for functions of two variables using makdat.

[Figure: two panels titled "Inhibition Kinetics: v = f([S],[I])"; a best fit surface of v over [S] and [I], and sections v([S],[I])/μM.min⁻¹ against [S]/mM for [I] = 0, 0.5mM, 1.0mM, and 2.0mM.]
Figure 4.48: Sequential sections across best fit surfaces

Also, qnfit allows slices to be cut through a best fit surface for fixed values of either variable. Such composite plots show successive sections through a best fit surface, and this is probably the best way to visualize goodness of fit of a surface to functions of two variables.


4.7.4 Plotting contours for Rosenbrock optimization trajectory


Care is sometimes needed to create satisfactory contour diagrams, and it helps both to understand the mathematical properties of the function f(x, y) being plotted, and also to appreciate how SIMFIT creates a default contour diagram from the function values supplied. The algorithm for creating contours first performs a scan of the function values for the minimum and maximum values, then it divides the interval into an arithmetic progression, plots the contours, breaks the contours randomly to add keys, and prints a table of contour values corresponding to the keys. As an example, consider figure 4.49, which plots contours for Rosenbrock's function (page 328)
\[ f(x, y) = 100(y - x^2)^2 + (1 - x)^2 \]
in the vicinity of the unique minimum at (1,1). For smooth contours, 100 divisions were used on each axis, and user-defined proportionately increasing contour spacing was used to capture the shape of the function around the minimum.

[Figure: "Contours for Rosenbrock Optimization Trajectory"; contours over X and Y with a key of ten levels from 1.425 to 7.232 × 10², and the optimization trajectory superimposed.]
Figure 4.49: Contour diagram for Rosenbrock optimization trajectory

If the default arithmetic or geometric progressions are inadequate, as in this case, users can select contour spacing by supplying a vector of proportions, which causes contours to be placed at those proportions of the interval between the minimum and maximum function values. The number of contours can be varied, and the keys and table of contour values can be suppressed. The optimization trajectory and starting and end points were supplied as extra data files, as described on page 328.
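A minimal Python sketch of proportional contour spacing (numpy and matplotlib assumed; the proportions shown are illustrative, not the ones used for figure 4.49) is:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-1.5, 1.5, 100)
X, Y = np.meshgrid(x, x)
F = 100.0 * (Y - X**2)**2 + (1.0 - X)**2
props = np.array([0.002, 0.004, 0.008, 0.016, 0.032,
                  0.064, 0.125, 0.25, 0.5, 1.0])   # increasing proportions
levels = F.min() + props * (F.max() - F.min())
plt.contour(X, Y, F, levels=levels)
plt.plot(1.0, 1.0, "k+")                           # the minimum at (1,1)
plt.xlabel("X"); plt.ylabel("Y"); plt.show()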

4.7.5 Three dimensional space curves


Sets of x, y, z coordinates can be plotted in three dimensional space to represent either an arbitrary scatter of points, a surface, or a connected space curve. Arbitrary points are best plotted as symbols such as circles or triangles, surfaces are usually represented as a mesh of orthogonal space curves, while single space curves can be displayed as symbols or may be connected by lines. For instance, space curves of the form x = x(t), y = y(t), z = z(t) can be plotted by generating x, y, z data for constant increments of t and joining the points together to create a smooth curve, as in figure 4.50.

[Figure: "x(t), y(t), z(t) curve and projection onto y = −1"; a space curve over X and Y with its projection.]

Figure 4.50: Space curves and projections

Such space curves can be generated quite easily by preparing data files with three columns of x, y, z data values, then displaying the data using the space curve option in simplot. However users can also generate space curves from x(t), y(t), z(t) equations, using the option to plot parametric equations in simplot or usermod. The test file helix.mod shows you how to do this for a three dimensional helix. Note how the rear (x, y) axes have been subdued and truncated just short of the origin, to improve the three dimensional effect. Also, projections onto planes are generated by setting the chosen variable to a constant, or by writing model files to generate x, y, z data with chosen coordinates equal to the value for the plane.
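A minimal Python sketch of the same idea (numpy and matplotlib assumed; the helix here is illustrative, not the helix.mod definition) is:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.0, 6.0 * np.pi, 400)     # constant increments of t
x, y, z = np.cos(t), np.sin(t), t          # a three dimensional helix
ax = plt.figure().add_subplot(projection="3d")
ax.plot(x, y, z)                           # the space curve
ax.plot(x, y, np.zeros_like(z))            # projection onto the plane z = 0
plt.show()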


4.7.6 Projecting space curves onto planes


Sometimes it is useful to project space curves onto planes for purposes of illustration. Figure 4.51 shows a simulation using usermod with the model file twister.mod. The parametric equations are x = t cos t, y = t sin t, z = t², and projections are created by fixing one of the variables to a constant value.

[Figure: "Twister Curve with Projections onto Planes"; z(t) plotted over x(t) and y(t), with projections onto planes.]

Figure 4.51: Projecting space curves onto planes

Note the following about the model file twister.mod.

• There are 3 curves so there are 9 functions of 1 variable.

• The value of x supplied is used as the parameter t.

• Functions f(1), f(4), f(7) are the x(t) profiles.

• Functions f(2), f(5), f(8) are the y(t) profiles.

• Functions f(3), f(6), f(9) are the z(t) profiles.

Also observe that the model parameters fix the values of the projection planes just outside the data range, at p(1) = −20, p(2) = 20.

4.7.7 Three dimensional scatter diagrams


Often it is necessary to plot sets of x, y, z coordinates in three dimensional space where the coordinates are arbitrary and are not functions of a parameter t. This is the case when it is wished to illustrate scattering by using different symbols for subsets of data that form clusters according to some distance criteria. For this type of plotting, the sets of x, y, z triples, say principal components, are collected together as sets of three column matrices, preferably referenced by a library file, and a default graph is first created. The usual aim would be to create a graph looking something like figure 4.52.

[Figure: "Three Dimensional Scatter Plot"; two clusters, Type A and Type B, plotted in X, Y, Z space with perpendiculars dropped to the base.]
Figure 4.52: Three dimensional scatter plot

In this graph, the front axes have been removed for clarity, a subdued grid has been displayed on the vertical axes, but not on the base, and perpendiculars have been dropped from the plotting symbols to the base of the plot, in order to assist in the identification of clusters. Note that plotting symbols, minus signs in this case, have been added to the foot of the perpendiculars to assist in visualizing the clustering. Also, note that distinct data sets, requiring individual plotting symbols, are identified by a simple rule: data values in each data file are regarded as representing the same cluster, i.e. each cluster must be in a separate file.


4.7.8 Two dimensional families of curves


Users may need to plot families of curves indexed by parameters. For instance, diffusion of a substance from an instantaneous plane source is described by the equation
\[ f(x) = \frac{1}{2\sqrt{\pi D t}} \exp\left( -\frac{x^2}{4 D t} \right) \]
which is, of course, a normal distribution with μ = 0 and σ² = 2Dt, where D is the diffusion constant and t is time, so that 2Dt is the mean square distance diffused by molecules in time t. Now it is easy to plot the concentration f(x) predicted by this equation as a function of distance x and time t given a diffusion constant D, by simulating the equation using makdat, saving the curves to a library file or project archive, then plotting the collected curves. However, there is a much better way using program usermod, which has the important advantage that families of curves indexed by parameters can be plotted interactively. This is a more powerful technique which provides numerous advantages and convenient options when simulating systems to observe the behavior of the profiles as the indexing parameters vary. Figure 4.53 shows the above equation plotted (in arbitrary units) using the model parameters $p_i = 2Dt_i$, for i = 1, 2, 3, 4, to display the diffusion profiles as a function of time. The plot was created using the model file family2d.mod, which simply defines four identical equations corresponding to the diffusion equation but with four different parameters $p_i$. Program usermod was then used to read in the model, simulate it for the parameter values indicated, then plot the curves simultaneously.

[Figure: "Diffusion From a Plane Source"; concentration against distance for the parameter values 0.25, 0.5, 0.75, and 1.0.]
Figure 4.53: Two dimensional families of curves
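A minimal Python sketch of this family (numpy and matplotlib assumed; it evaluates the equation directly rather than using family2d.mod) is:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3.0, 3.0, 300)
for p in (0.25, 0.5, 0.75, 1.0):           # p plays the role of 2*D*t
    fx = np.exp(-x**2 / (2.0 * p)) / np.sqrt(2.0 * np.pi * p)
    plt.plot(x, fx, label="2Dt = %g" % p)
plt.xlabel("Distance"); plt.ylabel("Concentration")
plt.legend(); plt.show()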

4.7.9 Three dimensional families of curves


Users may need to plot families of curves indexed by parameters in three dimensions. To show how this is done, the diffusion equation dealt with previously (page 267) is reformulated, using $y = \sqrt{2Dt}$, as
\[ z(x, y) = \frac{1}{y\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left( \frac{x}{y} \right)^2 \right\} \]
and is plotted in figure 4.54 for the same parameter values used before, but now as sections through the surface of a function of two variables.

[Figure: "Diffusion From a Plane Source"; the same profiles drawn as sections through a surface.]
Figure 4.54: Three dimensional families of curves

This is, of course, a case of a family of parametric space curves projected onto the fixed values of y. Now the model file family3d.mod was used by program usermod to create this figure, using the option to plot n sets of parametric space curves, but you should observe a number of important facts about this model file before attempting to plot your own families of space curves.

• There are 4 curves so there are 12 functions of 1 variable.

• Functions f(1), f(4), f(7), f(10) are the parameter t, i.e. x.

• Functions f(2), f(5), f(8), f(11) are the y values, i.e. √(2Dt).

• Functions f(3), f(6), f(9), f(12) are the z values, i.e. the concentration profiles.

Finally, it is clear that n space curves require a model file that specifies 3n equations, but you should also realize that space curves cannot be plotted if there is insufficient variation in any of the independent variables, e.g. if all y = k, for some fixed parameter k.


4.8 Differential equations


4.8.1 Phase portraits of plane autonomous systems
When studying plane autonomous systems of differential equations it is useful to be able to generate phase portraits. Consider, for instance, a simplified version of the Lotka-Volterra predator-prey equations given by
\[ \frac{dy(1)}{dx} = y(1)(1 - y(2)) \]
\[ \frac{dy(2)}{dx} = y(2)(y(1) - 1) \]
which clearly has singular points at (0,0) and (1,1). Figure 4.55 was generated by simulating the system using deqsol, then reading the vector field ASCII coordinate file vfield.tf1 into simplot and requesting a vector field type of plot. Using deqsol you can choose the equations, the range of independent variables, the number of grid points, and the precision required to identify singularities. At each grid point the arrow direction is defined by the right hand sides of the defining equations and the singular points are emphasized by an automatic change of plotting symbol. Note that the arrows will not, in general, be on related orbits.

[Figure: "Phase Portrait for the Lotka-Volterra System"; a vector field of y(1) against y(2) with the singular points marked.]
Figure 4.55: Phase portraits of plane autonomous systems
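A minimal Python sketch of such a vector field (numpy and matplotlib assumed; not deqsol output) is:

import numpy as np
import matplotlib.pyplot as plt

y2, y1 = np.meshgrid(np.linspace(-1.0, 2.0, 16), np.linspace(-1.0, 2.0, 16))
dy1 = y1 * (1.0 - y2)                   # right hand sides define the arrows
dy2 = y2 * (y1 - 1.0)
plt.quiver(y2, y1, dy2, dy1)
plt.plot([0.0, 1.0], [0.0, 1.0], "ro")  # singular points at (0,0) and (1,1)
plt.xlabel("y(2)"); plt.ylabel("y(1)"); plt.show()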

4.8.2 Orbits of differential equations


To obtain orbits where the y(i) are parameterized by time, rather than in a time independent phase portrait as in figure 4.55, trajectories have to be integrated and collected together. For instance the simple system
\[ \frac{dy(1)}{dx} = y(2) \]
\[ \frac{dy(2)}{dx} = -(y(1) + y(2)) \]
was integrated by deqsol for the initial conditions illustrated, then the orbits, collected together as ASCII coordinate files in the library file orbits.tfl, were used to create the following orbit diagram using program simplot. The way to create such orbit diagrams is to integrate the selected equations repeatedly for different initial conditions and then to store the required orbits, which is a facility available in deqsol. Clearly orbits generated in this way can also be plotted as an overlay on a phase portrait in order to emphasize particular trajectories. All that is required is to create a library file with the portrait and orbits together and choose the vector field option in program simplot. With this option, files with four columns are interpreted as arrow diagrams, while files with two columns are interpreted in the usual way as coordinates to be joined up to form a continuous curve.

[Figure: "Orbits for a System of Differential Equations"; y(1) against y(2) orbits for several initial conditions.]
Figure 4.56: Orbits of differential equations


4.9 Specialized techniques


4.9.1 Deconvolution 1: Graphical deconvolution of complex models
Figure 4.57 shows the graphical deconvolution of the best fit curve from qnfit into its three component Gaussian pdfs after fitting the test file gauss3.tf1.

[Figure: two panels, "Deconvolution of 3 Gaussians" (y against x with component curves) and "Deconvolution of Exponentials".]
Figure 4.57: Deconvolution 1: Graphical deconvolution of complex models

Graphical deconvolution, which displays graphs for the individual components making up a composite function defined as the sum of these components, should always be done after fitting sums of monomials, Michaelis-Mentens, High/Low affinity sites, exponentials, logistics or Gaussians, to assess the contribution of the individual components to the overall fit, before accepting statistical evidence for improved fit. Many claims for three exponentials or Michaelis-Mentens would not have been made if this simple graphical technique had been used.

4.9.2 Deconvolution 2: Fitting convolution integrals


Fitting convolution integrals (page 311) involves parameter estimation in f(t), g(t), and (f*g)(t), where
\[ (f*g)(t) = \int_0^t f(u)\, g(t - u)\, du, \]
and such integrals occur as output functions from the response of a device to an input function. Sometimes the input function can be controlled independently so that, from sampling the output function, parameters of the response function can be estimated, and frequently the functions may be normalized, e.g. the response function may be modelled as a function integrating to unity as a result of a unit impulse at zero time. However, any one, two or even all three of the functions may have to be fitted. Figure 4.58 shows the graphical display following the fitting of a convolution integral using qnfit where, to demonstrate the procedure to be followed for maximum generality, replicates of the output function at unequally spaced time points have been assumed.

[Figure: "Fitting a Convolution Integral f*g"; f(t) = exp(−αt), g(t) = β²t exp(−βt), and f*g plotted against time t.]

Figure 4.58: Deconvolution 2: Fitting convolution integrals

The model is convolv3.mod, and the data file is convolv3.tfl, which just specifies replicates for the output function resulting from
\[ f(t) = \exp(-\alpha t), \qquad g(t) = \beta^2 t \exp(-\beta t). \]
Note how missing data for f(t) and g(t) are indicated by percentage symbols in the library file so, in this case, the model convolve.mod could have been fitted as a single function. However, by fitting as a function of three variables but with data for only one function, a visual display of all components of the convolution integral evaluated at the best fit parameters can be achieved.
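A minimal Python sketch of the integral itself, under the assumed illustrative values α = 1 and β = 2 (scipy assumed; qnfit estimates such parameters, it does not use this code), is:

import numpy as np
from scipy.integrate import quad

def f(t, alpha=1.0):
    return np.exp(-alpha * t)

def g(t, beta=2.0):
    return beta**2 * t * np.exp(-beta * t)   # integrates to unity

def f_conv_g(t):
    value, _ = quad(lambda u: f(u) * g(t - u), 0.0, t)
    return value

for t in (0.5, 1.0, 2.0, 4.0):
    print(t, f_conv_g(t))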


4.9.3 Extrapolation
Best fit nonlinear curves from qnfit can be extrapolated beyond data limits but, for purposes of illustration, extensive extrapolation of best fit models is sometimes required. For instance, fitting the mixed, or noncompetitive inhibition model
\[ v(S, I) = \frac{VS}{K(1 + I/K_{is}) + (1 + I/K_{ii})S} \]
as a function of two variables is straightforward but, before the computer age, people used to fit this sort of model using linear regression to the double reciprocal form
\[ \frac{1}{v} = \frac{1}{V}\left(1 + \frac{I}{K_{ii}}\right) + \frac{K}{V}\left(1 + \frac{I}{K_{is}}\right)\frac{1}{S} \]
which is still sometimes used to demonstrate the intersection point of the best fit lines. Figure 4.59 was obtained by fitting the mixed model to inhibit.tf1, followed by plotting sections through the best fit surface at fixed inhibitor concentration and saving these as ASCII text files.

[Figure: "Double Reciprocal Plot For Inhibition Data"; 1/v against 1/S for I = 0, 1, 2, 3, 4, with best fit lines extrapolated beyond the data to their common intersection point.]

Figure 4.59: Extrapolation

These files (referenced by the library file inhibit.tfl) were plotted in double reciprocal space using simplot, and figure 4.59 was created by overlaying extra lines over each best fit line and extending these beyond the fixed point at
\[ \frac{1}{S} = -\frac{1}{K}\left(\frac{K_{is}}{K_{ii}}\right), \qquad \frac{1}{v} = \frac{1}{V}\left(1 - \frac{K_{is}}{K_{ii}}\right). \]
The best fit lines restricted to the data range were then suppressed, and the other cosmetic changes were implemented. It should be pointed out that, for obvious mathematical reasons, extrapolation of best fit curves for this model in transformed space cannot be generated by simply requesting an extended range for a best fit curve in the original space. Note that, to avoid the extrapolated best fit lines passing through the plotting symbols, the option to plot extra lines in the background, i.e., before the data, was selected.

4.9.4 Segmented models with cross-over points


Often segmented models are required with cross-over points, where the model equation swaps over at one or more values of the independent variable. In figure 4.60, for instance, data are simulated then fitted using the model updown.mod, showing how the three way get command get3(.,.,.) can be used to swap over from one model to another at a fixed critical point.

[Figure: "Up-Down Normal/Normal-Complement Model"; f(x) against x rising then falling, with the cross-over point marked at x = 6.]
Figure 4.60: Models with cross over points

The model defined by updown.mod is
\[ f(x) = \Phi((x - p_1)/p_2) \quad \text{for } x \le 6 \]
\[ \phantom{f(x)} = 1 - \Phi((x - p_3)/p_4) \quad \text{otherwise} \]

where Φ(.) is the cumulative normal distribution, and this is the relevant swap-over code.

x
6
subtract
get3(1,1,2)
f(1)

The get3(1,1,2) command pops the x − 6 off the stack and uses get(1) or get(2) depending on the magnitude of x, since a get3(i,j,k) command simply pops the top value off the stack and then uses get(i) if this is negative, get(j) if this is zero (to machine precision), or get(k) otherwise. The cross-over point can also be fixed using an extra parameter that is then estimated, but this can easily lead to ill-determined parameters and a rank deficient covariance matrix if the objective function is insensitive to small variations in the extra parameter.
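A minimal Python sketch of the same swap-over logic (numpy and scipy assumed; this is a translation for illustration, not SIMFIT model code) is:

import numpy as np
from scipy.stats import norm

def updown(x, p1, p2, p3, p4):
    # branch swaps at the fixed cross-over point x = 6
    return np.where(x <= 6.0,
                    norm.cdf((x - p1) / p2),        # rising branch
                    1.0 - norm.cdf((x - p3) / p4))  # falling branch

print(updown(np.linspace(0.0, 12.0, 7), 2.0, 1.0, 10.0, 1.0))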


4.9.5 Plotting single impulse functions


Plotting single impulse functions, as in figure 4.61, sometimes requires care due to the discontinuities.

[Figure: "Impulse Functions"; f(x) against x with vertically stacked plots of the Gauss, spike, impulse, Kronecker delta, and Heaviside functions.]
Figure 4.61: Plotting single impulse functions

These graphs were created using program usermod together with the model file impulse.mod, which defines the five impulse functions of one variable described on page 297, and uses a = p(1) > 0 to fix the location, and b = p(2) > 0 to set the pulse width where necessary.

• The Heaviside unit function h(x − a). A pdf or survival curve stepped line type is required in order to plot the abrupt step at x = a.

• The Kronecker delta symbol δij. The x-coordinate data were edited interactively in program usermod in order to plot the vertical signal when i = j as a distinct spike. After editing there was one x value at precisely x = a, where the function value is one, and one at a short distance either side, where the function values are zero.

• The square impulse function of unit area. Again a stepped line type is necessary to plot the abrupt increase and decrease of this discontinuous function and it should be noted that, by decreasing the pulse width, the Dirac delta function can be simulated.

• The triangular spike function of unit area is straightforward to simulate and plot as long as the three x-coordinates for the corners of the triangles are present.

• The Gauss function of unit area is easy to plot.

Note that the model file impulse.mod uses scaling factors and additive constants so that all five functions can be displayed in a convenient vertically stacked format.

4.9.6 Plotting periodic impulse functions


Plotting periodic impulse functions, as in figure 4.62, sometimes requires care due to the discontinuities.

[Figure: "Periodic Impulse Functions"; f(x) against x with vertically stacked plots of the unit impulse, half and rectified sine waves, saw tooth, Morse dot, rectified triangle, and square wave.]
Figure 4.62: Plotting periodic impulse functions

These graphs were created using program usermod together with the model file periodic.mod, which defines the seven impulse functions of one variable described on page 297, and uses a = p(1) > 0 to fix the period and b = p(2) > 0 to set the width where required.

• The square wave function oscillates between plus and minus one, so a pdf or survival curve stepped line type is required in order to plot the abrupt step at x = λa, for positive integer λ.

• The rectified triangular wave plots perfectly as long as the x-coordinate data are edited interactively in program usermod to include the integer multiples of a.

• The Morse dot is just the positive part of the square wave, so it also requires a stepped line type.

• The sawtooth function is best plotted by editing the x-coordinate data to include a point immediately either side of the multiples of a.

• The rectified sine wave and half-wave merely require sufficient points to create a smooth curve.

• The unit impulse function requires a second parameter to define the width b, and this is best plotted using a stepped line type.

Note that the model file periodic.mod uses scaling factors and additive constants so that all seven functions can be displayed in a convenient vertically stacked format.


4.9.7 Flow cytometry


csafit should be used with simulated data from makcsa to become familiar with the concepts involved before analyzing actual data, as this is a very specialized analytical procedure. Figure 4.63 demonstrates typical flow cytometry data fitting.

[Figure: "Using CSAFIT for Flow Cytometry Data Smoothing"; number of cells against channel number.]

Figure 4.63: Flow cytometry

4.9.8 Subsidiary figures as insets


This is easily achieved using editps with the individual PostScript files, as shown in figure 4.64.
[Figure: a main plot of f(t) against t fitted by one and two exponentials, with an inset of log10 f(t) against t.]
Figure 4.64: Subsidiary figures as insets

4.9.9 Nonlinear growth curves


Figure 4.65 illustrates the use of male and female plotting symbols to distinguish experimental data, which are also very useful when plotting correlation data for males and females.
[Figure: "Using GCFIT to fit Growth Curves"; percentage of average final size against time (weeks), with male and female symbols.]

Figure 4.65: Growth curves

4.9.10 Ligand binding species fractions


Species fractions (page 195) are very useful when analyzing cooperative ligand binding data, as in figure 4.66. They can be generated from the best fit binding polynomial after fitting binding curves with program sffit, or by input of binding constants into program simstat. At the same time other important analytical results like factors of the Hessian and the minimax Hill slope are also calculated.
[Figure: "Using SIMSTAT to Plot Species Fractions for f(x) = 1 + x + 2x² + 0.5x³ + 8x⁴"; species fractions 0 to 4 against x.]
Figure 4.66: Ligand binding species fractions


4.9.11 Immunoassay and dose-response dilution curves


Antibodies are used in bioassays in concentrations known up to arbitrary multiplicative factors, and dose response curves are constructed by dilution technique, usually 1 in 2, 1 in 3, 1 in 10 or similar. By convention, plots are labelled in dilutions, or powers of the dilution factor and, with this technique, affinities can only be determined up to the unknown factor. Figure 4.67 was constructed using makfil in dilution mode with dilutions 1, 2, 4, 8, 16 and 32 to create a data file with concentrations 1/32, 1/16, 1/8, 1/4, 1/2, 1. hlfit fitted response as a function of concentration and a dilution curve was plotted.
[Figure: three panels. "Doubling Dilution Assay", percentage of maximum response against proportion of maximum concentration; "Doubling Dilution Curve", percentage of maximum response against dilution factor from 1/1 to 1/64; and a second "Doubling Dilution Curve" with the dilution factor labelled in powers 2⁻¹ to 2⁻⁶.]
Figure 4.67: Immunoassay and dose-response dilution curves

The transformation is equivalent to plotting log of reciprocal concentration (in arbitrary units) but this is not usually appreciated. SIMPLOT can plot log(1/x) to bases 2, 3, 4, 5, 6, 7, 8 and 9, as well as e and ten, allowing users to plot trebling, quadrupling dilutions, etc. To emphasize this, intermediate gradations can be added and labelling can be in powers of the base, as shown in the third panel.

4.10 Parametric curves


Figure 4.50 and figure 4.53 are examples for parametric curves of the form x(t), y(t), while figure 4.51 and figure 4.54 are for x(t), y(t), z(t). Examples for r = r(θ) follow.

4.10.1 r = r(θ) parametric plot 1: Eight leaved rose


Figure 4.68, for example, was generated using the SIMFIT model file rose.mod from usermod to define an eight leaved rose in r = r(θ) form using the following code.

%
Example: Eight leaved rose
r = A*sin(4*theta): where theta = x, r = f(1) and A = p(1)
%
1 equation
1 variable
1 parameter
%
x
4
multiply
sin
p(1)
multiply
f(1)
%

[Figure: "Rhodoneae of Abbe Grandi, r = sin(4θ)"; y against x, an eight leaved rose.]
Figure 4.68: r = r(θ) parametric plot 1. Eight leaved Rose


4.10.2 r = r(θ) parametric plot 2: Logarithmic spiral with tangent


Figure 4.69 illustrates the logarithmic spiral r(θ) = A exp(θ cot α), defined in SIMFIT model file camalot.mod for A = 1, p(1) = α, x = θ, r = f(1) as follows.

1
p(1)
tan
divide
x
multiply
exp
f(1)

[Figure: "Logarithmic Spiral and Tangent"; r(θ) and its tangent plotted in the plane.]

Figure 4.69: r = r(θ) parametric plot 2. Logarithmic Spiral with Tangent

This profile is used in camming devices such as Camalots and Friends to maintain a constant angle between the radius vector for the spiral and the tangent to the curve, defined in tangent.mod as
\[ r = \frac{A \exp(\theta_0 \cot \alpha) [\sin \theta_0 - \tan(\theta_0 + \alpha) \cos \theta_0]}{\sin \theta - \tan(\theta_0 + \alpha) \cos \theta}. \]

Figure 4.69 used α = p(1) = 1.4, θ₀ = p(2) = 6 and usermod to generate individual figures over the range 0 ≤ x = θ ≤ 10, then simplot plotted the ASCII text coordinates simultaneously, a technique that can be used to overlay any number of curves.

Appendix A

Distributions and special functions


Techniques for calling these functions from within user defined models are discussed starting on page 299.

A.1 Discrete distribution functions


A discrete random variable X can have one of n possible values x₁, x₂, ..., xₙ and has a mass function $f_X \ge 0$ and cumulative distribution function $0 \le F_X \le 1$ that define probability, expectation, and variance by
\[ P(X = x_j) = f_X(x_j), \text{ for } j = 1, 2, \ldots, n \]
\[ \phantom{P(X = x_j)} = 0 \text{ otherwise} \]
\[ P(X \le x_j) = \sum_{i=1}^{j} f_X(x_i) \]
\[ 1 = \sum_{i=1}^{n} f_X(x_i) \]
\[ E(g(X)) = \sum_{i=1}^{n} g(x_i) f_X(x_i) \]
\[ E(X) = \sum_{i=1}^{n} x_i f_X(x_i) \]
\[ V(X) = \sum_{i=1}^{n} (x_i - E(X))^2 f_X(x_i) = E(X^2) - E(X)^2. \]

A.1.1 Bernoulli distribution


A Bernoulli trial has only two possible outcomes, X = 1 or X = 0, with probabilities p and q = 1 − p.
\[ P(X = k) = p^k q^{1-k} \text{ for } k = 0 \text{ or } k = 1 \]
\[ E(X) = p \]
\[ V(X) = pq \]

A.1.2 Binomial distribution


This models the case of n independent Bernoulli trials with probability of success (i.e. Xi = 1) equal to p and failure (i.e. Xi = 0) equal to q = 1 − p. The random binomial variable $S_n$ is defined as the sum of the n values of $X_i$ without regard to order, i.e. the number of successes in n trials.
\[ S_n = \sum_{i=1}^{n} X_i \]
\[ P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}, \text{ for } k = 0, 1, 2, \ldots, n \]
\[ E(S_n) = np \]
\[ V(S_n) = np(1-p) \]
The run test, sign test, analysis of proportions, and many methods for analyzing experiments with only two possible outcomes are based on the binomial distribution.

A.1.3 Multinomial distribution


Extending the binomial distribution to k possible outcomes of frequency $f_i$ in a sample of size n is described by
\[ P(X_1 = f_1, X_2 = f_2, \ldots, X_k = f_k) = \frac{n!}{f_1! f_2! \cdots f_k!} p_1^{f_1} p_2^{f_2} \cdots p_k^{f_k} \]
where $f_1 + f_2 + \cdots + f_k = n$ and $p_1 + p_2 + \cdots + p_k = 1$.

An example would be the trinomial distribution, which is used to analyse the outcome of incubating a clutch of eggs; they can hatch male, or female, or fail to hatch.

A.1.4 Geometric distribution


This is the distribution of the number of failures prior to the first success, where
\[ P(X = k) = p q^k \]
\[ E(X) = q/p \]
\[ V(X) = q/p^2. \]

A.1.5 Negative binomial distribution


The probability of k failures prior to the rth success is the random variable $S_r$, where
\[ P(S_r = k) = \binom{r+k-1}{k} p^r q^k \]
\[ E(S_r) = rq/p \]
\[ V(S_r) = rq/p^2. \]

A.1.6 Hypergeometric distribution


This models sampling without replacement, where n objects are selected from N objects, consisting of M ≤ N of one kind and N − M of another kind, and defines the random variable $S_n$ as
\[ S_n = X_1 + X_2 + \cdots + X_n \]
where $X_i = 1$ for success with $P(X_i = 1) = M/N$, and $X_i = 0$ for failure.
\[ P(S_n = k) = \binom{M}{k} \binom{N-M}{n-k} \bigg/ \binom{N}{n}, \text{ where } \binom{a}{b} = 0 \text{ when } b > a > 0 \]
\[ E(S_n) = nM/N \]
\[ V(S_n) = npq(N-n)/(N-1) \]
Note that when N ≫ n this reduces to the binomial distribution with p = M/N.


A.1.7 Poisson distribution


This is the limiting form of the binomial distribution for large n and small p but finite np = λ > 0.
$$P(X = k) = \frac{\lambda^k \exp(-\lambda)}{k!}, \text{ for } k = 0, 1, 2, \ldots$$
$$E(X) = \lambda \qquad V(X) = \lambda$$
The limiting result, for fixed np > 0, that
$$\lim_{n\to\infty} \binom{n}{k} p^k (1-p)^{n-k} = \frac{(np)^k \exp(-np)}{k!}$$
can be used to support the hypothesis that counting is a Poisson process, as in the distribution of bacteria in a sample, so that the variance is equal to the mean. The Poisson distribution also arises from Poisson processes, like radioactive decay, where the probability of k events occurring at a rate λ per unit time is
$$P(k \text{ events in } (0, t)) = \frac{(\lambda t)^k \exp(-\lambda t)}{k!}.$$

The Poisson distribution has the additive property that, given n independent Poisson variables X_i with parameters λ_i, the sum Y = Σ_{i=1}^{n} X_i has a Poisson distribution with parameter λ_y = Σ_{i=1}^{n} λ_i.
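As an illustrative numerical check of the limiting result (added here, not in the original text), with n = 100 and p = 0.02, so that np = 2, the exact binomial and limiting Poisson probabilities of k = 3 events are
$$\binom{100}{3}(0.02)^3(0.98)^{97} \approx 0.182 \qquad\text{and}\qquad \frac{2^3 e^{-2}}{3!} \approx 0.180,$$
already in good agreement.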

A.2 Continuous distributions


A continuous random variable X is defined over some range by a probability density function f_X ≥ 0 and cumulative distribution function 0 ≤ F_X ≤ 1 that define probability, expectation, and variance by
$$\begin{aligned}
F_X(x) &= \int_{-\infty}^{x} f_X(t)\,dt\\
P(A \le x \le B) &= F_X(B) - F_X(A)\\
&= \int_{A}^{B} f_X(t)\,dt\\
1 &= \int_{-\infty}^{\infty} f_X(t)\,dt\\
E(g(X)) &= \int_{-\infty}^{\infty} g(t) f_X(t)\,dt\\
E(X) &= \int_{-\infty}^{\infty} t\,f_X(t)\,dt\\
V(X) &= \int_{-\infty}^{\infty} (t - E(X))^2 f_X(t)\,dt.
\end{aligned}$$

In the context of survival analysis, the random survival time X ≥ 0, with density f(x), cumulative distribution function F(x), survivor function S(x), hazard function h(x), and integrated hazard function H(x) are defined by
$$\begin{aligned}
S(x) &= 1 - F(x)\\
h(x) &= f(x)/S(x)\\
H(x) &= \int_0^x h(u)\,du\\
f(x) &= h(x)\exp\{-H(x)\}.
\end{aligned}$$


A.2.1 Uniform distribution


This assumes that every value is equally likely for A ≤ X ≤ B, so that
$$\begin{aligned}
f_X(x) &= 1/(B-A)\\
E(X) &= (A+B)/2\\
V(X) &= (B-A)^2/12.
\end{aligned}$$

A.2.2 Normal (or Gaussian) distribution


This has mean μ and variance σ² and, for convenience, X is often standardized to Z, so that if X ∼ N(μ, σ²), then Z = (x − μ)/σ ∼ N(0, 1).
$$\begin{aligned}
f_X(x) &= \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\\
E(X) &= \mu\\
V(X) &= \sigma^2\\
\Phi(z) &= F_X(z)\\
&= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z}\exp(-t^2/2)\,dt.
\end{aligned}$$
It is widely used in statistical modelling, e.g., the assumption of normally distributed dosage tolerance leads to a probit regression model for the relationship between the probability of death and dose. There are several important results concerning the normal distribution which are heavily used in hypothesis testing.
A.2.2.1 Example 1. Sums of normal variables

Given n independent random variables X_i ∼ N(μ_i, σ_i²), then the linear combination Y = Σ_{i=1}^{n} a_i X_i is normally distributed with parameters μ_y = Σ_{i=1}^{n} a_i μ_i and σ_y² = Σ_{i=1}^{n} a_i² σ_i².

A.2.2.2 Example 2. Convergence of a binomial to a normal distribution

If S_n is the sum of n Bernoulli variables that can be 1 with probability p, and 0 with probability 1 − p, then S_n is binomially distributed and, by the central limit theorem, it is asymptotically normal in the sense that
$$\lim_{n\to\infty} P\left(\frac{S_n - np}{\sqrt{np(1-p)}} \le z\right) = \Phi(z).$$

The argument that experimental error is the sum of many errors that are equally likely to be positive or negative can be used, along with the above result, to support the view that experimental error is often approximately normally distributed.
A.2.2.3 Example 3. Distribution of a normal sample mean and variance

If X ∼ N(μ, σ²) and from a sample of size n the sample mean
$$\bar{x} = \sum_{i=1}^{n} x_i / n$$
and the sample variance
$$S^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n$$
are calculated, then
(a) X̄ ∼ N(μ, σ²/n);
(b) nS²/σ² ∼ χ²(n − 1), E(S²) = (n − 1)σ²/n, V(S²) = 2(n − 1)σ⁴/n²; and
(c) X̄ and S² are stochastically independent.


A.2.2.4 Example 4. The central limit theorem

If independent random variables X_i have mean μ and variance σ² from some distribution, then the sum S_n = Σ_{i=1}^{n} X_i, suitably normalized, is asymptotically normal, that is
$$\lim_{n\to\infty} P\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le z\right) = \Phi(z),$$
or
$$P(X_1 + X_2 + \cdots + X_n \le y) \approx \Phi\left(\frac{y - n\mu}{\sigma\sqrt{n}}\right).$$
Under appropriate restrictions, even the need for identical distributions can be relaxed.

A.2.3 Lognormal distribution


This is frequently used to model unimodal distributions that are skewed to the right, e.g., plasma concentrations which cannot be negative, that is, where the logarithm is presumed to be normally distributed so that, for X = exp(Y) where Y ∼ N(μ, σ²), then
$$\begin{aligned}
f_X(x) &= \frac{1}{x\sigma\sqrt{2\pi}}\exp\left(-\frac{(\log(x)-\mu)^2}{2\sigma^2}\right)\\
E(X) &= \exp(\mu + \sigma^2/2)\\
V(X) &= (\exp(\sigma^2) - 1)\exp(2\mu + \sigma^2).
\end{aligned}$$

A.2.4 Bivariate normal distribution


If variables X and Y are jointly distributed according to a bivariate normal distribution the density function is
$$f_{X,Y} = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left(-\frac{1}{2}Q^2\right)$$
where
$$Q^2 = \frac{1}{1-\rho^2}\left(\frac{(x-\mu_X)^2}{\sigma_X^2} - 2\rho\frac{(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}\right)$$
with σ_X² > 0, σ_Y² > 0, and −1 < ρ < 1. Here the marginal density for X is normal with mean μ_X and variance σ_X², the marginal density for Y is normal with mean μ_Y and variance σ_Y², and when the correlation ρ is zero, X and Y are independent. At fixed probability levels, the quadratic form Q² defines an ellipse in the X, Y plane which will have axes parallel to the X, Y axes if ρ = 0, but with rotated axes otherwise.

A.2.5 Multivariate normal distribution


If an m dimensional random vector X has a N(μ, Σ) distribution, the density is
$$f_X(x) = \frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}.$$
Contours of equi-probability are defined by f(x) = k for some k > 0 as a hyper-ellipsoid in m dimensional space, and the density has the properties that any subsets of X or linear transformations of X are also multivariate normal. Many techniques, e.g., MANOVA, assume this distribution.


A.2.6 t distribution
The t distribution arises naturally as the distribution of the ratio of a normalized normal variate Z divided by the square root of a chi-square variable χ²(ν) divided by its degrees of freedom ν.
$$t(\nu) = \frac{Z}{\sqrt{\chi^2(\nu)/\nu}}, \text{ or, setting } X = t(\nu),$$
$$f_X(x) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\nu\pi}}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$$
$$E(X) = 0 \qquad V(X) = \nu/(\nu - 2) \text{ for } \nu > 2.$$

The use of the t test for testing for equality of means with two normal samples X_1 and X_2 (page 91), with sizes n_1 and n_2 and the same variance, uses the fact that the sample means are normally distributed, while the sample variances are chi-square distributed, so that under H_0,
$$\begin{aligned}
Z &= \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{1/n_1 + 1/n_2}}\\
U &= \frac{n_1 s_1^2 + n_2 s_2^2}{\sigma^2(n_1 + n_2 - 2)}\\
T &= Z/\sqrt{U}\\
T &\sim t(n_1 + n_2 - 2)\\
T^2 &\sim F(1, n_1 + n_2 - 2).
\end{aligned}$$
For the case of unequal variances the Welch approximation is used, where the above test statistic T and degrees of freedom, calculated using a pooled variance estimate, are replaced by
$$\begin{aligned}
T &= \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}\\
\nu &= \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1)}.
\end{aligned}$$

The paired t test (page 93) uses the differences d_i = x_i − y_i between correlated variables X and Y and only assumes that the differences are normally distributed, so that the test statistic for the null hypothesis is
$$\begin{aligned}
\bar{d} &= \sum_{i=1}^{n} d_i / n\\
s_d^2 &= \sum_{i=1}^{n} (d_i - \bar{d})^2 / (n-1)\\
T &= \bar{d}\Big/\sqrt{s_d^2/n}.
\end{aligned}$$

A.2.7 Cauchy distribution


This is the distribution of the ratio of two normal variables. For instance, if X_1 ∼ N(0, 1) and X_2 ∼ N(0, 1), then the ratio X = X_1/X_2 has a Cauchy distribution, where E(X) and V(X) are not defined, with
$$f_X = \frac{1}{\pi(1 + x^2)}.$$
This is a better model for experimental error than the normal distribution as the tails are larger than with a normal distribution. However, because of the large tails, the mean and variance are not defined, as with the t distribution with ν = 1, which reduces to a Cauchy distribution.


A.2.8 Chi-square distribution


The χ² distribution with ν degrees of freedom results from adding together the squares of ν independent Z variables.
$$\chi^2(\nu) = \sum_{i=1}^{\nu} z_i^2 \text{ or, setting } X = \chi^2(\nu),$$
$$f_X(x) = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}\,x^{\nu/2-1}\exp(-x/2)$$
$$E(X) = \nu \qquad V(X) = 2\nu.$$
It is the distribution of the sample variance from a normal distribution, and is widely used in goodness of fit testing since, if n frequencies E_i are expected and n frequencies O_i are observed, then

$$\lim_{n\to\infty}\sum_{i=1}^{n}\frac{(O_i - E_i)^2}{E_i} = \chi^2(\nu).$$
Here the degrees of freedom ν is just n − 1 minus the number of extra parameters estimated from the data to define the expected frequencies. Cochran's theorem is another result of considerable importance in several areas of data analysis, e.g., the analysis of variance, and this considers the situation where Z_1, Z_2, ..., Z_n are independent standard normal variables that can be written in the form
$$\sum_{i=1}^{n} Z_i^2 = Q_1 + Q_2 + \cdots + Q_k$$
where each Q_i is a sum of squares of linear combinations of the Z_i. If the rank of each Q_i is r_i and
$$n = r_1 + r_2 + \cdots + r_k,$$
then the Q_i have independent chi-square distributions, each with r_i degrees of freedom.

A.2.9 F distribution
The F distribution arises when a chi-square variable with ν₁ degrees of freedom (divided by ν₁) is divided by another independent chi-square variable with ν₂ degrees of freedom (divided by ν₂).
$$F(\nu_1, \nu_2) = \frac{\chi^2(\nu_1)/\nu_1}{\chi^2(\nu_2)/\nu_2} \text{ or, setting } X = F(\nu_1, \nu_2),$$
$$f_X(x) = \frac{\nu_1^{\nu_1/2}\,\nu_2^{\nu_2/2}\,\Gamma((\nu_1+\nu_2)/2)}{\Gamma(\nu_1/2)\Gamma(\nu_2/2)}\cdot\frac{x^{(\nu_1-2)/2}}{(\nu_1 x + \nu_2)^{(\nu_1+\nu_2)/2}}$$
$$E(X) = \nu_2/(\nu_2 - 2) \text{ for } \nu_2 > 2.$$
The F distribution is used in the variance ratio tests and analysis of variance, where sums of squares are partitioned into independent chi-square variables whose normalized ratios, as described above, are tested for equality, etc. as variance ratios.

A.2.10 Exponential distribution


This is the distribution of the time to the next failure in a Poisson process. It is also known as the Laplace or negative exponential distribution, and is defined for X ≥ 0 and λ > 0.
$$\begin{aligned}
f_X(x) &= \lambda\exp(-\lambda x)\\
E(X) &= 1/\lambda\\
V(X) &= 1/\lambda^2.
\end{aligned}$$


Note that, if 0 ≤ X ≤ 1 has a uniform distribution, then Y = (−1/λ)log(X) has an exponential distribution. Also, when used as a model in survival analysis, this distribution does not allow for wear and tear, as the hazard function is just the constant λ, as follows:
$$S(x) = \exp(-\lambda x), \quad h(x) = \lambda, \quad H(x) = \lambda x.$$

A.2.11 Beta distribution


This is useful for modelling densities that are constrained to the unit interval 0 ≤ x ≤ 1, as a great many shapes can be generated by varying the parameters r > 0 and s > 0.
$$\begin{aligned}
f_X(x) &= \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\,x^{r-1}(1-x)^{s-1}\\
E(X) &= r/(r+s)\\
V(X) &= rs/((r+s+1)(r+s)^2)
\end{aligned}$$
It is also used to model situations where a probability, e.g., the binomial parameter, is treated as a random variable, and it is also used in order statistics, e.g., the distribution of the kth largest of n uniform (0, 1) random numbers is beta distributed with r = k and s = n − k + 1.

A.2.12 Gamma distribution


This distribution with x > 0, r > 0, and λ > 0 arises in modelling waiting times from Poisson processes.
$$\begin{aligned}
f_X(x) &= \frac{\lambda^r x^{r-1}\exp(-\lambda x)}{\Gamma(r)}\\
E(X) &= r/\lambda\\
V(X) &= r/\lambda^2
\end{aligned}$$
When r is a positive integer it is also known as the Erlang density, i.e. the time to the rth occurrence in a Poisson process with parameter λ.

A.2.13 Weibull distribution


This is used to model survival times where the survivor function S(x) and hazard rate or failure rate function h(x) are defined as follows.
$$\begin{aligned}
f_X(x) &= AB(Ax)^{B-1}\exp(-(Ax)^B)\\
F_X(x) &= 1 - \exp(-(Ax)^B)\\
S(x) &= 1 - F_X(x)\\
h(x) &= f_X(x)/S(x) = AB(Ax)^{B-1}.
\end{aligned}$$
It reduces to the exponential distribution when B = 1, but it is a much better model for survival times, due to the flexibility in curve shapes when B is allowed to vary, and the simple forms for the survivor and hazard functions. Various alternative parameterizations are used, for instance
$$f_X(x) = \frac{\alpha}{\beta}\left(\frac{x-\gamma}{\beta}\right)^{\alpha-1}\exp\left(-\left(\frac{x-\gamma}{\beta}\right)^{\alpha}\right)$$
$$E(X) = \gamma + \beta\,\Gamma\!\left(\frac{1}{\alpha}+1\right)$$
$$V(X) = \beta^2\left\{\Gamma\!\left(\frac{2}{\alpha}+1\right) - \Gamma^2\!\left(\frac{1}{\alpha}+1\right)\right\}.$$
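As a quick check of the first claim (a worked line added here for illustration), setting B = 1 in the hazard gives h(x) = AB(Ax)^{B−1} = A, a constant, so S(x) = exp(−H(x)) = exp(−Ax), which is exactly the exponential survivor function with λ = A.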


A.2.14 Logistic distribution


This resembles the normal distribution and is widely used in statistical modelling, e.g., the assumption of logistic dosage tolerances leads to a linear logistic regression model.
$$\begin{aligned}
f_X(x) &= \frac{\exp[(x-\mu)/\tau]}{\tau\{1 + \exp[(x-\mu)/\tau]\}^2}\\
F_X(x) &= \frac{\exp[(x-\mu)/\tau]}{1 + \exp[(x-\mu)/\tau]}\\
E(X) &= \mu\\
V(X) &= \pi^2\tau^2/3
\end{aligned}$$

A.2.15 Log logistic distribution


By analogy with the lognormal distribution, if log X has the logistic distribution, and μ = −log θ, τ = 1/κ, then the density, survivor function, and hazard functions are simpler than for the lognormal distribution, namely
$$\begin{aligned}
f(x) &= \frac{\kappa\theta^{\kappa}x^{\kappa-1}}{\{1 + (\theta x)^{\kappa}\}^2}\\
S(x) &= \frac{1}{1 + (\theta x)^{\kappa}}\\
h(x) &= \frac{\kappa\theta^{\kappa}x^{\kappa-1}}{1 + (\theta x)^{\kappa}}.
\end{aligned}$$

A.3 Non-central distributions


These distributions are similar to the corresponding central distributions but they require an additional non-centrality parameter, λ. One of the main uses is in calculations of statistical power as a function of sample size. For instance, calculating the power for a chi-square test requires the cumulative probability function for the non-central chi-square distribution.

A.3.1 Non-central beta distribution


The lower tail probability for parameters a and b is
$$P(X \le x) = \sum_{i=0}^{\infty}\frac{(\lambda/2)^i\exp(-\lambda/2)}{i!}\,P(\beta_{a+i,b} \le x),$$
where 0 ≤ x ≤ 1, a > 0, b > 0, λ ≥ 0, and P(β_{a+i,b} ≤ x) is the lower tail probability for the central beta distribution with parameters a + i and b.

A.3.2 Non-central chi-square distribution


The lower tail probability for ν degrees of freedom is
$$P(X \le x) = \sum_{i=0}^{\infty}\frac{(\lambda/2)^i\exp(-\lambda/2)}{i!}\,P(\chi^2_{\nu+2i} \le x),$$
where x ≥ 0, λ ≥ 0, and P(χ²_k ≤ x) is the lower tail probability for the central chi-square distribution with k degrees of freedom.


A.3.3 Non-central F distribution


The lower tail probability P(X ≤ x) for ν₁ and ν₂ degrees of freedom is
$$P(X \le x) = \sum_{i=0}^{\infty}\frac{(\lambda/2)^i\exp(-\lambda/2)}{i!}\,\frac{(\nu_1+2i)^{(\nu_1+2i)/2}\,\nu_2^{\nu_2/2}}{B((\nu_1+2i)/2,\,\nu_2/2)}\int_0^x \frac{u^{(\nu_1+2i-2)/2}}{[\nu_2 + (\nu_1+2i)u]^{(\nu_1+2i+\nu_2)/2}}\,du$$
where x ≥ 0, λ ≥ 0, and B(a, b) is the beta function for parameters a and b.

A.3.4 Non-central t distribution


The lower tail probability for ν degrees of freedom is
$$P(X \le x) = \frac{1}{\Gamma(\nu/2)\,2^{(\nu-2)/2}}\int_0^{\infty}\Phi\left(\frac{ux}{\sqrt{\nu}} - \lambda\right)u^{\nu-1}\exp(-u^2/2)\,du,$$
where Φ(y) is the lower tail probability for the standard normal distribution with argument y.

A.4 Variance stabilizing transformations


A number of transformations are in use that attempt to create new data that is more approximately normally distributed than the original data, or at least has more constant variance, as the two aims cannot usually both be achieved. If the distribution of X is known, then the variance of any function of X can of course be calculated. However, to a very crude first approximation, if a random variable X is transformed by Y = f(X), then the variances are related by the differential equation
$$V(Y) \approx \left(\frac{dY}{dX}\right)^2 V(X)$$
which yields f(.) on integration, e.g., if V(Y) = constant is required, given V(X).
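To make the integration step concrete (a worked example added here, not in the original text), take the Poisson case V(X) = E(X) = μ. Requiring V(Y) = c, a constant, gives
$$\left(\frac{dY}{dX}\right)^2\mu = c \quad\Rightarrow\quad \frac{dY}{dX} \propto \frac{1}{\sqrt{X}} \quad\Rightarrow\quad Y \propto \sqrt{X},$$
evaluating the derivative at X ≈ μ, which is the square root transformation of section A.4.2 below.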

A.4.1 Angular transformation


This arcsine transformation is sometimes used for binomial data with parameters N and p, e.g., for X successes in N trials, when
$$\begin{aligned}
X &\sim b(N, p)\\
Y &= \arcsin(\sqrt{X/N})\\
E(Y) &\approx \arcsin(\sqrt{p})\\
V(Y) &\approx 1/(4N) \text{ (using radial measure)}.
\end{aligned}$$
However, note that the variance of the transformed data is only constant in situations where there are constant binomial denominators.

A.4.2 Square root transformation


This is often used for counts, e.g., for Poisson variables with mean μ, when
$$\begin{aligned}
X &\sim \text{Poisson}(\mu)\\
Y &= \sqrt{x}\\
E(Y) &\approx \sqrt{\mu}\\
V(Y) &\approx 1/4.
\end{aligned}$$


A.4.3 Log transformation


When the variance of X is proportional to a known power α of E(X), then the power transformation Y = X^β will stabilize variance for β = 1 − α/2. The angular and square root transformations are, of course, just special cases of this, but a singular case of interest is the constant coefficient of variation situation V(X) ∝ E(X)², which justifies the log transform, as follows:
$$\begin{aligned}
E(X) &= \mu\\
V(X) &\propto \mu^2\\
Y &= \log X\\
V(Y) &= k, \text{ a constant}.
\end{aligned}$$

A.5 Special functions


A.5.1 Binomial coefficient
This is required in the treatment of discrete distributions. It is just the number of selections without regard to order of k objects out of n.
$$\begin{aligned}
\binom{n}{k} &= \frac{n!}{k!(n-k)!}\\
&= \frac{n(n-1)(n-2)\cdots(n-k+1)}{k(k-1)(k-2)\cdots 3\cdot 2\cdot 1}\\
&= \binom{n}{n-k}
\end{aligned}$$

A.5.2 Gamma and incomplete gamma functions


The gamma function is widely used in the treatment of continuous random variables and is defined as follows.
$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1}\exp(-t)\,dt$$
$$\Gamma(\alpha+1) = \alpha\Gamma(\alpha).$$
So that Γ(k) = (k − 1)! for integer k > 0, and
$$\Gamma(k + 1/2) = (2k-1)(2k-3)\cdots 5\cdot 3\cdot 1\,\sqrt{\pi}/2^k.$$
The incomplete gamma function P(x) and incomplete gamma function complement Q(x), given parameter α and usually, as here, normalized by the complete gamma function, are also frequently required.
$$P(x, \alpha) = \frac{1}{\Gamma(\alpha)}\int_0^x t^{\alpha-1}\exp(-t)\,dt$$
$$Q(x, \alpha) = \frac{1}{\Gamma(\alpha)}\int_x^{\infty} t^{\alpha-1}\exp(-t)\,dt$$
As the gamma distribution function with G > 0, α > 0, and β > 0 is
$$P(x, \alpha, \beta) = \frac{1}{\beta^{\alpha}\Gamma(\alpha)}\int_0^x G^{\alpha-1}\exp(-G/\beta)\,dG,$$
the incomplete gamma function is also the cumulative distribution function for a gamma distribution with second parameter equal to one.
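As a quick numerical illustration of the half-integer formula (added here), taking k = 3 gives
$$\Gamma(7/2) = 5\cdot 3\cdot 1\,\sqrt{\pi}/2^3 = \frac{15\sqrt{\pi}}{8} \approx 3.323.$$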


A.5.3 Beta and incomplete beta functions


Using the gamma function, the beta function is then defined as
$$B(g, h) = \frac{\Gamma(g)\Gamma(h)}{\Gamma(g+h)}$$
and the incomplete beta function for 0 ≤ x ≤ 1 as
$$R(x, g, h) = \frac{1}{B(g, h)}\int_0^x t^{g-1}(1-t)^{h-1}\,dt.$$

The incomplete beta function is also the cumulative distribution function for the beta distribution.

A.5.4 Exponential integrals


$$E_1(x) = \int_x^{\infty}\frac{\exp(-t)}{t}\,dt \qquad Ei(x) = -\int_{-x}^{\infty}\frac{\exp(-t)}{t}\,dt$$
where x > 0, excluding the origin in Ei(x).

A.5.5 Sine and cosine integrals and Euler's gamma


$$Si(x) = \int_0^x \frac{\sin t}{t}\,dt$$
$$Ci(x) = \gamma + \log x + \int_0^x \frac{\cos t - 1}{t}\,dt, \quad x > 0$$
$$\gamma = \lim_{m\to\infty}\left\{1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots + \frac{1}{m} - \log m\right\} = 0.5772156649\ldots$$

A.5.6 Fermi-Dirac integrals


$$f_\alpha(x) = \frac{1}{\Gamma(1+\alpha)}\int_0^{\infty}\frac{t^{\alpha}}{1 + \exp(t - x)}\,dt$$

A.5.7 Debye functions


$$f_n(x) = \frac{n}{x^n}\int_0^x \frac{t^n}{\exp(t) - 1}\,dt, \quad x > 0,\; n \ge 1$$

A.5.8 Clausen integral


$$f(x) = -\int_0^x \log\left(2\sin\frac{t}{2}\right)dt, \quad 0 \le x \le 2\pi$$

A.5.9 Spence integral


$$f(x) = \int_0^x \frac{-\log|1-t|}{t}\,dt$$


A.5.10 Dawson integral


$$f(x) = \exp(-x^2)\int_0^x \exp(t^2)\,dt$$

A.5.11 Fresnel integrals


$$C(x) = \int_0^x \cos\left(\frac{\pi}{2}t^2\right)dt \qquad S(x) = \int_0^x \sin\left(\frac{\pi}{2}t^2\right)dt$$

A.5.12 Polygamma functions


The polygamma function is defined in terms of the gamma function Γ(x), or the psi function (i.e. digamma function) ψ(x) = Γ′(x)/Γ(x):
$$\begin{aligned}
\psi^{(n)}(x) &= \frac{d^{n+1}}{dx^{n+1}}\log\Gamma(x)\\
&= \frac{d^n}{dx^n}\psi(x)\\
&= (-1)^{n+1}\int_0^{\infty}\frac{t^n\exp(-xt)}{1 - \exp(-t)}\,dt.
\end{aligned}$$
So the case n = 0 is the digamma function, the case n = 1 is the trigamma function, and so on.

A.5.13 Struve functions


$$H_\nu(z) = \frac{2(\frac{1}{2}z)^\nu}{\sqrt{\pi}\,\Gamma(\nu+\frac{1}{2})}\int_0^{\pi/2}\sin(z\cos\theta)\sin^{2\nu}\theta\,d\theta$$
$$L_\nu(z) = \frac{2(\frac{1}{2}z)^\nu}{\sqrt{\pi}\,\Gamma(\nu+\frac{1}{2})}\int_0^{\pi/2}\sinh(z\cos\theta)\sin^{2\nu}\theta\,d\theta$$

A.5.14 Kummer confluent hypergeometric functions


$$M(a, b, z) = 1 + \frac{az}{b} + \frac{(a)_2 z^2}{(b)_2 2!} + \cdots + \frac{(a)_n z^n}{(b)_n n!} + \cdots$$
where (a)_n = a(a+1)(a+2)...(a+n−1), (a)_0 = 1.
$$U(a, b, z) = \frac{\pi}{\sin\pi b}\left[\frac{M(a, b, z)}{\Gamma(1+a-b)\Gamma(b)} - z^{1-b}\,\frac{M(1+a-b, 2-b, z)}{\Gamma(a)\Gamma(2-b)}\right]$$
$$\begin{aligned}
U(a, n+1, z) = \frac{(-1)^{n+1}}{n!\,\Gamma(a-n)}\Big[&M(a, n+1, z)\log z\\
&+ \sum_{r=0}^{\infty}\frac{(a)_r z^r}{(n+1)_r r!}\{\psi(a+r) - \psi(1+r) - \psi(1+n+r)\}\Big]\\
&+ \frac{(n-1)!}{\Gamma(a)}\,z^{-n}\,M(a-n, 1-n, z)
\end{aligned}$$

A.5.15 Abramovitz functions


The Abramovitz functions of order n = 0, 1, 2 are defined for x ≥ 0 as
$$f(x) = \int_0^{\infty} t^n \exp(-t^2 - x/t)\,dt$$


A.5.16 Legendre polynomials


The Legendre polynomials P_n(x) and P_n^m(x) are defined in terms of the hypergeometric function F, or Rodrigues' formula, for −1 ≤ x ≤ 1 by
$$\begin{aligned}
P_n(x) &= F\left(-n, n+1, 1, \frac{1}{2}(1-x)\right)\\
&= \frac{1}{2^n n!}\frac{d^n}{dx^n}(x^2-1)^n\\
P_n^m(x) &= (-1)^m(1-x^2)^{m/2}\frac{d^m}{dx^m}P_n(x).
\end{aligned}$$

A.5.17 Bessel, Kelvin, and Airy functions


$$J_\nu(z) = (\tfrac{1}{2}z)^\nu\sum_{k=0}^{\infty}\frac{(-\tfrac{1}{4}z^2)^k}{k!\,\Gamma(\nu+k+1)}$$
$$Y_\nu(z) = \frac{J_\nu(z)\cos(\nu\pi) - J_{-\nu}(z)}{\sin(\nu\pi)}$$
$$I_\nu(z) = (\tfrac{1}{2}z)^\nu\sum_{k=0}^{\infty}\frac{(\tfrac{1}{4}z^2)^k}{k!\,\Gamma(\nu+k+1)}$$
$$K_\nu(z) = \frac{\pi}{2}\,\frac{I_{-\nu}(z) - I_\nu(z)}{\sin(\nu\pi)}$$
$$\mathrm{ber}_\nu x + i\,\mathrm{bei}_\nu x = \exp(\tfrac{1}{2}\nu\pi i)\,I_\nu(x\exp(\tfrac{1}{4}\pi i))$$
$$\mathrm{ker}_\nu x + i\,\mathrm{kei}_\nu x = \exp(-\tfrac{1}{2}\nu\pi i)\,K_\nu(x\exp(\tfrac{1}{4}\pi i))$$
$$Ai(z) = \tfrac{1}{3}\sqrt{z}\,[I_{-1/3}(\xi) - I_{1/3}(\xi)], \text{ where } \xi = \tfrac{2}{3}z^{3/2}$$
$$Bi(z) = \sqrt{z/3}\,[I_{-1/3}(\xi) + I_{1/3}(\xi)]$$
$$Ai'(z) = \tfrac{1}{3}z\,[I_{2/3}(\xi) - I_{-2/3}(\xi)]$$
$$Bi'(z) = (z/\sqrt{3})\,[I_{-2/3}(\xi) + I_{2/3}(\xi)]$$

A.5.18 Elliptic integrals


$$R_C(x, y) = \frac{1}{2}\int_0^{\infty}\frac{dt}{\sqrt{t+x}\,(t+y)}$$
$$R_F(x, y, z) = \frac{1}{2}\int_0^{\infty}\frac{dt}{\sqrt{(t+x)(t+y)(t+z)}}$$
$$R_D(x, y, z) = \frac{3}{2}\int_0^{\infty}\frac{dt}{\sqrt{(t+x)(t+y)(t+z)^3}}$$
$$R_J(x, y, z, \rho) = \frac{3}{2}\int_0^{\infty}\frac{dt}{(t+\rho)\sqrt{(t+x)(t+y)(t+z)}}$$
$$u = \int_0^{\phi}\frac{d\theta}{\sqrt{1 - m\sin^2\theta}}$$
$$SN(u|m) = \sin\phi \qquad CN(u|m) = \cos\phi \qquad DN(u|m) = \sqrt{1 - m\sin^2\phi}$$

A.5.19 Single impulse functions


These discontinuous functions all generate a single impulse, but some of them require special techniques for plotting, which are described on page 275.
A.5.19.1 Heaviside unit function

The Heaviside unit function h(x − a) is defined as
$$\begin{aligned}
h(x-a) &= 0, \text{ for } x < a\\
&= 1, \text{ for } x \ge a,
\end{aligned}$$
so it provides a useful way to construct models that switch definition at critical values of the independent variable, in addition to acting in the usual way as a ramp function.
A.5.19.2 Kronecker delta function

The Kronecker delta function δ_ij is defined as
$$\begin{aligned}
\delta_{ij} &= 0, \text{ for } i \ne j\\
&= 1, \text{ for } i = j,
\end{aligned}$$
which can be very useful when constructing models with simple logical switches.
A.5.19.3 Unit impulse function

The single square wave impulse function f(x, a, b) of width 2b > 0 with unit area is defined as
$$\begin{aligned}
f(x, a, b) &= 0, \text{ for } x < a - b \text{ or } x > a + b\\
&= 1/2b, \text{ for } a - b \le x \le a + b,
\end{aligned}$$
so it can be used to model the Dirac delta function by using extremely small values for b.
A.5.19.4 Unit spike function

The triangular spike function f(x, a, b) of width 2b > 0 with unit area is defined as
$$\begin{aligned}
f(x, a, b) &= 0, \text{ for } x < a - b \text{ or } x > a + b\\
&= (x - a + b)/b^2, \text{ for } a - b \le x \le a\\
&= (a + b - x)/b^2, \text{ for } a \le x \le a + b.
\end{aligned}$$

A.5.19.5 Gauss pdf

The probability density function f(x, a, b) for the normal distribution with unit area is defined for b > 0 as
$$f(x, a, b) = \frac{1}{\sqrt{2\pi}\,b}\exp\left\{-\frac{1}{2}\left(\frac{x-a}{b}\right)^2\right\}$$

which is very useful for modelling bell shaped impulse functions.

A.5.20 Periodic impulse functions


These generate pulses at regular intervals, and some of them require special techniques for plotting, as described on page 276.


A.5.20.1 Square wave function

This has an amplitude of one and period of 2a > 0, and can be described for t ≥ 0, in terms of the Heaviside unit function h(t), as
$$f_1(t) = h(t) - 2h(t-a) + 2h(t-2a) - \cdots,$$
so it oscillates between plus and minus one at each x-increment of length a, with pulses of area plus and minus a.
A.5.20.2 Rectified triangular wave

This generates a triangular pulse of unit amplitude with period 2a > 0 and can be defined for t ≥ 0 as
$$f_2(t) = \frac{1}{a}\int_0^t f_1(u)\,du$$
so it consists of a series of contiguous isosceles triangles of area a.


A.5.20.3 Morse dot wave function

This is essentially the upper half of the square wave. It has a period of 2a > 0 and can be defined for t ≥ 0 by
$$f_3(t) = \frac{1}{2}[h(t) + f_1(t)] = \sum_{i=0}^{\infty}(-1)^i h(t - ia),$$
so it alternates between sections of zero and squares of area a.


A.5.20.4 Sawtooth wave function

This consists of half triangles of unit amplitude and period a > 0 and can be defined for t ≥ 0 by
$$f_4(t) = \frac{t}{a} - \sum_{i=1}^{\infty} h(t - ia),$$
so it generates a sequence of right angled triangles with area a/2.


A.5.20.5 Rectified sine wave function

This is just the absolute value of the sine function with argument at, that is
$$f_5(t) = |\sin at|,$$
so it has unit amplitude and period π/a.

A.5.20.6 Rectified sine half-wave function

This is just the positive part of the sine wave
$$f_6(t) = \frac{1}{2}[\sin at + |\sin at|]$$
so it has period 2π/a.
A.5.20.7 Unit impulse wave function

This is a sequence of Dirac delta function pulses with unit area given by
$$f_7(t) = \sum_{i=1}^{\infty}\delta(t - ia).$$

Of course, the width of the pulse has to be specified as a parameter b, so the function can be used to generate a spaced sequence of rectangular impulses with arbitrary width but unit area. Small values of b simulate pulses of the Dirac delta function at intervals of length a.

Appendix B

User defined models


B.1 Supplying models as a dynamic link library
This is still the best method to supply your own models, but it requires programming skill. You can write in a language such as Fortran or C, or even use assembler. Using this technique, you can specify your own numerical analysis library, or even reference other dynamic link libraries, as long as all the locations and addresses are consistent with the other SIMFIT dynamic link libraries. Since the development of program usermod it is now only necessary to create new entries in the dynamic link library for convenience, or for very large sets of differential equations, or for models requiring special functions not supported by w_models.dll.

B.2 Supplying models as ASCII text files


The method that has been developed for the SIMFIT package works extremely well, and can be used to create very complex model equations, even though it requires no programming skill. You can use all the usual mathematical functions, including trigonometric and hyperbolic functions, as well as the gamma function, the log-gamma function, the normal probability integral and the erfc function. The set of allowable functions will increase as w_models.dll is upgraded. The essence of the method is to supply an ASCII text file containing a set of instructions in reverse Polish, that is, postfix, or last in first out, notation, which will be familiar to all programmers, since it is, after all, essentially the way that computers evaluate mathematical expressions. Using reverse Polish, any explicit model can be written as a series of sequential instructions without using any brackets. Just think of a stack to which arguments are added and functions which operate on the top item of the stack. Suppose the top item is the current value of x and the operator log is added; then this will replace x by log(x) as the current item on the top of the stack. What happens is that the model file is read in, checked, and parsed just once to create a virtual instruction stack. This means that model execution is very rapid, since the file is only read once, but also it allows users to optimize the stack operations by rolling, duplicating, storing, retrieving, and so on, which is very useful when developing code with repeated subexpressions, such as occur frequently with systems of differential equations and Jacobians. So, to supply a model this way, you must create an ASCII text file with the appropriate model or differential equation. This is described in the w_readme.? files and can be best understood by browsing the test files supplied, i.e. usermod1.tf? for functions of one variable, usermod2.tf? for functions of two variables, usermod3.tf? for functions of three variables and usermodd.tf? for single differential equations. The special program usermod should be used to develop and test your models before trying them with makdat or qnt. Note that usermod checks your model file for syntax errors, but it also allows you to evaluate the model, plot it, or even use it to find areas or zeros of n functions in n unknowns. Note that new syntax is being developed for this method, as described in the w_readme.* files. For instance, put and get commands considerably simplify the formulation of models with repeated sub-expressions.


Further details about the performance of this method for supplying mathematical models as ASCII text files can be found in an article by Bardsley, W.G. and Prasad, N. in Computers and Chemistry (1997) 21, 71-82. Examples will now be given in order to explain the format that must be adopted by users to define their own models for simulation, fitting and plotting.

B.2.1 Example 1: a straight line


This example illustrates how the test file usermod1.tf1 codes for a simple straight line.

%
Example: user supplied function of 1 variable ... a straight line
.............
p(1) + p(2)*x
.............
%
1 equation
1 variable
2 parameters
%
p(1)
p(2)
x
multiply
add
f(1)
%

Now exactly the same model but with comments added to explain what is going on. Note that in the model file, all text to the right of the instruction is treated as comment for the benefit of users and it is not referenced when the model is parsed.

%                   start of text defining model indicated by %
Example: user supplied function of 1 variable ... a straight line
.............
p(1) + p(2)*x
.............
%                   end of text, start of parameters indicated by %
1 equation          number of equations to define the model
1 variable          number of variables (or differential equation)
2 parameters        number of parameters in the model
%                   end of parameters, start of model indicated by %
p(1)                put p(1) on the stack: stack = p(1)
p(2)                put p(2) on the stack: stack = p(1), p(2)
x                   put an x on the stack: stack = p(1), p(2), x
multiply            multiply top elements: stack = p(1), p(2)*x
add                 add the top elements: stack = p(1) + p(2)*x
f(1)                evaluate the model f(1) = p(1) + p(2)*x
%                   end of the model definitions indicated by %

B.2.2 Example 2: damped simple harmonic motion


This time test file usermod1.tf9 illustrates trigonometric and exponential functions.


%
Example: user supplied function of 1 variable ... damped SHM
Damped simple harmonic motion in the form
f(x) = p(4)*exp[-p(3)*x]*cos[p(1)*x - p(2)]
where p(i) >= 0
%
1 equation
1 variable
4 parameters
%
p(1)
x
multiply
p(2)
subtract
cosine
p(3)
x
multiply
negative
exponential
multiply
p(4)
multiply
f(1)
%

B.2.3 Example 3: diffusion into a capillary


Test file usermod1.tf8 codes for diffusion into a capillary and shows how to call special functions, in this case the error function complement with argument equal to distance divided by twice the square root of the product of the diffusion constant and time (i.e. p(2)).

%
Example: user supplied function of 1 variable ... capillary diffusion
f(x) = p(1)*erfc[x/(2*sqrt(p(2)))]
%
1 equation
1 variable
2 parameters
%
x
p(2)
squareroot
2
multiply
divide
erfc
p(1)
multiply
f(1)
%


B.2.4 Example 4: defining three models at the same time


The test file line3.mod illustrates the technique for defining several models for simultaneous fitting by program qnfit, in this case three straight lines unlinked for simplicity, although the models can be of arbitrary complexity and they can be linked by common parameters.

%
Example: user supplied function of 1 variable ... 3 straight lines
f(1) = p(1) + p(2)x: (line 1)
f(2) = p(3) + p(4)x: (line 2)
f(3) = p(5) + p(6)x: (line 3)
%
3 equations
1 variable
6 parameters
%
p(1)
p(2)
x
multiply
add
f(1)
p(3)
p(4)
x
multiply
add
f(2)
p(5)
p(6)
x
multiply
add
f(3)
%

B.2.5 Example 5: Lotka-Volterra predator-prey differential equations


A special extended version of this format is needed with systems of differential equations, where the associated Jacobian can be supplied, as well as the differential equations, if the equations are stiff and Gear's method is required. However, supplying the wrong Jacobian is a common source of error in differential equation solving, so you should always compare results with the option to calculate the Jacobian numerically, especially if slow convergence is suspected. A dummy Jacobian can be supplied if Gear's method is not to be used or if the Jacobian is to be estimated numerically. You can even prepare a differential equation file with no Jacobian at all. So, to develop a model file for a system of differential equations, you first of all write the model ending with two lines, each containing only a %. When this runs properly you can start to add code for the Jacobian by adding new lines between the two % lines. This will be clear from inspecting the large number of model files provided and the readme.* files. If at any stage the code with Jacobian runs more slowly than the code without the Jacobian, then the Jacobian must be coded incorrectly. The next example is the text for test file deqmod2.tf2 which codes for the Lotka-Volterra predator-prey equations. This time all the comments are left in and a Jacobian is coded. This can be left out entirely by following the model by a percentage sign on two consecutive lines. SIMFIT can use the Adams method


but can still use Gear's method by estimating the Jacobian by finite differences. Note that the Jacobian is initialized to the identity, so when supplying a Jacobian only the elements not equal to identity elements need be set.

%
Example of a user supplied pair of differential equations
file: deqmod2.tf2 (typical parameter file deqpar2.tf2)
model: Lotka-Volterra predator-prey equations
differential equations:
f(1) = dy(1)/dx
     = p(1)*y(1) - p(2)*y(1)*y(2)
f(2) = dy(2)/dx
     = -p(3)*y(2) + p(4)*y(1)*y(2)
jacobian:
j(1) = df(1)/dy(1)
     = p(1) - p(2)*y(2)
j(2) = df(2)/dy(1)
     = p(4)*y(2)
j(3) = df(1)/dy(2)
     = -p(2)*y(1)
j(4) = df(2)/dy(2)
     = -p(3) + p(4)*y(1)
initial condition: y0(1) = p(5), y0(2) = p(6)
Note: the last parameters must be y0(i) in differential equations
%
2 equations            no. equations
differential equation  no. variables (or differential equation)
6 parameters           no. of parameters in this model
%
y(1)       stack = y(1)
y(2)       stack = y(1), y(2)
multiply   stack = y(1)*y(2)
duplicate  stack = y(1)*y(2), y(1)*y(2)
p(2)       stack = y(1)*y(2), y(1)*y(2), p(2)
multiply   stack = y(1)*y(2), p(2)*y(1)*y(2)
negative   stack = y(1)*y(2), -p(2)*y(1)*y(2)
p(1)       stack = y(1)*y(2), -p(2)*y(1)*y(2), p(1)
y(1)       stack = y(1)*y(2), -p(2)*y(1)*y(2), p(1), y(1)
multiply   stack = y(1)*y(2), -p(2)*y(1)*y(2), p(1)*y(1)
add        stack = y(1)*y(2), p(1)*y(1) - p(2)*y(1)*y(2)
f(1)       evaluate dy(1)/dx
p(4)       stack = y(1)*y(2), p(4)
multiply   stack = p(4)*y(1)*y(2)
p(3)       stack = p(4)*y(1)*y(2), p(3)
y(2)       stack = p(4)*y(1)*y(2), p(3), y(2)
multiply   stack = p(4)*y(1)*y(2), p(3)*y(2)
subtract   stack = -p(3)*y(2) + p(4)*y(1)*y(2)
f(2)       evaluate dy(2)/dx
%          end of model, start of Jacobian
p(1)       stack = p(1)
p(2)       stack = p(1), p(2)
y(2)       stack = p(1), p(2), y(2)
multiply   stack = p(1), p(2)*y(2)
subtract   stack = p(1) - p(2)*y(2)
j(1)       evaluate J(1,1)
p(4)
y(2)
multiply
j(2)       evaluate J(2,1)
p(2)
y(1)
multiply
negative
j(3)       evaluate J(1,2)
p(4)
y(1)
multiply
p(3)
subtract
j(4)       evaluate J(2,2)
%

B.2.6 Example 6: supplying initial conditions


The test file deqpar2.tf2 illustrates how initial conditions, starting estimates and limits are supplied.

Title line...(1) Parameter file for deqmod2.tf2 .. this line is ignored
0            (2) IRELAB: mixed(0), decimal places(1), sig. digits(2)
6            (3) M = number of parameters (include p(M-N+1)=y0(1), etc.)
1            (4) METHOD: Gear(1), Runge_Kutta(2), Adams(3)
1            (5) MPED: Jacobian estimated(0), calculated(1)
2            (6) N = number of equations
41           (7) NPTS = number of time points
0.0,1.0,3.0  (7+1) pl(1),p(1),ph(1) parameter 1
0.0,1.0,3.0  (7+2) pl(2),p(2),ph(2) parameter 2
0.0,1.0,3.0  (7+3) pl(3),p(3),ph(3) parameter 3
0.0,1.0,3.0  (7+4) pl(4),p(4),ph(4) parameter 4
0.0,1.0,3.0  (7+5) pl(5),p(5),ph(5) y0(1)
0.0,0.5,3.0  (7+M) pl(6),p(6),ph(6) y0(2)
1.0e-4       (7+M+1) TOL: tolerance
10.0         (7+M+2) XEND: end of integration
0.0          (7+M+3) XSTART: start of integration

An initial conditions file supplies all the values required for a simulation or curve fitting problem with differential equations using programs deqsol or qnfit. Note that the values must be supplied in exactly the above order. The first line (title) and trailing lines after (7+M+3) are ignored. Field width for most values is 12 columns, but is 36 for parameters. Comments can be added after the last significant column if required. Parameters are in the order pl(i), p(i), ph(i), where pl(i) are the bottom limits, p(i) are the starting parameters and ph(i) are the upper limits for curve fitting. To fix a parameter during curve fitting just set pl(i) = p(i) = ph(i). Note that pl(i) and ph(i) are used in curve fitting but not simulation. Parameters 1 to M − N are the parameters in the equations, but parameters M − N + 1 to M are the initial conditions, namely y0(1) to y0(N).

B.2.7 Example 7: transforming differential equations


If you just want information on a sub-set of the components, y(i), you can select any required components (interactively) in deqsol. If you only want to fit a sub-set of components, this is done by adding escape


sequences to the input data library file as shown by the % characters in the example files deqsol.tf2 and deqsol.tf3. A more complicated process is required if you are interested only in some linear combination of the y(i), and do not want to (or cannot) re-write the differential equations into an appropriate form, even using conservation equations to eliminate variables. To solve this problem you can input a matrix A, then simply choose y(new) = A y(old), where after integration y(new) replaces y(old).

Format for A-type files

The procedure is that when transformation is selected, deqsol sets A equal to the identity matrix then it reads in your file with the sub-matrix to overwrite A. The A-file simply contains a column of i-values, a column of j-values and a column of corresponding A(i, j) values. To prepare such a file you can use makmat or a text editor. Consult the test files (deqmat.tf?) for examples.

Examples

An A matrix to interchange y(1) and y(2):

0 1
1 0

An A matrix to replace y(2) by y(1) + y(2):

1 0
1 1

An A matrix to replace y(2) by 0.5y(1) + 2.0y(2) - y(3) then swap y(1) and y(3):

0.0 0.0 1.0
0.5 2.0 -1.0
1.0 0.0 0.0

Note the following facts.

1. You only need to supply elements A(i, j) which differ from those of the corresponding identity matrix.
2. The program solves for the actual y(i) then makes new vectors z(i) where y(i) are to be transformed. The z(i) are then copied onto the new y(i).
3. This is very important. To solve the y(i) the program has to start with the actual initial conditions y0(i). So, even if the y(i) are transformed by an A which is not the identity, the y0 are never transformed. When simulating you must remember that the y0(i) you set are true y0(i) and when curve-fitting, parameters estimated are the actual y0(i), not transformed ones.
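As a one-line check of the first example (added here for illustration), applying that A to the old vector gives
$$\begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix}\begin{pmatrix}y_1\\ y_2\end{pmatrix} = \begin{pmatrix}y_2\\ y_1\end{pmatrix},$$
so the two components are indeed interchanged after integration.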

B.2.8 Formatting conventions for user defined models


Please observe the use of the special symbol % in model files. The symbol % starting a line is an escape sequence to indicate a change in the meaning of the input stream, e.g., from text to parameters, from parameters to model instructions, from model to Jacobian, etc. Characters occurring to the right of the first non-blank character are interpreted as comments and text here is ignored when the model is parsed. The % symbol must be used to indicate:

i) start of the file
ii) start of the model parameters
iii) start of the model equations
iv) end of model equations (start of Jacobian with diff. eqns.)

The file you supply must have exactly the format now described.


a) The file must start with a % symbol indicating where text starts. The next lines must be the name/details you choose for the model. This would normally be at least 4 and not greater than 24 lines. This text is only to identify the model and is not used by SIMFIT. The end of this section is marked by a % symbol. The next three lines define the type of model.
b) The first of these lines must indicate the number of equations in the model, e.g., 1 equation, 2 equations, 3 equations, etc.
c) The next must indicate the number of independent variables as in:- 1 variable, 2 variables, 3 variables, etc., or else it could be differential equation to indicate that the model is one or a set of ordinary differential equations with one independent variable.
d) The next line must define the number of parameters in the model.
e) With differential equations, the last parameters are reserved to set the values for the integration constants y0(i), which can be either estimated or fixed as required. For example, if there are n equations and m parameters are declared in the file, only m − n can be actually used in the model, since y0(i) = p(m − n + i) for i = 1, 2, ..., n.
f) Lines are broken up into tokens by spaces.
g) Only the first token in each line matters after the model starts.
h) Comments begin with % and are added just to explain what's going on.
i) Usually the comments beginning with a % can be omitted.
j) Critical lines starting with % must be present as explained above.
k) The model operations then follow, one per line, until the next line starting with a % character indicates the end of the model.
l) Numbers can be in any format, e.g., 2, 1.234, 1.234E-6, 1.234E6
m) The symbol f(i) indicates that model equation i is evaluated at this point.
n) Differential equations can define the Jacobian after defining the model. If there are n differential equations of the form
$$\frac{dy(i)}{dx} = f(i)(x, y(1), y(2), ..., y(n))$$
then the symbol y(i) is used to put y(i) on the stack, and there must be an n by n matrix defined in the following way. The element J(a, b) is indicated by putting j(n(b − 1) + a) on the stack. That is, the columns are filled up first. For instance, with 3 equations you would have a Jacobian J(i, j) = df(i)/dy(j) defined by the sequence:

J(1,1) = j(1), J(1,2) = j(4), J(1,3) = j(7)
J(2,1) = j(2), J(2,2) = j(5), J(2,3) = j(8)
J(3,1) = j(3), J(3,2) = j(6), J(3,3) = j(9)

B.2.8.1 Table of user-dened model commands

Command          Effects produced
x                stack -> stack, x
y                stack -> stack, y
z                stack -> stack, z
add              stack, a, b -> stack, (a + b)
subtract         stack, a, b -> stack, (a - b)
multiply         stack, a, b -> stack, (a*b)
divide           stack, a, b -> stack, (a/b)
p(i)             stack -> stack, p(i) ... i can be 1, 2, 3, etc
f(i)             stack, a -> stack ... evaluate model since now f(i) = a
power            stack, a, b -> stack, (a^b)
squareroot       stack, a -> stack, sqrt(a)
exponential      stack, a -> stack, exp(a)
tentothepower    stack, a -> stack, 10^a
ln (or log)      stack, a -> stack, ln(a)
log10            stack, a -> stack, log(a) (to base ten)
pi               stack -> stack, 3.1415927
sine             stack, a -> stack, sin(a) ... radians not degrees
cosine           stack, a -> stack, cos(a) ... radians not degrees
tangent          stack, a -> stack, tan(a) ... radians not degrees
arcsine          stack, a -> stack, arcsin(a) ... radians not degrees
arccosine        stack, a -> stack, arccos(a) ... radians not degrees
arctangent       stack, a -> stack, arctan(a) ... radians not degrees
sinh             stack, a -> stack, sinh(a)
cosh             stack, a -> stack, cosh(a)
tanh             stack, a -> stack, tanh(a)
exchange         stack, a, b -> stack, b, a
duplicate        stack, a -> stack, a, a
pop              stack, a, b -> stack, a
absolutevalue    stack, a -> stack, abs(a)
negative         stack, a -> stack, -a
minimum          stack, a, b -> stack, min(a,b)
maximum          stack, a, b -> stack, max(a,b)
gammafunction    stack, a -> stack, gamma(a)
lgamma           stack, a -> stack, ln(gamma(a))
normalcdf        stack, a -> stack, phi(a) integral from -infinity to a
erfc             stack, a -> stack, erfc(a)
y(i)             stack -> stack, y(i) ... only diff. eqns.
j(i)             stack, a -> stack ... J(i-(i/n), (i/n)+1) only diff. eqns.
***              stack -> stack, *** ... *** can be any number
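To show several of these commands working together, here is a small hypothetical model file (an illustration added to this text, not one of the distributed test files) coding a Gaussian shaped peak f(x) = p(1)*exp{-[(x - p(2))/p(3)]^2}; the duplicate command squares the standardized variable without re-computing it.

%
Example: hypothetical user supplied function of 1 variable
a Gaussian shaped peak
f(x) = p(1)*exp{-[(x - p(2))/p(3)]^2}
%
1 equation
1 variable
3 parameters
%
x            stack = x
p(2)         stack = x, p(2)
subtract     stack = x - p(2)
p(3)         stack = x - p(2), p(3)
divide       stack = a, where a = (x - p(2))/p(3)
duplicate    stack = a, a
multiply     stack = a^2
negative     stack = -a^2
exponential  stack = exp(-a^2)
p(1)         stack = exp(-a^2), p(1)
multiply     stack = p(1)*exp(-a^2)
f(1)         evaluate the model
%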

B.2.8.2 Table of synonyms for user-dened model commands

The following sets of commands are equivalent:

sub, minus, subtract
mul, multiply
div, divide
sqrt, squarero, squareroot
exp, exponent, exponential
ten, tentothe, tentothepower
ln, log
sin, sine
cos, cosine
tan, tangent
asin, arcsin, arcsine
acos, arccos, arccosine
atan, arctan, arctangent
dup, duplicate
exch, exchange, swap
del, delete, pop
abs, absolute


neg, negative
min, minimum
max, maximum
phi, normal, normalcd, normalcdf
abserr, epsabs
relerr, epsrel
middle, mid
B.2.8.3 Error handling in user dened models

As the stack is evaluated, action is taken to avoid underflow, overflow and forbidden operations, like 1/x as x tends to zero or taking the log or square root of a negative number, etc. This should never be necessary, as users should be able to design the fitting or simulation procedures in such a way that such singularities are not encountered, rather than relying on default actions which can lead to very misleading results.
B.2.8.4 Notation for functions of more than three variables

The model for nine equations in nine unknowns coded in usermodn.tf4 is provided to show you how to use usermod to find a zero vector for a system of nonlinear equations. It illustrates how to code for n functions of m variables, and shows how to use y(1), y(2), . . . , y(m) instead of x, y, z, etc. The idea is that, when there are more than three variables, you should not use x, y or z in the model file; you should use y(1), y(2) and y(3), etc.
B.2.8.5 The commands put(.) and get(.)

These facilitate writing complicated models that re-use previously calculated expressions (very common with differential equations). The commands are as follows:

put(i)  : move the top stack element into storage location i
get(j)  : transfer storage element j onto the top of the stack

and the following code gives an example.

x
put(11)
get(11)
get(11)
multiply
put(23)
get(11)
get(23)
add       : now (x + x^2) has been added to the top of the stack

It should be observed that these two commands reference a global store. This is particularly important when a main model calls sub-models for evaluation, root finding or quadrature, as it provides a way to communicate information between the models.
B.2.8.6 The command get3(.,.,.)

Often a three way branch is required where the next step in a calculation depends on whether a critical expression is negative, zero or positive, e.g., the nature of the roots of a quadratic equation depend on the value of the discriminant. The way this is effected is to define the three possible actions and store the results using three put commands. Then use a get3(i,j,k) command to pop the top stack element and invoke get(i) if the element is negative, get(j) if it is zero (to machine precision), or get(k) otherwise. For example, the model updown.mod, illustrated on page 274, is a simple example of how to use a get3(.,.,.) command to


define a model which swaps over from one equation to another at a critical value of the independent variable. This command can be used after the command order has been used to place a −1, 0 or 1 on the stack, and the model updownup.mod illustrates how to use order with value3 to create a model with two swap-over points.

x
negative
put(5)
1.0e-100
put(6)
x
put(7)
x
get3(5,6,7)   : now the top stack element is |x|, or 1.0e-100

B.2.8.7 The commands epsabs and epsrel

When the evaluation of a model requires iterative procedures, like root finding or quadrature, the absolute and relative error tolerances must be specified. Default values (1.0e-6 and 1.0e-3) are initialized and these should suffice for most purposes. However you may need to specify alternative values by using the epsabs or epsrel commands with difficult problems, or to control the accuracy used in a calculation. For instance, when fitting a model that involves the evaluation of a multiple integral as a sub-model, you would normally use fairly large values for the error tolerances until a good fit has been found, then decrease the tolerances for a final refinement. Values added to the stack to define the tolerances are popped.

1.0e-4
epsabs
1.0e-2
epsrel    : now epsabs = 0.0001 and epsrel = 0.01

B.2.8.8 The commands blim(.) and tlim(.)

When finding roots of equations it is necessary to specify starting limits, and when integrating by quadrature it is necessary to supply lower and upper limits of integration. The command blim(i) sets the lower limit for variable i while the command tlim(j) sets the upper limit for variable j. Values added to the stack to define the limits are popped.

0
blim(1)
0
blim(2)
pi
tlim(1)
pi
2
multiply
tlim(2)   : limits are now 0 < x < 3.14159 and 0 < y < 6.28318

B.2.9 Plotting user dened models


Once a model has been checked by program usermod it can be plotted directly if it is a function of one variable, a function of two variables, a parametric curve in r(θ) format (page 280) or x(t), y(t) format (page 267), or a space curve in x(t), y(t), z(t) format (page 268). This is also a very convenient way to simulate families of curves described as separate functions of one variable, as will be readily appreciated by reading in the test file usermodn.tf1, which defines four trigonometric functions of one variable.


B.2.10 Finding zeros of user defined models


After a model function of one variable has been checked by program usermod it is possible to locate zeros of the equation f(x) − k = 0 for fixed values of k. It is no use expecting this root finding to work unless you know the approximate location of a root and can supply two values A, B that bracket the root required, in the sense that f(A)f(B) < 0. For this reason, it is usual to simulate the model first and observe the plot until two sensible limits A, B are located. Try usermod.tf1, which is just a straight line, to get the idea. Note that in difficult cases, where IFAIL is not returned as zero, it will be necessary to adjust EPSABS and EPSREL, the absolute and relative error tolerances.

B.2.11 Finding zeros of n functions in n variables


When a model file defining n functions of n unknowns has been checked by program usermod it is possible to attempt to find an n-vector solution given n starting estimates, if n > 1. As the location of such vectors uses iterative techniques, it is only likely to succeed if sensible starting estimates are provided. As an example, try the model file usermodn.tf4 which defines nine equations in nine unknowns. Note that to obtain IFAIL equal to zero, i.e. a satisfactory solution, you will have to experiment with starting estimates. Observe that these are supplied using the usermod y vector, not the parameter vector p. Try setting the nine elements of the y vector to zero, which is easily done from a menu.

B.2.12 Integrating 1 function of 1 variable


After using program usermod to check a function of one variable, definite integration over fixed limits can be done by two methods. Simpson's rule is used, because users may wish to embed a straightforward Simpson's rule calculation in a model, but also adaptive quadrature is used, in case the integral is ill conditioned, e.g., has spikes or poles. Again preliminary plotting is recommended for ill-conditioned functions. Try usermod1.tf1 to see how it all works.
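For reference (a reminder added here, not part of the original text), the composite Simpson's rule over [a, b] with an even number n of subintervals of width h = (b − a)/n is
$$\int_a^b f(x)\,dx \approx \frac{h}{3}\left[f_0 + 4f_1 + 2f_2 + 4f_3 + \cdots + 4f_{n-1} + f_n\right],$$
where f_i = f(a + ih); the adaptive alternative instead subdivides the range automatically until the error tolerances are met.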

B.2.13 Integrating n functions of m variables


When a model defining n functions of m variables has been successfully parsed by program usermod, adaptive integration can be attempted if m > 1. For this to succeed, the user must set the m lower limits (blim) and the m upper limits (tlim) to sensible values, and it probably will be necessary to alter the error tolerances for success (i.e. zero IFAIL). Where users wish to embed the calculation of an adaptive quadrature procedure inside a main model, it is essential to investigate the quadrature independently, especially if the quadrature is part of a sub-model involving root finding. Try usermod4.tf1 which is a single four dimensional integral (i.e. n = 1 and m = 4) that should be integrated between zero and one for all four variables. Observe, in this model, that y(1), y(2), y(3), y(4) are the variables, because m > 3.

B.2.14 Calling sub-models from user-defined models


B.2.14.1 The command putpar

This command is used to communicate parameters p(i) to a sub-model. It must be used to transfer the current parameter values into global storage locations if it is wished to use them in a subsidiary model. Unless the command putpar is used in a main model, the sub-models have no access to parameter values to enable the command p(i) to add parameter i to the sub-model stack. The stack length is unchanged. Note that the storage locations for putpar are initialized to zero so, if you do not use putpar at the start of the main model, calls to p(i) in subsequent sub-models will not lead to a crash, they will simply use p(i) = 0. The command putpar cannot be used in a subsidiary model, of course.


B.2.14.2 The command value(.)

This command is used to evaluate a subsidiary model. It uses the current values for independent variables to evaluate subsidiary model i. The stack length is increased by one, as the value returned by function evaluation is added to the top of the stack. The command putpar must be used before value(i) if it is wished to use the main parameters in subsidiary model number i. It is important to make sure that a subsidiary model is correct, by testing it as a main model, if possible, before using it as a subsidiary model. You must be careful that the independent variables passed to the sub-model for evaluation are the ones intended. Of course, value can call sub-models which themselves can call root, and/or quad.
B.2.14.3 The command quad(.)

This command is used to estimate an integral by adaptive quadrature. It uses the epsabs, epsrel, blim and tlim values to integrate model i and place the return value on the stack. The values assigned to the blim and tlim arrays are the limits for the integration. If the model i requires j independent variables then j blim and tlim values must be set before quad(i) is used. The length of the stack increases by one, the return value being placed on the top of the stack. The command putpar must be used before quad(i) if it is wished to use the main parameters in the subsidiary model number i. Adaptive quadrature cannot be expected to work correctly if the range of integration includes sharp spikes or long stretches of near-zero values, e.g., the extreme upper tail of a decaying exponential. The routines used (D01AJF and D01EAF) cannot really handle infinite ranges, but excellent results can be obtained using commonsense extreme limits, e.g., several relaxation times for a decaying exponential. With difficult problems it will be necessary to increase epsrel and epsabs.
B.2.14.4 The command convolute(.,.)

When two or more sub-models have been defined, say model(i) = f(x) and model(j) = g(x), a special type of adaptive quadrature, which is actually a special case of the quad(.) command just explained, can be invoked to evaluate the convolution integral
$$f * g = \int_0^B f(u)\,g(B - u)\,du = g * f$$
using the command convolute(i,j). To illustrate this type of model, consider the convolution of an exponential input function and gamma response function defined by the test file convolve.mod shown below.

%
integral: from 0 to x of f(u)*g(x - u) du, where
f(t) = exp(-p(1)*t)
g(t) = [p(2)^2]*exp(-p(2)*t)
%
1 equation
1 variable
2 parameters
%
putpar
p(2)
p(2)
multiply
put(1)
0.0001
epsabs
0.001
epsrel
0
blim(1)
x
tlim(1)
convolute(1,2)
f(1)
%
begin{model(1)}
%
Example: exponential decay, exp(-p(1)*x)
%
1 equation
1 variable
1 parameter
%
p(1)
x
multiply
negative
exponential
f(1)
%
end{model(1)}
begin{model(2)}
%
Example: gamma density of order 2
%
1 equation
1 variable
2 parameters
%
p(2)
x
multiply
negative
exponential
x
multiply
get(1)
multiply
f(1)
%
end{model(2)}

The command putpar communicates the parameters from the main model to the sub-models, the quadrature precision is controlled by epsabs and epsrel and, irrespective of which models are used in the convolution, the limits of integration are always input using the blim(1) and tlim(1) commands just before using the convolute(.,.) command. Often the response function has to be normalized, usually to integrate to 1 over the overall range, and the prior squaring of p(2) to use as a normalizing factor for the gamma density in this case is done to save multiplication each time model(2) is called by the quadrature routine. Such models are often used for deconvolution by curve fitting in situations where the sub-models are known, but unknown parameters have to be estimated from output data with associated error, and this technique should not be confused with graphical deconvolution described on page 271.


B.2.14.5 The command root(.)

This command is used to estimate a zero of a sub-model iteratively. It uses the epsabs, epsrel, blim and tlim values to find a root for model i and places the return value on the stack. The values assigned to blim(1) and tlim(1) are the limits for root location. The length of the stack increases by one, the root value being placed on the top of the stack. The command putpar must be used before root(i) if it is wished to use the main parameters in the subsidiary model i. The limits A = blim(1) and B = tlim(1) are used as starting estimates to bracket the root. If f(A)f(B) > 0 then the range (A, B) is expanded by up to ten orders of magnitude (without changing blim(1) or tlim(1)) until f(A)f(B) < 0. If this or any other failure occurs, the root is returned as zero. Note that A and B will not change sign, so you can search for, say, just positive roots. If this is too restrictive, make sure blim(1)*tlim(1) < 0. C05AZF is used, and with difficult problems it will be necessary to increase epsrel.
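As a sketch of the layout (a hypothetical file added for illustration, not one of the distributed test files, using only commands documented above), the following main model returns the positive root of x^2 - p(1) = 0, bracketed by blim(1) = 0 and tlim(1) = 10:

%
Example: hypothetical model locating the positive root of
f(x) = x*x - p(1) using root(1)
%
1 equation
1 variable
1 parameter
%
x
pop          x is not needed here; this silences the unused-variable message
putpar       communicate p(1) to the sub-model
0
blim(1)
10
tlim(1)
root(1)      push the located root onto the stack
f(1)
%
begin{model(1)}
%
sub-model 1: f = x*x - p(1)
%
1 equation
1 variable
1 parameter
%
x
x
multiply
p(1)
subtract
f(1)
%
end{model(1)}

Here f(0) = -p(1) and f(10) = 100 - p(1), so the bracketing condition f(A)f(B) < 0 holds for 0 < p(1) < 100, and the automatic range expansion described above would take over otherwise.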
B.2.14.6 The command value3(.,.,.)

This is a very powerful command which is capable of many applications of the form: if ... elseif ... else. If the top stack number is negative value(i) is called, if it is zero (to machine precision) value(j) is called, and if it is positive value(k) is called. It relies on the presence of correctly formatted sub-models i, j and k of course, but the values returned by sub-models i, j and k are arbitrary, as almost any code can be employed in models i, j and k. The top value of the stack is replaced by the value returned by the appropriate sub-model. This command is best used in conjunction with the command order, which places either −1, 0 or 1 on the stack.
B.2.14.7 The command order

Given a lower limit, an initial value, and an upper limit, this command puts −1 on the stack for values below the lower limit, puts 0 on the stack for values between the limits, and puts 1 on the stack for values in excess of the upper limit.

0
x
4
order
value3(1,2,3)
f(1)

This code is used in the model updownup.mod to generate a model that changes definition at the critical swap-over points x = 0 and x = 4. To summarize, the effect of this command is to replace the top three stack elements, say a, w, b where a < b, by either −1 if w ≤ a, 0 if a < w ≤ b, or 1 if w > b, so reducing the stack length by two.
B.2.14.8 The command middle

Given a lower limit, an initial value, and an upper limit, this command reflects values below the lower limit back up to the lower limit and decreases values in excess of the upper limit back down to the upper limit, but leaves intermediate values unchanged. For example, the code

0
x
1
middle

will always place a value w on the stack for 0 ≤ w ≤ 1, and w = x only if 0 ≤ x ≤ 1. To summarize, the effect of this command is to replace the top three stack elements, say a, w, b where a < b, by either a if w ≤ a, w if a < w ≤ b, or b if w > b, so reducing the stack length by two.


B.2.14.9 The syntax for subsidiary models

The format for defining sub-models is very strict and must be used exactly as now described. Suppose you want to use n independent equations. Then n separate user files are developed and, when these are tested, they are placed in order directly after the end of the main model, each surrounded by a begin{model(i)} and end{model(i)} command. So, if you want to use a particular model as a sub-model, you first of all develop it using program usermod then, when it is satisfactory, just add it to the main model. However, note that sub-models are subject to several restrictions.
B.2.14.10 Rules for using sub-models

• Sub-model files must be placed in numerically increasing order at the end of the main model file, as in the skeleton shown after this list. Model parsing is abandoned if a sub-model occurs out of sequence.

• There must be no spaces or non-model lines between the main model and the subsidiary models, or between any subsequent sub-models.

• Sub-models cannot be differential equations.

• Sub-models of the type being described must define just one equation.

• Sub-models are not tested for consistent put and get commands, since puts might be defined in the main model, etc.

• Sub-models cannot use putpar, since putpar only has meaning in a main model.

• Sub-models can use the commands value(i), root(j) and quad(k), but it is up to users to make sure that all calls are consistent.

• When the command value(i) is used, the arguments passed to the sub-model for evaluation are the independent variables at the level at which the command is used. For instance, if the main model uses value(i), then value(i) will be evaluated at the x, y, z, etc. of the main model, but with model(i) being used for evaluation. Note that y(k) must be used for functions with more than three independent variables, i.e. when x, y and z no longer suffice. It is clear that if a model uses value(i), then the number of independent variables in that model must be equal to or greater than the number of independent variables in sub-model(i).

• When the commands root(i) and quad(j) are used, the independent variables in the sub-model numbers i and j are always dummy variables.

• When developing models and subsidiary models independently you may get error messages about x not being used, or a get without a corresponding put. Often these can be suppressed by using a pop until the model is developed. For instance, x followed by pop will silence the message about x not being used.
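The required arrangement can be summarized by this skeleton, a sketch of the file layout only, where the dotted lines stand for the usual header and code sections:

...header and code of the main model...
f(1)
%
begin{model(1)}
...header and code of sub-model 1...
end{model(1)}
begin{model(2)}
...header and code of sub-model 2...
end{model(2)}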
B.2.14.11 Nesting subsidiary models

The subsidiary models can be considered to be independent except when there is a clash that would lead to recursion. For instance, value(1) can call model(1) which uses root(2) to find a root of model(2), which calls quad(3) to integrate model(3). However, at no stage can there be simultaneous use of the same model as value(k), and/or quad(k), and/or root(k): the same subsidiary model cannot be used by more than one instance of value, quad or root at the same time. This is just common sense, since the virtual stack for model k can only support one function evaluation at a time. Obviously there can be at most one instance each of value, root and quad simultaneously.


B.2.14.12 IFAIL values for D01AJF, D01EAF and C05AZF

Since these iterative techniques may be used inside optimization or numerical integration procedures, the soft fail option IFAIL = 1 is used. If the SIMFIT version of these routines is used, a silent exit will occur and failure will not be communicated to users. So it is up to users to be very careful that all calls to quadrature and root finding are consistent and certain to succeed. Default function values of zero are returned on failure.
B.2.14.13 Test les illustrating how to call sub-models

The test files usermodx.tf? give numerous examples of how to use sub-models for function evaluation, root finding, and adaptive quadrature.

B.2.15 Calling special functions from user-defined models


The special functions commonly used in numerical analysis, statistics, mathematical simulation and data fitting can be called by one-line commands as in this table.
B.2.15.1 Table of special function commands

Command          NAG      Description
arctanh(x)       S11AAF   Inverse hyperbolic tangent
arcsinh(x)       S11ABF   Inverse hyperbolic sine
arccosh(x)       S11ACF   Inverse hyperbolic cosine
ai(x)            S17AGF   Airy function Ai(x)
dai(x)           S17AJF   Derivative of Ai(x)
bi(x)            S17AHF   Airy function Bi(x)
dbi(x)           S17AKF   Derivative of Bi(x)
besj0(x)         S17AEF   Bessel function J0
besj1(x)         S17AFF   Bessel function J1
besy0(x)         S17ACF   Bessel function Y0
besy1(x)         S17ADF   Bessel function Y1
besi0(x)         S18AEF   Bessel function I0
besi1(x)         S18AFF   Bessel function I1
besk0(x)         S18ACF   Bessel function K0
besk1(x)         S18ADF   Bessel function K1
phi(x)           S15ABF   Normal cdf
phic(x)          S15ACF   Normal cdf complement
erf(x)           S15AEF   Error function
erfc(x)          S15ADF   Error function complement
dawson(x)        S15AFF   Dawson integral
ci(x)            S13ACF   Cosine integral Ci(x)
si(x)            S13ADF   Sine integral Si(x)
e1(x)            S13AAF   Exponential integral E1(x)
ei(x)            ......   Exponential integral Ei(x)
rc(x,y)          S21BAF   Elliptic integral RC
rf(x,y,z)        S21BBF   Elliptic integral RF
rd(x,y,z)        S21BCF   Elliptic integral RD
rj(x,y,z,r)      S21BDF   Elliptic integral RJ
sn(x,m)          S21CAF   Jacobi elliptic function SN
cn(x,m)          S21CAF   Jacobi elliptic function CN
dn(x,m)          S21CAF   Jacobi elliptic function DN
ln(1+x)          S01BAF   ln(1 + x) for x near zero
mchoosen(m,n)    ......   Binomial coefficient
gamma(x)         S14AAF   Gamma function


lngamma(x)       S14ABF   log Gamma function
psi(x)           S14ADF   Digamma function, (d/dx)log(Gamma(x))
dpsi(x)          S14ADF   Trigamma function, (d^2/dx^2)log(Gamma(x))
igamma(x,a)      S14BAF   Incomplete Gamma function
igammac(x,a)     S14BAF   Complement of Incomplete Gamma function
fresnelc(x)      S20ADF   Fresnel C function
fresnels(x)      S20ACF   Fresnel S function
bei(x)           S19ABF   Kelvin bei function
ber(x)           S19AAF   Kelvin ber function
kei(x)           S19ADF   Kelvin kei function
ker(x)           S19ACF   Kelvin ker function
cdft(x,m)        G01EBF   cdf for t distribution
cdfc(x,m)        G01ECF   cdf for chi-square distribution
cdff(x,m,n)      G01EDF   cdf for F distribution (m = num, n = denom)
cdfb(x,a,b)      G01EEF   cdf for beta distribution
cdfg(x,a,b)      G01EFF   cdf for gamma distribution
invn(x)          G01FAF   inverse normal
invt(x,m)        G01FBF   inverse t
invc(x,m)        G01FCF   inverse chi-square
invb(x,a,b)      G01FEF   inverse beta
invg(x,a,b)      G01FFF   inverse gamma
spence(x)        ......   Spence integral: 0 to x of -(1/y)log|(1-y)|
clausen(x)       ......   Clausen integral: 0 to x of -log(2*sin(t/2))
struveh(x,m)     ......   Struve H function order m (m = 0, 1)
struvel(x,m)     ......   Struve L function order m (m = 0, 1)
kummerm(x,a,b)   ......   Confluent hypergeometric function M(a,b,x)
kummeru(x,a,b)   ......   U(a,b,x), b = 1 + n, the logarithmic solution
lpol(x,m,n)      ......   Legendre polynomial of the 1st kind, P_nm(x),
                          -1 =< x =< 1, 0 =< m =< n
abram(x,m)       ......   Abramovitz function order m (m = 0, 1, 2), x > 0,
                          integral: 0 to infinity of t^m exp(-t^2 - x/t)
debye(x,m)       ......   Debye function of order m (m = 1, 2, 3, 4),
                          (m/x^m)[integral: 0 to x of t^m/(exp(t) - 1)]
fermi(x,a)       ......   Fermi-Dirac integral, (1/Gamma(1 + a))[integral:
                          0 to infinity of t^a/(1 + exp(t - x))]
heaviside(x,a)   ......   Heaviside unit function h(x - a)
delta(i,j)       ......   Kronecker delta function
impulse(x,a,b)   ......   Unit impulse function (small b for Dirac delta)
spike(x,a,b)     ......   Unit triangular spike function
gauss(x,a,b)     ......   Gauss pdf
sqwave(x,a)      ......   Square wave amplitude 1, period 2a
rtwave(x,a)      ......   Rectified triangular wave amplitude 1, period 2a
mdwave(x,a)      ......   Morse dot wave amplitude 1, period 2a
stwave(x,a)      ......   Sawtooth wave amplitude 1, period a
rswave(x,a)      ......   Rectified sine wave amplitude 1, period pi/a
shwave(x,a)      ......   Sine half-wave amplitude 1, period 2*pi/a
uiwave(x,a,b)    ......   Unit impulse wave area 1, period a, width b

Also, to allow users to document their models, all lines starting with a !, a / or a * character within models are ignored and treated as comment lines. Any of the above commands included as a line in a SIMFIT model or sub-model simply takes the top stack element as argument and replaces it by the function value. The NAG routines indicated can be consulted for details, as equivalent routines, agreeing very closely with the NAG specifications, are used. The soft fail (IFAIL = 1) options have been used, so the simulation will not terminate on an error condition; a default value will be returned instead. Obviously it is up to users to make sure that sensible arguments are supplied, for instance positive degrees of freedom, F or chi-square arguments, etc. To help prevent this problem, and to provide additional opportunities, the command middle (synonym mid) is available.
B.2.15.2 Using the command middle with special functions

Given a lower limit, an initial value, and an upper limit, this command reflects values below the lower limit back up to the lower limit and decreases values in excess of the upper limit back down to the upper limit, but leaves intermediate values unchanged. For example, the code

5
0.001
x
0.999
middle
invt(x,n)

will always return a zero IFAIL when calculating a percentage point for the t distribution with 5 degrees of freedom, because the argument will always be in the range (0.001, 0.999) whatever the value of x.
B.2.15.3 Special functions with one argument

The top stack element will be popped and used as an argument, so the routines can be used in several ways. For instance the code

x
phi(x)
f(1)

would simulate model 1 as a normal cdf, while the code

get(4)
phi(x)
f(3)

would return model 3 as the normal cdf of whatever was stored in storage location 4.
B.2.15.4 Special functions with two arguments

The top stack element is popped and used as x, while the second is popped and used as variable a, n, or y, as the case may be. For instance the code

10
x
cdft(x,n)

would place the t distribution cdf with 10 degrees of freedom on the stack, while the code

5
0.975
invt(x,n)

would place the critical value for a two-tailed t test with 5 degrees of freedom at a confidence level of 95% on the stack. Another simple example would be


p(1)
x
heavi(x,a)
f(1)

which would return the function value 0 for x < p(1) but 1 otherwise.
B.2.15.5 Special functions with three or more arguments

The procedure is a simple extension of that described for functions of two arguments. First the stack is prepared as ..., u, v, w, z, y, x but, after the function call, it would be ..., u, v, w, f(x, y, z). For example, the code

z
y
x
rf(x,y,z)
f(11)

would return model 11 as the elliptic function RF, since f(11) would have been defined as a function of at least three variables. However, the code

get(3)
get(2)
get(1)
rd(x,y,z)
1
add
f(7)

would define f(7) as one plus the elliptic function RD evaluated at whatever was stored in locations 3 (i.e. z), 2 (i.e. y) and 1 (i.e. x).
B.2.15.6 Test files illustrating how to call special functions

Three test files have been supplied to illustrate these commands:

usermods.tf1: special functions with one argument
usermods.tf2: special functions with two arguments
usermods.tf3: special functions with three arguments

These should be used in program usermod by repeatedly editing, reading in the edited files, simulating, etc. to explore the options. Users can choose which of the options provided is to be used, by simply uncommenting the desired option and leaving all the others commented. Note that these are all set up for f(1) as a function of one variable and that, by commenting and removing comments so that only one command is active at any one time, the models can be plotted as continuous functions. Alternatively, singly calculated values can be compared to tabulated values, which should be indistinguishable if your editing is correct.

B.2.16 Operations with scalars and vectors


B.2.16.1 The command store(j)

This command is similar to the put(j) command, but there is an important difference: the command put(j) is executed every time the model is evaluated, but the command store(j) is only executed when the model file is parsed for the first time. So store(j) is really equivalent to a data initialization statement at compile time. For instance, the code


3
store(14)

would initialize storage location 14 to the value 3. If no further put(14) is used, then storage location 14 would preserve the value 3 for all subsequent calculations in the main model or any sub-model, so that storage location 14 could be regarded as a global constant. Of course any put(14) in the main model or any sub-model would overwrite storage location 14. The main use for the store command is to define special values that are to be used repeatedly for model evaluation, e.g., coefficients for a Chebyshev expansion. For this reason there is another very important difference between put(j) and store(j): store(j) must be preceded by a literal constant, e.g., 3.2e-6, and cannot be assigned as the end result of a calculation, because storage initialization is done before calculations. To summarize: store(j) must be preceded by a numerical value, when it pops this top stack element after copying it into storage location j. So the stack length is decreased by one, to initialize storage location j, but only on the first pass through the model, i.e. when parsing.
B.2.16.2 The command storef(file)

Since it is tedious to define more than a few storage locations using the command store(j), the command storef(*.*), for some named file instead of *.*, provides a mechanism for initializing an arbitrary number of storage locations at first pass using contiguous locations. The file specified by the storef(*.*) command is read and, if it is a SIMFIT vector file, all the successive components are copied into corresponding storage locations. An example of this is the test model file cheby.mod (and the related data file cheby.dat) which should be run using program usermod to see how a global vector is set up for a Chebyshev approximation. Other uses could be when a model involves calculations with a set of fixed observations, such as a time series. To summarize: the command storef(mydata) will read the components of any n-dimensional SIMFIT vector file, mydata, into n successive storage locations starting at position 1, but only when the model file is parsed at first pass. Subsequent use of put(j) or store(j) for j in the range (1, n) will overwrite the previous effect of storef(mydata).
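As a sketch, if mydata.dat were a SIMFIT vector file (a hypothetical name) with at least two components a(1) and a(2), the following fragment would define f(1) = a(1)x + a(2), with the two coefficients loaded into storage locations 1 and 2 when the file is parsed.

storef(mydata.dat)
x
get(1)
multiply
get(2)
add
f(1)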
B.2.16.3 The command poly(x,m,n)

This evaluates m terms of a polynomial by Horner's method of nested multiplication, with terms starting at store(n) and proceeding as far as store(n + m - 1). The polynomial will be of degree m - 1 and it will be evaluated in ascending order. For example, the code

1
store(10)
0
store(11)
-1
store(12)
10
3
x
poly(x,m,n)

will place the value of f(x) = 1 - x^2 on the stack, where x is the local argument. Of course, the contents of the storage locations can also be set by put(j) commands, which would then overwrite the previous store(j) command. For instance, the following code

5
put(12)
10
3
2
poly(x,m,n)

used after the previous code, would now place the value 21 on the stack, since f(t) = 1 + 5t^2 = 21, and the argument is now t = 2. To summarize: poly(x,m,n) evaluates a polynomial of degree m - 1, using successive storage locations n, n + 1, n + 2, ..., n + m - 1, i.e. the constant term is storage location n, and the coefficient of degree m - 1 is storage location n + m - 1. The argument is whatever value is on the top of the stack when poly(x,m,n) is invoked. This command takes three arguments x, m, n off the stack and replaces them by one value, so the stack is decreased by two elements. If there is an error in m or n, e.g., m or n negative, there is no error message, and the value f(x) = 0 is returned.
B.2.16.4 The command cheby(x,m,n)

The Chebyshev coefficients are first stored in locations n to n + m - 1, then the command cheby(x,m,n) will evaluate a Chebyshev expansion using the Broucke method with m terms. Note that the first term must be twice the constant term since, as usual, only half the constant is used in the expansion. This code, for instance, will return the Chebyshev approximation to exp(x).

2.532132
store(20)
1.130318
store(21)
0.271495
store(22)
0.044337
store(23)
0.005474
store(24)
20
5
x
cheby(x,m,n)

Note that, if the numerical values are placed on the stack sequentially, then they obviously must be peeled off in reverse order, as follows.

2.532132
1.130318
0.271495
0.044337
0.005474
store(24)
store(23)
store(22)
store(21)
store(20)
20
5
x
cheby(x,m,n)


To summarize: cheby(x,m,n) evaluates a Chebyshev approximation with m terms, using successive storage locations n, n + 1, n + 2, ..., n + m - 1, i.e. twice the constant term is in storage location n, and the coefficient of T(m - 1) is in storage location n + m - 1. The argument is whatever value is on the top of the stack when cheby(x,m,n) is invoked. This command takes three arguments x, m, n off the stack and replaces them by one value, so the stack is decreased by two elements. If there is an error in x, m or n, e.g., x not in (-1, 1), or m or n negative, there is no error message, and the value f(x) = 0 is returned. Use the test file cheby.mod with program usermod to appreciate this command.
B.2.16.5 The commands l1norm(m,n), l2norm(m,n) and linorm(m,n)

The L_p norms are calculated for a vector with m terms, starting at storage location n, i.e. l1norm calculates the sum of the absolute values, l2norm calculates the Euclidean norm, while linorm calculates the infinity norm (that is, the largest absolute value in the vector). It should be emphasized that l2norm(m,n) puts the Euclidean norm on the stack, that is the length of the vector (the square root of the sum of squares of the elements) and not the square of the distance. For example, the code

2
put(5)
-4
put(6)
3
put(7)
4
put(8)
1
put(9)
l1norm(3,5)

would place 9 on the stack, while the command l2norm(5,5) would put 6.78233 on the stack, and the command linorm(5,5) would return 4. To summarize: these commands take two arguments off the stack and calculate either the sum of the absolute values, the square root of the sum of squares, or the largest absolute value in m successive storage locations starting at location n. The stack length is decreased by one, since m and n are popped and replaced by the norm. There are no error messages and, if an error is encountered, a zero value is returned.
B.2.16.6 The commands sum(m,n) and ssq(m,n)

As there are occasions when it is useful to be able to add up the signed values or the squares of values in storage, these commands are provided. For instance, the code

1
2
3
4
put(103)
put(102)
put(101)
put(100)
100
4
sum(m,n)
f(1)
101
3
ssq(m,n)
f(2)

would assign 10 to function 1 and 29 to function 2. To summarize: these commands take two arguments off the stack and then replace them with either the sum of m storage locations starting at position n, or the sum of squares of m storage locations starting at position n, so decreasing the stack length by one.
B.2.16.7 The command dotprod(l,m,n)

This calculates the scalar product of two vectors of length l which are stored in successive locations starting at positions m and n. To summarize: the stack length is decreased by 2, as three arguments are consumed, and the top stack element is then set equal to the dot product, unless an error is encountered, in which case zero is returned.
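For instance, assuming the arguments are pushed in the same order as for poly(x,m,n) (an assumption made for this sketch), the following fragment fills locations 1 to 3 and 4 to 6 and then forms the scalar product 1*4 + 2*5 + 3*6 = 32.

1
put(1)
2
put(2)
3
put(3)
4
put(4)
5
put(5)
6
put(6)
4
1
3
dotprod(l,m,n)
f(1)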
B.2.16.8 Commands to use mathematical constants

The following commands are provided to facilitate model building.

Command    Value                    Comment
pi          3.141592653589793e+00   pi
piby2       1.570796326794897e+00   pi divided by two
piby3       1.047197551196598e+00   pi divided by three
piby4       7.853981633974483e-01   pi divided by four
twopi       6.283185307179586e+00   pi multiplied by two
root2pi     2.506628274631000e+00   square root of two pi
deg2rad     1.745329251994330e-02   degrees to radians
rad2deg     5.729577951308232e+01   radians to degrees
root2       1.414213562373095e+00   square root of two
root3       1.732050807568877e+00   square root of three
eulerg      5.772156649015329e-01   Euler's gamma
lneulerg   -5.495393129816448e-01   log (Euler's gamma)

To summarize: these constants are merely added passively to the stack and do not affect any existing stack elements. To use the constants, the necessary further instructions would be required. So, for instance, to transform degrees into radial measure, the code

94.25
deg2rad
multiply

would replace the 94.25 degrees by the equivalent radians.

B.2.17 Integer functions


Sometimes integers are needed in models, for instance, as exponents, as summation indices, as logical flags, as limits in do loops, or as pointers in case constructs, etc. So there are special integer functions that take the top argument off the stack, whatever number it is (say x), then replace it by an appropriate integer as follows.


Command    Description
int(x)     replace x by the integer part of x
nint(x)    replace x by the nearest integer to x
sign(x)    replace x by -1 if x < 0, by 0 if x = 0, or by 1 if x > 0

When using integers with SIMFIT models it must be observed that only double precision floating point numbers are stored, and all calculations are done with such numbers, so that 0 actually means 0.0 to machine precision. So, for instance, when using these integer functions with real arguments to create logicals or indices for summation, etc., the numbers on the stack that are to be used as logicals or integers are actually transformed dynamically into integers when required at run time, using the equivalent of nint(x) to generate the appropriate integers. Because of this, you should note that code such as

...
11.3
19.7
1.2
int(x)
2.9
nint(x)
divide

would result in 1.0/3.0 being added to the stack (i.e. 0.3333...) and not 1/3 (i.e. 0) as it would for true integer division, leading to the stack

..., 11.3, 19.7, 0.3333333

B.2.18 Logical functions


Logical variables are stored in the global storage vector as either 1.0 (so that nint(x) = 1 = true) or as 0.0 (so that nint(x) = 0 = false). The logical functions either generate logical variables by testing the magnitude of the arbitrary stack value (say x) with respect to zero (to machine precision), or they accept only logical arguments (say m or n) and return an error message if the stack values are not pre-set to 0.0 or 1.0. Note that logical variables (i.e. Booleans) can be stored using put(i) and retrieved using get(i), so that logical tests of any order of complexity can be constructed.

Command     Description
lt0(x)      replace x by 1 if x < 0, otherwise by 0
le0(x)      replace x by 1 if x =< 0, otherwise by 0
eq0(x)      replace x by 1 if x = 0, otherwise by 0
ge0(x)      replace x by 1 if x >= 0, otherwise by 0
gt0(x)      replace x by 1 if x > 0, otherwise by 0
not(m)      replace m by NOT(m), error if m is not 0 or 1
and(m,n)    replace m and n by AND(m,n), error if m or n is not 0 or 1
or(m,n)     replace m and n by OR(m,n), error if m or n is not 0 or 1
xor(m,n)    replace m and n by XOR(m,n), error if m or n is not 0 or 1
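For example, this sketch stores the truth value of the condition 0 < x < 4 in location 1, where it can then drive the conditional commands of the next section.

x
gt0(x)
4
x
subtract
gt0(x)
and(m,n)
put(1)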

B.2.19 Conditional execution


Using these integer and logical functions in an appropriate sequence interspersed with put(.) and get(.) commands, any storage location (say j) can be set up to test whether any logical condition is true or false. So the commands if(.) and ifnot(.) are provided to select model features depending on logical variables. The idea is to calculate the logical variables using the integer and logical functions, then load them into storage


using put(.) commands. The if(.) and ifnot(.) commands inspect the designated storage locations and return 1 if the storage location has the value 1.0 (to machine precision), or 0 otherwise, even if the location is not 0.0 (to machine precision). The logical values returned are not added to the stack but, if a 1 is returned, the next line of the model code is executed whereas, if a 0 is returned, the next line is skipped.

Command     Description
if(j)       execute the next line only if storage(j) = 1.0
ifnot(j)    execute the next line only if storage(j) = 0.0

Note that very extensive logical tests and blocks of code for conditional execution, do loops, while and case constructs can be generated by using these logical functions sequentially but, because not all the lines of code will be active, the parsing routines will indicate the number of if(.) and ifnot(.) commands and the resulting potentially unused lines of code. This information is not important for correctly formatted models, but it can be used to check or debug code if required. Consult the test file if.mod to see how to use logical functions.
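As a concrete sketch (following the conventions just described, though not one of the distributed test files), the following model code returns f(1) = |x|, because the negative command on the line after if(1) is only executed when x < 0.

x
lt0(x)
put(1)
x
if(1)
negative
f(1)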

B.2.20 Arbitrary functions with arbitrary arguments


The sub-models described so far for evaluation, integration, root finding, etc. are indexed at compile time and take dummy arguments, i.e. the ones supplied by the SIMFIT calls to the model evaluation subroutines. However, sometimes it is useful to be able to evaluate a sub-model with arbitrary arguments added to the stack, or arguments that are functions of the main arguments. Again, it is useful to be able to evaluate an arbitrary function chosen dynamically from a set of sub-models indexed by an integer parameter calculated at run time, rather than read in at compile time when the model is first parsed. So, to extend the user-defined model syntax, the command user1(x,m) is provided. The way this works involves three steps:

1. an integer (m) is put on the stack to denote the required model,
2. calculations are performed to put the argument (x) on the stack, then
3. the user defined model is called and the result placed on the stack.

For instance the code

...
14.7
3
11.3
user1(x,m)

would result in

..., 14.7, 12.5

if the value of sub-model number 3 is 12.5 at an argument of 11.3. Similar syntax is used for functions of two and three variables, i.e.

user1(x,m)
user2(x,y,m)
user3(x,y,z,m)

Clearly the integer m can be literal, calculated or retrieved from storage, but it must correspond to a sub-model that has been defined in the sequence of sub-models, and the calculated arbitrary arguments x, y, z must be sensible values. For instance the commands


2
x
user1(x,m)

and

value(2)

are equivalent. However, the first form is more versatile, as the model number (m, 2 in this case) and argument (x, the dummy value in this case) can be altered dynamically as the result of stack calculations, while the second form will always invoke the case with m = 2 and x = the subroutine argument to the model. The model file user1.mod illustrates how to use the user1(x,m) command.

B.2.21 Using usermod with user-defined models


In order to assist users to develop their own models a special program, usermod, is distributed as part of the SIMFIT package. This provides the following procedures.

• After a model has been selected it is checked for consistency. If all is well, the appropriate parts of the program will now be able to use that particular model. If the model has an error it will be specified and you can use your editor to attempt a repair. Note that, after any editing, the model file must be read in again for checking.

• After a model has been accepted you can check it, that is, supply the arguments and observe the stack operations, which are displayed colour-coded as the model is evaluated.

• For single functions of one variable you can plot, find zeros or integrate.

• For single functions of two variables you can plot, or integrate.

• For several functions of one variable you can plot selected functions.

• For n functions of m variables you can find simultaneous zeros if n = m, integrate for any n and m, or optimize if m = n - 1.

• Default settings are provided for parameters, limits, tolerances and starting estimates, and these can be edited interactively as required. Note that parameters p(i) used by the models will be those set from the main program, and the same is true for the absolute error tolerance epsabs, the relative error tolerance epsrel, and the integration limits blim(i) and tlim(i).

• A default template is provided which users can edit to create their own models.

B.2.22 Locating a zero of one function of one variable


Users must supply a relative error accuracy factor epsrel, two values A and B, and a constant C such that, for g(x) = f(x) - C, then g(A)g(B) < 0. If the values supplied are such that g(A)g(B) > 0, the program will attempt to enlarge the interval in order to bracket a zero, but it will not change the sign of A or B. Users must do this if necessary by editing the starting estimates A and B. The program returns the root as X if successful, where X and Y have been located such that

   |X - Y| <= 2.0*epsrel*|Z|

and |g(Z)| is the smallest known function value, as described for NAG routine C05AZF.


As an example, input the special function model file usermods.tf1 which defines one equation in one variable, namely the cumulative normal distribution function (page 286). Input f(x) = 0.975 so the routine is required to estimate x such that

   0.975 = (1/sqrt(2*pi)) * integral from -infinity to x of exp(-t^2/2) dt

and, after setting some reasonable starting estimates, e.g., the defaults (-1, 1), the following message will be printed

   Success : Root = 1.96000E+00 (EPSREL = 1.00000E-03)

giving the root estimated by the SIMFIT equivalent of C05AZF.

B.2.23 Locating zeros of n functions of n variables


The model file must define a system of n equations in n variables and the program will attempt to locate x1, x2, ..., xn such that

   f_i(x1, x2, ..., xn) = 0, for i = 1, 2, ..., n.

Users must supply good starting estimates by editing the default y1, y2, ..., yn, or installing a new y vector from a file, and the accuracy can be controlled by varying epsrel, since the program attempts to ensure that

   |x - x*| <= epsrel*|x*|,

where x* is the true solution, as described for NAG routine C05NBF. Failure to converge will lead to nonzero IFAIL values, requiring new starting estimates. As an example, input the test file usermodn.tf4 which defines 9 equations in 9 variables and, after setting y(i) = 0 for i = 1, 2, ..., 9, select to locate zeros of n equations in n variables. The following table will result.

   From C05NBF: IFAIL = 0, FNORM = 7.448E-10, XTOL = 1.000E-03
   x(1) = -5.70653E-01 ... fvec(1) =  2.52679E-06
   x(2) = -6.81625E-01 ... fvec(2) =  1.56881E-05
   x(3) = -7.01732E-01 ... fvec(3) =  2.83570E-07
   x(4) = -7.04215E-01 ... fvec(4) = -1.30839E-05
   x(5) = -7.01367E-01 ... fvec(5) =  9.87684E-06
   x(6) = -6.91865E-01 ... fvec(6) =  6.55571E-06
   x(7) = -6.65794E-01 ... fvec(7) = -1.30536E-05
   x(8) = -5.96034E-01 ... fvec(8) =  1.17770E-06
   x(9) = -4.16411E-01 ... fvec(9) =  2.95110E-06

showing the solution vector and the corresponding vector of function values for the tridiagonal system

   (3 - 2x_1)x_1 - 2x_2 = -1
   -x_(i-1) + (3 - 2x_i)x_i - 2x_(i+1) = -1, i = 2, 3, ..., 8
   -x_8 + (3 - 2x_9)x_9 = -1

estimated by the SIMFIT equivalent of C05NBF.

B.2.24 Integrating one function of one variable


The program accepts a user defined model for a single function of one variable and returns two estimates I1 and I2 for the integral

   I = integral from A to B of f(x) dx,

where A and B are supplied interactively. The value of I1 is calculated by Simpson's rule and is rather approximate, while that of I2 is calculated by adaptive quadrature. For smooth functions over a limited range


these should agree fairly closely, but large differences suggest a difficult integral, e.g., with spikes, requiring more careful investigation. The values of epsrel and epsabs control the accuracy of adaptive quadrature such that, usually,

   |I - I2| <= tol
   tol = max(|epsabs|, |epsrel|*|I|)
   |I - I2| <= ABSERR <= tol,

as described for NAG routine D01AJF. As an example, input the file usermod1.tf5 which defines the function f(x) = p(1)exp(p(2)x) and, after setting p(1) = p(2) = 1 and requesting integration, gives

   Numerical quadrature over the range: 0.000E+00, 1.000E+00

   Number of Simpson divisions  =  200
   Area by the Simpson rule     =  1.71828E+00
   IFAIL (from D01AJF)          =  0
   EPSABS                       =  1.00000E-06
   EPSREL                       =  1.00000E-03
   ABSERR                       =  3.81535E-15
   Area by adaptive integration =  1.71828E+00

for the areas by Simpson's rule and the SIMFIT equivalent of D01AJF.

B.2.25 Integrating n functions of m variables


The program accepts a user defined model for n functions of m variables and estimates the n integrals

   I_i = integral from A_1 to B_1 ... integral from A_m to B_m of
         f_i(x_1, x_2, ..., x_m) dx_m ... dx_2 dx_1

for i = 1, 2, ..., n, where the limits are taken from the arrays A_i = blim(i) and B_i = tlim(i). The procedure only returns IFAIL = 0 when

   max over i of ABSEST(i) <= max(|epsabs|, |epsrel| * max over i of |FINEST(i)|),

where ABSEST(i) is the estimated absolute error in FINEST(i), the final estimate for integral i, as described for NAG routine D01EAF. As an example, input the test file d01fcf.mod which defines the function

   f(x_1, x_2, x_3, x_4) = integral from 0 to 1 (four times) of
   4u_1 u_3^2 exp(2u_1 u_3)/(1 + u_2 + u_4)^2 du_4 du_3 du_2 du_1

then, on requesting integration of n functions of m variables over the range (0, 1), the table


   IFAIL (from D01EAF) = 0
   EPSABS              = 1.00000E-06
   EPSREL              = 1.00000E-03

   Number   BLIM          TLIM
   1        0.00000E+00   1.00000E+00
   2        0.00000E+00   1.00000E+00
   3        0.00000E+00   1.00000E+00
   4        0.00000E+00   1.00000E+00

   Number   INTEGRAL      ABSEST
   1        5.75333E-01   1.07821E-04

will be printed, showing the results from the SIMFIT equivalent of D01EAF and D01FCF.

B.2.26 Bound-constrained quasi-Newton optimization


The user supplied model must define n + 1 functions of n variables as follows:

   f(1) = F(x1, x2, ..., xn)
   f(2) = dF/dx1
   f(3) = dF/dx2
   ...
   f(n+1) = dF/dxn

as the partial derivatives are required in addition to the function value. The limited memory quasi-Newton optimization procedure also requires several other parameters, as now listed.

• MHESS is the number of limited memory corrections to the Hessian that are stored. The value of 5 is recommended but, for difficult problems, this can be varied in the range 4 to 17.

• FACTR should be about 1.0e+12 for low precision, 1.0e+07 for medium precision, and 1.0e+01 for high precision. Convergence is controlled by FACTR and PGTOL, and will be accepted if

   |F_k - F_(k+1)| / max(|F_k|, |F_(k+1)|, 1) <= FACTR*EPSMCH

at iteration k + 1, where EPSMCH is machine precision, or if

   max over i of Projected Gradient(i) <= PGTOL.

• Starting estimates and bounds on the variables can be set by editing the defaults or by installing from a data file.

• The parameter IPRINT allows intermediate output every IPRINT iterations, and the final gradient vector can also be printed if required.

• The program opens two files at the start of each optimization session: w_usermod.err stores intermediate output every IPRINT iterations plus any error messages, while iterate.dat stores all iteration details, as for qnfit and deqsol, when they use the LBFGSB suite for optimization. Note that, when IPRINT > 100, full output, including intermediate coordinates, is written to w_usermod.err at each iteration.

As an example, input the model file optimum.mod, defining Rosenbrock's two dimensional test function

   f(x, y) = 100(y - x^2)^2 + (1 - x)^2

which has a unique minimum at x = 1, y = 1. The iteration, starting at x = -1.2, y = 1.0 with IPRINT = 5, proceeds as follows


   Iterate   F(x)         |prj.grd.|   Task
   1         6.9219E-01   5.0534E+00   NEW_X
   6         2.1146E-01   3.1782E+00   NEW_X
   11        1.7938E-02   3.5920E-01   NEW_X
   16        1.7768E-04   4.4729E-02   NEW_X
   20        5.5951E-13   7.2120E-06   CONVERGENCE:
                                       NORM OF PROJECTED GRADIENT <= PGTOL

   dF(x)/dx(1) =  7.21198E-06
   dF(x)/dx(2) = -2.87189E-06

and the coordinates for the optimization trajectory, shown plotted as a thick segmented line in the contour diagram on page 263, were taken from the file w_usermod.err, which was constructed from a separate run with IPRINT = 101. The parameter TASK informs users of the action required after each intermediate iteration, then finally it records the reason for termination of the optimization.
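To connect this with the model syntax of the preceding sections, here is a sketch of how such a model could be coded; the header wording and the use of storage locations are illustrative assumptions, and the distributed file optimum.mod should be consulted for the definitive version.

%
Rosenbrock's function with its partial derivatives
f(1) = 100(y - x^2)^2 + (1 - x)^2, f(2) = df/dx, f(3) = df/dy
%
3 equations
2 variables
0 parameters
%
y
x
x
multiply
subtract
put(1)
! location 1 holds y - x^2
1
x
subtract
put(2)
! location 2 holds 1 - x
get(1)
get(1)
multiply
100
multiply
get(2)
get(2)
multiply
add
f(1)
400
x
multiply
get(1)
multiply
negative
2
get(2)
multiply
subtract
f(2)
200
get(1)
multiply
f(3)
%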

Appendix C

Library of models
C.1 Mathematical models [Library: Version 2.0]
The SIMFIT libraries are used to generate exact data using program makdat, or to fit data using an advanced curve fitting program, e.g. qnfit. Version 2.0 of the SIMFIT library only contains a limited selection of models, but there are other versions available, with extra features and model equations. The models are protected to prevent overflow during function evaluation, and they return zero when called with meaningless arguments, e.g. the beta pdf with negative parameters or independent variable outside [0, 1]. After a model has been selected, it is initialized, and the parameters are described in more meaningful form, e.g. as rate constants or initial concentrations, etc. Also, any restrictions on the parameter values or range of independent variable are listed, and equations for singular cases such as 1/(p(2) - p(1)) when p(2) = p(1) are given.

C.2 Functions of one variable


C.2.1 Differential equations
These are integrated using Gear's method with an explicitly calculated Jacobian.

Irreversible MM S-depletion:
  dy/dx = -p(2)y/(p(1) + y); y = S, S(0) = p(3), P(0) = 0

Irreversible MM P-accumulation:
  dy/dx = p(2)(p(3) - y)/(p(1) + (p(3) - y)); y = P, S(0) = p(3), P(0) = 0

General S-depletion:
  dy/dx = -p(2)y/(p(1) + y) - p(3)y - p(4); y = S, S(0) = p(5), P(0) = 0

General P-accumulation:
  dy/dx = p(2)(p(5) - y)/(p(1) + (p(5) - y)) + p(3)(p(5) - y) + p(4); y = P, S(0) = p(5), P(0) = 0

Membrane transport (variable volume etc.):
  dy/dx = -p(3)(y - p(4))/(y^2 + p(1)y + p(2)); y(infinity) = p(4), y(0) = p(5)

Von Bertalanffy growth:
  dy/dx = p(1)y^p(2) - p(3)y^p(4); y(0) = p(5)

C.2.2 Systems of differential equations


The library has a selection of systems of 1, 2, 3, 4 and 5 differential equations which can be used for simulation and fitting by program deqsol. Also ASCII coordinate files called deqmod?.tf? and deqpar?.tf? are provided for the same models, to illustrate how to supply your own differential equations. Program deqsol can use the Adams methods and allows the use of Gear's method with an explicit Jacobian or else an internally approximated one.


C.2.3 Special models


Polynomial of degree n:
  p(n+1) + p(1)x + p(2)x^2 + ... + p(n)x^n

Order n : n rational function:
  [p(2n+1) + p(1)x + p(2)x^2 + ... + p(n)x^n] / [1 + p(n+1)x + p(n+2)x^2 + ... + p(2n)x^n]

Multi Michaelis-Menten functions:
  p(1)x/(p(n+1) + x) + p(2)x/(p(n+2) + x) + ... + p(n)x/(p(2n) + x) + p(2n+1)

Multi M-M in isotope displacement mode, with y = [Hot]:
  p(1)y/(p(n+1) + y + x) + p(2)y/(p(n+2) + y + x) + ... + p(n)y/(p(2n) + y + x) + p(2n+1)

Multi M-M in isotope displacement mode, with [Hot] subsumed:
  p(1)/(p(n+1) + x) + p(2)/(p(n+2) + x) + ... + p(n)/(p(2n) + x) + p(2n+1)

High/Low affinity sites:
  p(1)p(n+1)x/(1 + p(n+1)x) + p(2)p(n+2)x/(1 + p(n+2)x) + ... + p(n)p(2n)x/(1 + p(2n)x) + p(2n+1)

H/L affinity sites in isotope displacement mode, with y = [Hot]:
  p(1)p(n+1)y/(1 + p(n+1)(x + y)) + p(2)p(n+2)y/(1 + p(n+2)(x + y)) + ... + p(n)p(2n)y/(1 + p(2n)(x + y)) + p(2n+1)

H/L affinity sites in isotope displacement mode, with [Hot] subsumed:
  p(1)p(n+1)/(1 + p(n+1)x) + p(2)p(n+2)/(1 + p(n+2)x) + ... + p(n)p(2n)/(1 + p(2n)x) + p(2n+1)

Binding constants saturation function:
  [p(n+1)/n][p(1)x + 2p(2)x^2 + ... + np(n)x^n] / [1 + p(1)x + p(2)x^2 + ... + p(n)x^n] + p(n+2)

Binding constants in isotope displacement mode, with y = [Hot]:
  [p(n+1)y/n][p(1) + 2p(2)(x + y) + ... + np(n)(x + y)^(n-1)] / [1 + p(1)(x + y) + p(2)(x + y)^2 + ... + p(n)(x + y)^n] + p(n+2)

Adair constants saturation function:
  [p(n+1)/n][p(1)x + 2p(1)p(2)x^2 + ... + np(1)p(2)...p(n)x^n] / [1 + p(1)x + p(1)p(2)x^2 + ... + p(1)p(2)...p(n)x^n] + p(n+2)

Adair constants in isotope displacement mode, with y = [Hot]:
  [p(n+1)y/n][p(1) + 2p(1)p(2)(x + y) + ... + np(1)p(2)...p(n)(x + y)^(n-1)] / [1 + p(1)(x + y) + p(1)p(2)(x + y)^2 + ... + p(1)p(2)...p(n)(x + y)^n] + p(n+2)

Sum of n exponentials:
  p(1)exp(-p(n+1)x) + p(2)exp(-p(n+2)x) + ... + p(n)exp(-p(2n)x) + p(2n+1)

Sum of n functions of the form 1 - exp(-kx):
  p(1){1 - exp(-p(n+1)x)} + p(2){1 - exp(-p(n+2)x)} + ... + p(n){1 - exp(-p(2n)x)} + p(2n+1)

Sum of n sine functions:
  sum for i = 1 to n of p(i)sin(p(n+i)x + p(2n+i)), plus p(3n+1)

Sum of n cosine functions:
  sum for i = 1 to n of p(i)cos(p(n+i)x + p(2n+i)), plus p(3n+1)


Sum of n Gauss (Normal) pdf functions:
  [p(1)/(p(2n+1)sqrt(2 pi))]exp(-(1/2)[(x - p(n+1))/p(2n+1)]^2)
  + [p(2)/(p(2n+2)sqrt(2 pi))]exp(-(1/2)[(x - p(n+2))/p(2n+2)]^2) + ...
  + [p(n)/(p(3n)sqrt(2 pi))]exp(-(1/2)[(x - p(2n))/p(3n)]^2) + p(3n+1)

Sum of n Gauss (Normal) cdf functions:
  [p(1)/(p(2n+1)sqrt(2 pi))] integral from -infinity to x of exp(-(1/2)[(u - p(n+1))/p(2n+1)]^2) du
  + [p(2)/(p(2n+2)sqrt(2 pi))] integral from -infinity to x of exp(-(1/2)[(u - p(n+2))/p(2n+2)]^2) du + ...
  + [p(n)/(p(3n)sqrt(2 pi))] integral from -infinity to x of exp(-(1/2)[(u - p(2n))/p(3n)]^2) du + p(3n+1)

C.2.4 Biological models


Exponential growth/decay in three parameterizations:
  Parameterization 1: p(1)exp(-p(2)x) + p(3)
  Parameterization 2: exp(p(1) - p(2)x) + p(3)
  Parameterization 3: {p(1) - p(3)}exp(-p(2)x) + p(3)

Monomolecular growth: p(1){1 - exp(-p(2)x)} + p(3)

Logistic growth in four parameterizations:
  Parameterization 1: p(1)/[1 + p(2)exp(-p(3)x)] + p(4)
  Parameterization 2: p(1)/[1 + exp(p(2) - p(3)x)] + p(4)
  Parameterization 3: p(1)/[1 + exp(-[p(3)(x - p(2))])] + p(4)
  Parameterization 4: p(1)/[1 + exp(-[p(2) + p(3)x])] + p(4)

Gompertz growth: p(1)exp{-p(2)exp(-p(3)x)} + p(4)

Richards growth: {p(1)^(1 - p(4)) - p(2)exp(-p(3)x)}^(1/(1 - p(4))) + p(5)

Preece and Baines: p(4) - 2(p(4) - p(5))/{exp[p(1)(x - p(3))] + exp[p(2)(x - p(3))]}

Weibull survival in four parameterizations:
  Parameterization 1: p(1)exp(-p(2)[x^p(3)]) + p(4)
  Parameterization 2: p(1)exp(-[x^p(3)]/p(2)) + p(4)
  Parameterization 3: p(1)exp(-[p(2)x]^p(3)) + p(4)
  Parameterization 4: p(1)exp(-exp(p(2))[x^p(3)]) + p(4)

Gompertz survival: p(1)exp{-[p(2)/p(3)][exp(p(3)x) - 1]} + p(4)


C.2.5 Biochemical models


Monod-Wyman-Changeux allosterism: K = p(1), c = p(2) < 1, L = p(3), V = p(4)
  p(1)p(4)x[(1 + p(1)x)^(n-1) + p(2)p(3)(1 + p(1)p(2)x)^(n-1)] / [(1 + p(1)x)^n + p(3)(1 + p(1)p(2)x)^n]

Lag phase to steady state: p(1)x + p(2){1 - exp(-p(3)x)} + p(4)

One-site binding: x = [Total ligand], Kd = p(1), [Total sites] = p(2),
  [A - sqrt(A^2 - 4p(2)x)]/(2p(2)); where A = x + p(1) + p(2)

Irreversible Michaelis Menten progress curve: Km = p(1), Vmax = p(2), [S(t = 0)] = p(3),
  p(1)ln[p(3)/(p(3) - f(x))] + f(x) - p(2)x = 0, where f(0) = 0

Michaelis-Menten plus diffusion in three modes with x = [S] or [Cold], and y = [Hot]:
  Type 1: p(1)x/(p(2) + x) + p(3)x, [No hot], p(1) = Vmax, p(2) = Km, p(3) = D
  Type 2: p(1)y/(p(2) + y + x) + p(3)y, [Hot input], p(1) = Vmax, p(2) = Km, p(3) = D
  Type 3: p(1)/(p(2) + x) + p(3), [Hot subsumed], p(1) = Vmax*y, p(2) = Km + y, p(3) = D*y

Generalized inhibition: p(1)/(p(2) + x)

C.2.6 Chemical models


Arrhenius rate constant law: p(1)exp(-p(2)/x)

Transition state rate constant law: p(1)x^p(3) exp(-p(2)/x)

B in A -> B -> C:
  [p(1)p(3)/(p(2) - p(1))]exp(-p(1)x) + p(4)exp(-p(2)x) - [p(1)p(3)/(p(2) - p(1))]exp(-p(2)x)

C in A -> B -> C:
  p(3) + p(4) + p(5) - p(3)exp(-p(1)x) - [p(1)p(3)/(p(2) - p(1))]exp(-p(1)x)
  - p(4)exp(-p(2)x) + [p(1)p(3)/(p(2) - p(1))]exp(-p(2)x)

B in A <-> B reversibly:
  [p(1)(p(3) + p(4)) + (p(2)p(4) - p(1)p(3))exp(-(p(1) + p(2))x)]/(p(1) + p(2))

Michaelis pH functions with scaling and displacement factors:
  p(3)[1 + p(1)/x + p(1)p(2)/x^2] + p(4)
  p(3)[1 + x/p(1) + p(2)/x] + p(4)
  p(3)[1 + x/p(2) + x^2/(p(1)p(2))] + p(4)

Freundlich isotherm: p(1)x^(1/p(2)) + p(3)


C.2.7 Physical models


Diffusion into a capillary: p(1)erfc[x/(2 sqrt(p(2)))], where p(2) = Dt

Full Mualen equation: [1/(1 + (p(1)x)^p(2))]^p(3)

Short Mualen equation: [1/(1 + (p(1)x)^p(2))]^(1 - 1/n), where n = p(2)

C.2.8 Statistical models


Normal pdf: [p(3)/(p(2)sqrt(2 pi))]exp(-(1/2)[(x - p(1))/p(2)]^2)

Beta pdf: p(3)[Gamma(p(1) + p(2))/(Gamma(p(1))Gamma(p(2)))]x^(p(1)-1)(1 - x)^(p(2)-1)

Exponential pdf: p(1)p(2)exp(-p(1)x)

Cauchy pdf: p(3)/(pi p(2){1 + [(x - p(1))/p(2)]^2})

Logistic pdf: p(3)exp[(x - p(1))/p(2)] / (p(2){1 + exp[(x - p(1))/p(2)]}^2)

Lognormal pdf: [p(3)/(p(2)x sqrt(2 pi))]exp(-(1/2)[(ln x - p(1))/p(2)]^2)

Gamma pdf: p(3)p(1)^p(2) x^(p(2)-1) exp(-p(1)x)/Gamma(p(2))

Rayleigh pdf: p(2)[x/p(1)^2]exp(-(1/2)[x/p(1)]^2)

Maxwell pdf: p(2)sqrt(2/pi)[x^2/p(1)^3]exp(-(1/2)[x/p(1)]^2)

Weibull pdf: p(3)p(1)[x^(p(1)-1)/p(2)]exp(-x^p(1)/p(2))

Normal cdf, i.e. integral from -infinity to x of the normal pdf

Beta cdf, i.e. integral from 0 to x of the beta pdf

Exponential cdf: p(2){1 - exp(-p(1)x)}

Cauchy cdf: p(3)[1/2 + (1/pi)arctan((x - p(1))/p(2))]

Logistic cdf: p(3)exp{(x - p(1))/p(2)}/(1 + exp{(x - p(1))/p(2)})

Lognormal cdf, i.e. integral from 0 to x of the lognormal pdf

Weibull cdf: p(3){1 - exp(-x^p(1)/p(2))}

Logit in exponential format: 1/(1 + exp[-{p(1) + p(2)x}])

Probit in Normal cdf format: Phi(p(1) + p(2)x)


C.2.9 Empirical models


Hill with n fixed: p(1)x^n/(p(2)^n + x^n) + p(3)

Hill with n varied: p(1)x^p(3)/(p(2)^p(3) + x^p(3)) + p(4)

Power law: p(1)x^p(2) + p(3)

log10 law: p(1)log10(x) + p(2)

Up/Down exponential: p(3){exp(-p(1)x) - exp(-p(2)x)} + p(4)

Up/Down logistic: p(1)/[1 + exp(p(2) - p(3)x) + exp(p(4) + p(5)x)] + p(6)

Double exponential plus quadratic: p(1)exp(-p(2)x) + p(3)exp(-p(4)x) + p(5)x^2 + p(6)x + p(7)

Double logistic: p(1)/[1 + exp(p(2) - p(3)x)] + p(4)/[1 + exp(p(5) - p(6)x)] + p(7)

Linear plus reciprocal: p(1)x + p(2)/x + p(3)

Gaussian plus exponential: [p(3)/(p(2)sqrt(2 pi))]exp(-(1/2)[(x - p(1))/p(2)]^2) + p(5)exp(-p(4)x) + p(6)

Gaussian times exponential: [p(3)/(p(2)sqrt(2 pi))]exp(-(1/2)[(x - p(1))/p(2)]^2)exp(-p(4)x) + p(5)

C.2.10 Mathematical models


Upper or lower semicircle: p(2) +/- sqrt(p(3)^2 - (x - p(1))^2)

Upper or lower semiellipse: p(2) +/- p(4)sqrt(1 - [(x - p(1))/p(3)]^2)

Sine/Cosine: p(1)sin(p(3)x) + p(2)cos(p(3)x) + p(4)

Damped SHM: exp(-p(4)x)[p(1)sin(p(3)x) + p(2)cos(p(3)x)] + p(5)

Arctangent: p(1)arctan(p(2)x) + p(3)

Gamma type: p(1)x^p(2) exp(-p(3)x) + p(4)

Sinh/Cosh: p(1)sinh(p(3)x) + p(2)cosh(p(3)x) + p(4)

Tanh: p(1)tanh(p(2)x) + p(3)

C.3 Functions of two variables


C.3.1 Polynomials
Degree 1: p(1)x + p(2)y + p(3)
Degree 2: p(1)x + p(2)y + p(3)x^2 + p(4)xy + p(5)y^2 + p(6)
Degree 3: p(1)x + p(2)y + p(3)x^2 + p(4)xy + p(5)y^2 + p(6)x^3 + p(7)x^2y + p(8)xy^2 + p(9)y^3 + p(10)


C.3.2 Rational functions:


2 : 2 with f(0,0) = 0:
  p(1)xy/[1 + p(2)x + p(3)y + p(4)x^2 + p(5)xy + p(6)y^2]

3 : 3 with f(0,0) = 0:
  [p(1)xy + p(2)x^2y + p(3)xy^2] / [1 + p(4)x + p(5)y + p(6)x^2 + p(7)xy + p(8)y^2 + p(9)x^3 + p(10)x^2y + p(11)xy^2 + p(12)y^3]

1 : 1 rational function:
  [p(5) + p(1)x + p(2)y]/[1 + p(3)x + p(4)y]

2 : 2 rational function:
  [p(11) + p(1)x + p(2)y + p(3)x^2 + p(4)xy + p(5)y^2] / [1 + p(6)x + p(7)y + p(8)x^2 + p(9)xy + p(10)y^2]

C.3.3 Enzyme kinetics


Reversible Michaelis Menten (product inhibition):
  [p(1)x/p(2) - p(3)y/p(4)] / [1 + x/p(2) + y/p(4)]

Competitive inhibition: p(1)x/[p(2)(1 + y/p(3)) + x]

Uncompetitive inhibition: p(1)x/[p(2) + x(1 + y/p(3))]

Noncompetitive inhibition: p(1)x/[(1 + y/p(3))(p(2) + x)]

Mixed inhibition: p(1)x/[p(2)(1 + y/p(3)) + x(1 + y/p(4))]

Ping pong bi bi: p(1)xy/[p(3)x + p(2)y + xy]

Ordered bi bi: p(1)xy/[p(3)p(4) + p(3)x + p(2)y + xy]

Time dependent inhibition: p(1)exp[-p(2)x/(1 + p(3)/y)]

Inhibition by competing substrate: [p(1)x/p(2)]/[1 + x/p(2) + y/p(3)]

Michaelis-Menten pH dependence: f(y)x/[g(y) + x],
  f(y) = p(1)/[1 + y/p(5) + p(6)y], g(y) = p(2)[1 + y/p(3) + p(4)y]/[1 + y/p(5) + p(6)y]

C.3.4 Biological
Logistic growth: p(1)/[1 + exp(-[p(2) + p(3)x + p(4)y + p(5)xy])] + p(6)

C.3.5 Physical
Diffusion into a capillary: p(1)erfc[x/(2 sqrt(p(2)y))]


C.3.6 Statistical
Bivariate normal pdf: p(1) = mu_x, p(2) = sigma_x, p(3) = mu_y, p(4) = sigma_y, p(5) = rho,
  [p(6)/(2 pi sigma_x sigma_y sqrt(1 - rho^2))] exp{-[1/(2(1 - rho^2))]
  [((x - mu_x)/sigma_x)^2 - 2 rho((x - mu_x)/sigma_x)((y - mu_y)/sigma_y) + ((y - mu_y)/sigma_y)^2]} + p(7)

Logit in exponential format: 1/(1 + exp[-{p(1) + p(2)x + p(3)y}])

Probit in Normal cdf format: Phi(p(1) + p(2)x + p(3)y)

C.4 Functions of three variables


C.4.1 Polynomials
Linear: p(1)x + p(2)y + p(3)z + p(4)

C.4.2 Enzyme kinetics


MWC activator/inhibitor:
  nV alpha(1 + alpha)^(n-1) / {(1 + alpha)^n + L[(1 + beta)/(1 + gamma)]^n}

  alpha = p(1)[x], beta = p(2)[y], gamma = p(3)[z], V = p(4), L = p(5)

C.4.3 Biological
Logistic growth:
  p(1)/[1 + exp(-[p(2) + p(3)x + p(4)y + p(5)z + p(6)xy + p(7)xz + p(8)yz + p(9)xyz])] + p(10)

C.4.4 Statistics

Logit in exponential format: 1/(1 + exp[-{p(1) + p(2)x + p(3)y + p(4)z}])

Probit in Normal cdf format: Phi(p(1) + p(2)x + p(3)y + p(4)z)

Appendix D

Editing PostScript files


D.1 The format of SIMFIT PostScript files
One of the unique features of SIMFIT PostScript files is that the format is designed to make retrospective editing easy. A typical example of when this could be useful would be when a graph needs to be changed for some reason. Typically an experimentalist might have many plots stored as .eps files and want to alter one for publication or presentation. SIMFIT users are strongly recommended to save all their plots as .ps or .eps files, so that they can be altered in the way to be described. Even if you do not have a PostScript printer it is still best to save as .ps, then use GSview/Ghostscript to print or transform into another graphics format. Consider these next two figures, showing how a graph can be transformed by simple editing in a text editor, e.g. NOTEPAD.
[Two figures: the original plot, titled "Binding Curve for the a2b2 isoform at 21C" (x-axis: Concentration of Free Ligand (uM), y-axis: Ligand Bound per Mole of Protein, with curves labelled 1 Site Model and 2 Site Model), and the edited plot, titled "Binding for the a4c4 isoform at 25C" (x-axis: Concentration/uM, y-axis: Ligand/Mole Protein, with curves labelled Model 1 and Model 2 and an added label, Experiment number 3).]

This type of editing should always be done if you want to use one figure as a reduced size inset figure inside another, or when making a slide, otherwise the SIMFIT default line thickness will be too thin. Note that most of the editing to be described below can actually be done at the stage of creating the file, or by using program EDITPS. In this hypothetical example, we shall suppose that the experimentalist had realized that the title referred to the wrong isoform and temperature, and also wanted to add extra detail, but simplify the graph in order to make a slide using thicker lines and a bolder font. In the following sections the editing required to transform the SIMFIT example file simfig1.ps will be discussed, following a preliminary warning.

D.1.1 Warning about editing PostScript files


In the first place, the technique to be described can only be used with SIMFIT PostScript files, because the format was developed to facilitate the sort of editing that scientists frequently need to perform. Secondly, it must be realized that PostScript files must conform to a very strict set of rules. If you violate these rules, then


GSview/Ghostscript will warn you and indicate the fault. Unfortunately, if you do not understand PostScript, the warning will be meaningless. So here are some rules that you must keep in mind when editing.

• Always keep a backup copy at each successful stage of the editing.

• All text after a single percentage sign % to the line end is ignored in PostScript.

• Parentheses must always be balanced, as in (figure 1(a)) not as in (figure 1(a).

• Fonts must be spelled correctly, e.g. Helvetica-Bold and not helveticabold.

• Character strings for displaying must have underneath them a vector index string of EXACTLY the same length.

• When introducing non-keyboard characters, each octal code represents one byte.

• The meaning of symbols and line types depends on the function, e.g. da means dashed line while do means dotted line.

A review of the PostScript colours, fonts and conventions is also in the w_readme files. In the next sections it will be assumed that you are running SIMFIT and have a renamed copy of simfig1.ps in your text editor (e.g. notepad), and after each edit you will view the result using GSview/Ghostscript. Any errors reported when you try to view the edited file will be due to violation of a PostScript convention. The most usual one is to edit a text string without correctly altering the index below it to have exactly the same number of characters.

D.1.2 The percent-hash escape sequence


Later versions of SIMFIT create PostScript files that can be edited by a stretch, clip, slide procedure, which relies on each line containing coordinates being identified by a comment line starting with %#. All text extending to the right from the first character of this sequence can safely be ignored and is suppressed for clarity in the following examples.

D.1.3 Changing line thickness and plot size


The following text will be observed in the original simfig1.ps file.

72.00 252.00 translate 0.07 0.07 scale 0.00 rotate
11.00 setlinewidth 0 setlinecap 0 setlinejoin [] 0 setdash
2.50 setmiterlimit

The postfix argument for setlinewidth alters the line width globally. In other words, altering this number by a factor will alter all the linewidths in the figure by this factor, irrespective of any changes in relative line thicknesses set when the file was created. The translate, scale and rotate are obvious, but perhaps best done by program EDITPS. Here is the same text edited to increase the line thickness by a factor of two and a half.

72.00 252.00 translate 0.07 0.07 scale 0.00 rotate
27.50 setlinewidth 0 setlinecap 0 setlinejoin [] 0 setdash
2.50 setmiterlimit

D.1.4 Changing PostScript fonts


In general the Times-Roman fonts may be preferred for readability in diagrams to be included in books, while Helvetica may look better in scientific publications. For making slides it is usually preferable to use Helvetica-Bold. Of course any PostScript fonts can be used, but in the next example we see how to change the fonts in simfig1.ps to achieve the effect illustrated.


/ti-font /Times-Bold D%plot-title
/xl-font /Times-Roman D%x-legend
/yl-font /Times-Roman D%y-legend
/zl-font /Times-Roman D%z-legend
/tc-font /Times-Roman D%text centred
/td-font /Times-Roman D%text down
/tl-font /Times-Roman D%text left to right
/tr-font /Times-Roman D%text right to left
/ty-font /Times-Roman D%text right y-mid
/tz-font /Times-Roman D%text left y-mid

The notation is obvious, the use indicated being clear from the comment text following the percentage sign % at each definition, denoted by a D. This is the editing needed to bring about the font substitution.

/ti-font /Helvetica-Bold D%plot-title
/xl-font /Helvetica-Bold D%x-legend
/yl-font /Helvetica-Bold D%y-legend
/zl-font /Helvetica-Bold D%z-legend
/tc-font /Helvetica-Bold D%text centred
/td-font /Helvetica-Bold D%text down
/tl-font /Helvetica-Bold D%text left to right
/tr-font /Helvetica-Bold D%text right to left
/ty-font /Helvetica-Bold D%text right y-mid
/tz-font /Helvetica-Bold D%text left y-mid

Observing the scheme for colours (just before the fonts in the file) and text sizes (following the font definitions) will make it obvious how to change colours and text sizes.

D.1.5 Changing title and legends


Observe the declaration for the title and legends in the original file.

(Binding Curve for the a2b2 isoform at 21@C) 3514 4502 ti
(000000000000000000000061610000000000000060) fx
(Concentration of Free Ligand(lM)) 3514 191 xl
(00000000000000000000000000000300) fx
(Ligand Bound per Mole of Protein) 388 2491 yl
(00000000000000000000000000000000) fx

Note that, for each of the text strings displayed, there is a corresponding index of font substitutions. For example, a zero prints the letter in the original font, a one denotes a subscript, while a six denotes bold maths. Since the allowed number of index keys is open-ended, the number of potential font substitutions is enormous. You can have any accent on any letter, for instance. This is the editing required to change the text. However, note that the positions of the text do not need to be changed; the font display functions work out the correct position to centre the text string.

(Binding for the a4c4 isoform at 25@C) 3514 4502 ti
(000000000000000061610000000000000060) fx
(Concentration/lM) 3514 191 xl
(0000000000000030) fx
(Ligand/Mole Protein) 388 2491 yl
(0000000000000000000) fx

Note that the \ character is an escape character in PostScript so, if you want to have something like an unbalanced parenthesis, as in Figure 1 a), you would have to write Figure 1a\). When you create a PostScript file from SIMFIT it will prevent you from writing a text string that violates PostScript conventions but, when you are editing, you must make sure yourself that the conventions are not violated, e.g. use c:\\simfit instead of c:\simfit.


D.1.6 Deleting graphical objects


It is very easy to delete any text or graphical object by simply inserting a percentage sign % at the start of the line to be suppressed. In this way an experimental observation can be temporarily suppressed, but it is still in the file to be restored later if required. Here is the PostScript code for the notation on the left hand vertical, i.e. y axis, in the file simfig1.ps.

910 1581 958 1581 li
6118 1581 6070 1581 li
(0.50) 862 1581 ty
(0000) fx
910 2491 958 2491 li
6118 2491 6070 2491 li
(1.00) 862 2491 ty
(0000) fx
910 3401 958 3401 li
6118 3401 6070 3401 li
(1.50) 862 3401 ty
(0000) fx

This is the text, after suppressing the tick marks and notation for y = 0.5 and y = 1.5 by inserting a percentage sign. Note that the index must also be suppressed as well as the text string.

%910 1581 958 1581 li
%6118 1581 6070 1581 li
%(0.50) 862 1581 ty
%(0000) fx
910 2491 958 2491 li
6118 2491 6070 2491 li
(1.00) 862 2491 ty
(0000) fx
%910 3401 958 3401 li
%6118 3401 6070 3401 li
%(1.50) 862 3401 ty
%(0000) fx

D.1.7 Changing line and symbol types


This is simply a matter of substituting the desired line or plotting symbol key. Lines : Circles : Triangles: Squares : Diamonds : Signs : li ce te se de ad (normal) (empty) (empty) (empty) (empty) (add) da ch th sh dh mi (dashed) (half) (half) (half) (half) (minus) do (dotted) dd (dashed dotted) pl (polyline) cf(full) tf (full) sf (full) df (full) cr (cross) as (asterisk)

Here is the original text for the dashed line and empty triangles.

5697 3788 120 da
933 1032 72 te
951 1261 72 te
984 1566 73 te
1045 1916 72 te
1155 2346 72 te
1353 2708 73 te
1714 3125 72 te
2367 3597 72 te
3551 3775 72 te
5697 4033 72 te

Here is the text edited for a dotted line and empty circles.

5697 3788 120 do
933 1032 72 ce
951 1261 72 ce
984 1566 73 ce
1045 1916 72 ce
1155 2346 72 ce
1353 2708 73 ce
1714 3125 72 ce
2367 3597 72 ce
3551 3775 72 ce
5697 4033 72 ce

D.1.8 Adding extra text


Here is the original extra text section.

/font /Times-Roman D /size 216 D
GS font F size S 4313 2874 M 0 rotate
(1 Site Model)
(000000000000) fx
/font /Times-Roman D /size 216 D
GS font F size S 1597 2035 M 0 rotate
(2 Site Model)
(000000000000) fx

Here is the above text after changing the font.

/font /Helvetica-BoldOblique D /size 216 D
GS font F size S 4313 2874 M 0 rotate
(Model 1)
(0000000) fx
/font /Helvetica-BoldOblique D /size 216 D
GS font F size S 1597 2035 M 0 rotate
(Model 2)
(0000000) fx

Here is the additional code required to add another label to the plot.

/font /Helvetica-BoldOblique D /size 240 D
GS font F size S 2250 1200 M 0 rotate
(Experiment number 3)
(0000000000000000000) fx


D.1.9 Standard fonts


All PostScript printers have a basic set of 35 fonts, and it can safely be assumed that graphics using these fonts will display in GSview/Ghostscript and print on all except the most primitive PostScript printers. Of course there may be a wealth of other fonts available. The Times and Helvetica fonts are well known, and the monospaced Courier family of typewriter fonts is sometimes convenient for tables.

[Specimens of the full printable character set in the Times-Roman, Times-Bold, Times-BoldItalic, Helvetica, Helvetica-Bold, and Helvetica-BoldOblique fonts.]

D.1.10 Decorative fonts


Sometimes decorative or graphic fonts are required, such as pointing hands or scissors. It is easy to include such fonts using program Simplot, although the characters will be visible only if the plot is inspected using GSview/Ghostscript.


[Specimens of the printable character sets in the Symbol, ZapfDingbats, and ZapfChancery-MediumItalic fonts, together with tables of some extra characters available in Times, Helvetica, etc. and in Symbol, each shown with its octal code.]
D.1.11 Plotting characters outside the keyboard set
To use characters outside the keyboard set you have to use the corresponding octal codes. Note that these codes represent just one byte in PostScript so, in this special case, four string characters need only one key character. For example, such codes as \277 for an upside down question mark in standard encoding, or \326 for a square root sign in Symbol, only need one index key. You might wonder why, if Simplot can put any accent on any character and there are maths and bold maths fonts, you would ever want alternative encodings, like the ISOLatin1Encoding. This is because the ISOLatin1Encoding allows you to use specially formed accented letters, which are more accurately proportioned than those generated by program Simplot by adding the accent as a second over-printing character, e.g. using \361 for n-tilde is more professional than overprinting. All the characters present in the coding vectors to be shown next can be used by program Simplot, as well as a special Maths/Greek font and a vast number of accented letters and graphical objects, but several points must be remembered.

All letters can be displayed using GSview/Ghostscript and then Adobe Acrobat after distilling to pdf.

Although substitutions can be made interactively from Simplot, you can also save a .eps file and edit it in a text editor.

When using an octal code to introduce a non-keyboard character, only use one index key for the four character code.

If you do not have a PostScript printer, save plots as .eps files and print from GSview/Ghostscript or transform into graphics files to include in documents.

Some useful codes follow, followed by examples to clarify the subject. You will find it instructive to view simfonts.ps in the SIMFIT viewer and display it in GSview/Ghostscript.


D.1.12 The StandardEncoding Vector

[Table: the StandardEncoding vector, octal rows \00x to \37x by columns 0 to 7, showing the character assigned to each octal code.]


D.1.13 The ISOLatin1Encoding Vector

[Table: the ISOLatin1Encoding vector, octal rows \00x to \37x by columns 0 to 7, showing the character assigned to each octal code.]


D.1.14 The SymbolEncoding Vector

[Table: the SymbolEncoding vector, octal rows \00x to \37x by columns 0 to 7, showing the character assigned to each octal code.]


D.1.15 The ZapfDingbatsEncoding Vector

[Table: the ZapfDingbatsEncoding vector, octal rows \00x to \37x by columns 0 to 7, showing the character assigned to each octal code.]


D.1.16 SIMFIT character display codes


0  Standard font
1  Standard font subscript
2  Standard font superscript
3  Maths/Greek
4  Maths/Greek subscript
5  Maths/Greek superscript
6  Bold Maths/Greek
7  ZapfDingbats (PostScript), Wingding (Windows)
8  ISOLatin1Encoding (PostScript), Standard (Windows, almost)
9  Special (PostScript), Wingding2 (Windows)
A  Grave accent
B  Acute accent
C  Circumflex/Hat
D  Tilde
E  Macron/Bar/Overline
F  Dieresis
G  Maths/Greek-hat
H  Maths/Greek-bar
I  Bold Maths/Greek-hat
J  Bold Maths/Greek-bar
K  Symbol font
L  Bold Symbol font

You will need non-keyboard characters from the standard font for such characters as a double dagger (‡) or upside down question mark (¿), e.g. typing \277 in a text string would generate the upside down question mark (¿) in the PostScript output. If you want to include a single backslash in a text string, use \\, and also cancel any unpaired parentheses using \( and \). Try it in program SIMPLOT and it will then all make sense. The ISOLatin1Encoding vector is used for special characters, such as \305 for Angstrom (Å), \361 for n-tilde (ñ), or \367 for the division sign (÷), and, apart from a few omissions, the standard Windows font is the same as the ISOLatin1Encoding. The Symbol and ZapfDingbats fonts are used for including special graphical characters like scissors or pointing hands in a text string. A special font is reserved for PostScript experts who want to add their own character function. Note that, in a document with many graphs, the prologue can be cut out from all the graphs and sent to the printer just once at the start of the job. This compresses the PostScript file, saves memory and speeds up the printing. Examine the manual's source code for this technique. If you type four-character octal codes as character strings for plotting non-keyboard characters, you do not have to worry about adjusting the character display codes, as program SIMPLOT will make the necessary corrections. The only time you have to be careful about the length of character display code vectors is when editing in a text editor. If in doubt, just pad the character display code vector with question marks until it is the same length as the character string.
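As a worked illustration (a hypothetical fragment in the style of the listings shown earlier, with arbitrary coordinates), a y-axis label printing as √x could be coded with the Symbol-font square root character \326 matched by the single display code key K (Symbol font), while the keyboard character x takes a 0.

(\326x) 388 2491 yl
(K0) fx

Although \326 occupies four typed characters in the text string, it is a single byte in PostScript, so it consumes just one position in the display code vector.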

350

S IMFIT reference manual: Part 5

D.2 editps text formatting commands


Program editps uses the SIMFIT convention for text formatting characters within included SIMFIT .eps files but, because this is rather cumbersome, a simplified set of formatting commands is available within editps whenever you want to add text, or even create PostScript files containing text only. The idea of these formatting commands is to allow you to introduce superscripts, subscripts, accented letters, maths, dashed lines or plotting symbols into PostScript text files, or into collage titles, captions, or legends, using only ASCII text controls. To use a formatting command you simply introduce the command into the text enclosed in curly brackets as in: {raise}, {lower}, {newline}, and so on. If {anything} is a recognized command then it will be executed when the .eps file is created. Otherwise the literal string argument, i.e. anything, will be printed with no inter-word space. Note that no {commands} add inter-word spaces, so this provides a mechanism to build up long character strings and also control spacing; use {anything} to print anything with no trailing inter-word space, or use { } to introduce an inter-word space character. To introduce spaces for tabbing, for instance, just use {newline}{ }start-of-tabbing, with the number of spaces required inside the { }. Note that the commands are both spelling and case sensitive, so, for instance, {21}{degree}{C} will indicate the temperature intended, but {21}{degrees}{C} will print as 21degreesC while {21}{Degree}{C} will produce 21DegreeC.

D.2.1 Special text formatting commands, e.g. left


{left} . . . use {left} to print a {
{right} . . . use {right} to print a }
{%!command} . . . use {%!command} to issue command as raw PostScript

The construction {%!command} should only be used if you understand PostScript. It provides PostScript programmers with the power to create special effects. For example {%!1 0 0 setrgbcolor} will change the font colour to red, and {%!0 0 1 setrgbcolor} will make it blue, while {%!2 setlinewidth} will double line thickness. In fact, with this feature, it is possible to add almost any conceivable textual or graphical object to an existing .eps file.

D.2.2 Coordinate text formatting commands, e.g. raise


{raise} . . . use {raise} to create a superscript or restore after {lower}
{lower} . . . use {lower} to create a subscript or restore after {raise}
{increase} . . . use {increase} to increase font size by 1 point
{decrease} . . . use {decrease} to decrease font size by 1 point
{expand} . . . use {expand} to expand inter-line spacing by 1 point
{contract} . . . use {contract} to contract inter-line spacing by 1 point

D.2.3 Currency text formatting commands, e.g. dollar


{dollar} $   {sterling} £   {yen} ¥

D.2.4 Maths text formatting commands, e.g. divide


{divide} ÷   {multiply} ×   {plusminus} ±

D.2.5 Scientic units text formatting commands, e.g. Angstrom


{Angstrom} Å   {degree} °   {micron} µ


D.2.6 Font text formatting commands, e.g. roman


{roman}   {bold}   {italic}   {helvetica}   {helveticabold}   {helveticaoblique}   {symbol}   {zapfchancery}   {zapfdingbats}   {isolatin1}

Note that you can use octal codes to get extra-keyboard characters, and the character selected will depend on whether the StandardEncoding or ISOLatin1Encoding is current. For instance, \361 will locate an ae-ligature (æ) character if the StandardEncoding Encoding Vector is current, but it will locate an n-tilde (ñ) character if the ISOLatin1Encoding Encoding Vector is current, i.e. if the command {isolatin1} has been used previously. The command {isolatin1} will install the ISOLatin1Encoding Vector as the current Encoding Vector until it is cancelled by any font command, such as {roman}, or by any shortcut command such as {ntilde} or {alpha}. For this reason, {isolatin1} should only be used for characters where shortcuts like {ntilde} are not available.

D.2.7 Poor man's bold text formatting command, e.g. pmb?


The command {pmb?} will use the same technique of overprinting as used by the Knuth TeX macro to render the argument, that is ? in this case, in bold face font, where ? can be a letter or an octal code. This is most useful when printing a boldface character from a font that only exists in standard typeface. For example, {pmbb} will print a boldface letter b in the current font then restore the current font, while {symbol}{pmbb}{roman} will print a boldface beta then restore roman font. Again, {pmb\243} will print a boldface pound sign.

D.2.8 Punctuation text formatting commands, e.g. dagger


{dagger} †   {daggerdbl} ‡   {questiondown} ¿   {paragraph} ¶   {section} §

D.2.9 Letters and accents text formatting commands, e.g. Aacute


{Aacute} Á
{agrave} à   {aacute} á   {acircumflex} â   {atilde} ã   {adieresis} ä   {aring} å   {ae} æ
{ccedilla} ç
{egrave} è   {eacute} é   {ecircumflex} ê   {edieresis} ë
{igrave} ì   {iacute} í   {icircumflex} î   {idieresis} ï
{ntilde} ñ
{ograve} ò   {oacute} ó   {ocircumflex} ô   {otilde} õ   {odieresis} ö
{ugrave} ù   {uacute} ú   {ucircumflex} û   {udieresis} ü

All the other special letters can be printed using {isolatin1} (say, just once at the start of the text) then using the octal codes; for instance {isolatin1}{\303} will print an upper case N-tilde.

D.2.10 Greek text formatting commands, e.g. alpha


{alpha} α   {beta} β   {gamma} γ   {delta} δ   {epsilon} ε
{eta} η   {kappa} κ   {lambda} λ   {mu} μ   {nu} ν
{pi} π   {phi} φ   {theta} θ   {rho} ρ   {sigma} σ
{tau} τ   {omega} ω   {chi} χ   {psi} ψ

All the other characters in the Symbol font can be printed by installing Symbol font, supplying the octal code, then restoring the font, as in {symbol}{\245}{roman}, which will print an infinity sign (∞) then restore Times-Roman font.


D.2.11 Line and Symbol text formatting commands, e.g. ce


{li} = line
{da} = dashed line
{do} = dotted line
{dd} = dashed dotted line
{ce}, {ch}, {cf} = circle (empty, half filled, filled)
{te}, {th}, {tf} = triangle (empty, half filled, filled)
{se}, {sh}, {sf} = square (empty, half filled, filled)
{de}, {dh}, {df} = diamond (empty, half filled, filled)

These line and symbol formatting commands can be used to add information panels to legends, titles, etc. to identify plotting symbols.
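For instance, an information panel identifying two data sets could be added to a legend as follows (a hypothetical fragment; the labels are arbitrary):

{ce}{ }Data set 1{newline}{te}{ }Data set 2

This prints an empty circle and an empty triangle, each followed by an inter-word space and its label, on two separate lines.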

D.2.12 Examples of text formatting commands


{TGF}{beta}{lower}{1}{raise} is involved → TGFβ1 is involved
y = {x}{raise}{2}{lower} + 2 → y = x² + 2
The temperature was {21}{degree}{C} → The temperature was 21°C
{pi}{r}{raise}{decrease}{2}{increase}{lower} is the area of a circle → πr² is the area of a circle
The {alpha}{lower}{2}{raise}{beta}{lower}{2}{raise} isoform → The α2β2 isoform
{[Ca}{raise}{decrease}{++}{increase}{lower}{]} = {2}{mu}{M} → [Ca++] = 2µM


D.3 PostScript specials


SIMFIT PostScript files are designed to facilitate editing, and one important type of editing is to be able to specify text files, known as specials, that can modify the graph in an almost unlimited number of ways. This technique will now be described but, if you want to do it and you are not a PostScript programmer, do not even think about it; get somebody who has the necessary skill to do what you want. An example showing how to display a logo will be seen on page 235, and further details follow.

D.3.1 What specials can do


First of all, here are some examples of things you may wish to do with SIMFIT PostScript files that would require specials.

Replace the 35 standard fonts by special user-defined fonts.

Add a logo to plots, e.g. a departmental heading for slides.

Redefine the plotting symbols, line types, colours, fill styles, etc.

Add new features, e.g. outline or shadowed fonts, or clipping to non-rectangular shapes.

When SIMFIT PostScript files are created, a header section, called a prologue, is placed at the head of the file, which contains all the definitions required to create the SIMFIT dictionary. Specials can be added, as independent text files, to the files after these headings in order to re-define any existing functions, or even add new PostScript plotting instructions. The idea is very simple; you can just modify the existing SIMFIT dictionary, or even be ambitious and add completely new and arbitrary graphical objects.

D.3.2 The technique for dening specials


Any SIMFIT PostScript file can be taken into a text editor in order to delete the existing header, either to save space in a large document, as done with the SIMFIT manual, or else to paste in a special. However, this can also be done interactively by using the font option, accessible from the SIMFIT PostScript interface. Since this mechanism is so powerful, and could easily lead to the PostScript graphics being permanently disabled by an incorrectly formatted special, SIMFIT always assumes that no specials are installed. If you want to use a special, then you simply install the special and it will be active until it is de-selected or replaced by another special. Further details will be found in the on-line documentation and w_readme files, and examples of specials are distributed with the SIMFIT package to illustrate the technique. You should observe the effect of the example specials before creating your own. Note that any files created with specials can easily be restored to the default configuration by cutting out the special. So it makes sense to format your specials like the SIMFIT example specials pspecial.1, etc. to facilitate such retrospective editing. The use of specials is controlled by the file pspecial.cfg, as now described. The first ten lines are Booleans indicating which of files 1 through 10 are to be included. The next ten lines are the file names containing the special code. There are ten SIMFIT examples supplied, and it is suggested that line 1 of your specials should be in the style of these examples. You simply edit the file names in pspecial.cfg to install your own specials. The Booleans can be edited interactively from the advanced graphics PS/Fonts option. Note that any specials currently installed are flagged by the SIMFIT program manager, and specials only work in advanced graphics mode. In the event of problems with PostScript printing caused by specials, just delete pspecial.cfg. To summarise:

Create the special you want to insert.

Edit the file pspecial.cfg in the SIMFIT folder.

Attach the special using the PostScript Font option.
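As an illustration of the layout just described, a pspecial.cfg enabling only the first example special might look like the following sketch. Note that the exact Boolean tokens expected are an assumption here, and should be checked against the pspecial.cfg distributed with the package.

true
false
false
false
false
false
false
false
false
false
pspecial.1
pspecial.2
pspecial.3
pspecial.4
pspecial.5
pspecial.6
pspecial.7
pspecial.8
pspecial.9
pspecial.10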


D.3.3 Examples of PostScript specials


To clarify the structure of SIMFIT PostScript specials, just consider the code for the first three examples distributed with the SIMFIT package. The file pspecial.1 simply adds a monochrome logo, the file pspecial.2 shows how to add color, while the file pspecial.3 makes more sweeping changes to the color scheme by reversing the definitions for black and white.

The PostScript special pspecial.1


%file = pspecial.1: add monochrome simfit logo to plot
gsave
/printSIMFIT {0 0 moveto (SIMFIT) show} def
/Times-Italic findfont 300 scalefont setfont
300 4400 translate
.95 -.05 0 {setgray printSIMFIT -10 5 translate} for
1 1 1 setrgbcolor printSIMFIT
grestore
%end of pspecial.1

The PostScript special pspecial.2


%file = pspecial.2: add yellow simfit logo to plot
gsave
/printSIMFIT {0 0 moveto (SIMFIT) show} def
/Times-Italic findfont 300 scalefont setfont
300 4400 translate
.95 -.05 0 {setgray printSIMFIT -10 5 translate} for
0 0 moveto (SIMFIT) true charpath gsave 1 1 0 setrgbcolor fill grestore
grestore
%end of pspecial.2

The PostScript special pspecial.3


%file = pspecial.3: yellow-logo/blue-background/swap-black-and-white
/background{.5 .5 1 setrgbcolor}def
background
0 0 0 4790 6390 4790 6390 0 4 pf
/c0{1 1 1 setrgbcolor}def
/c15{0 0 0 setrgbcolor}def
/foreground{c0}def
gsave
/printSIMFIT {0 0 moveto (SIMFIT) show} def
/Times-Italic findfont 300 scalefont setfont
300 4400 translate
.95 -.05 0 {setgray printSIMFIT -10 5 translate} for
0 0 moveto (SIMFIT) true charpath gsave 1 1 0 setrgbcolor fill grestore
grestore
%end of pspecial.3

Remember, the effects of these specials are only visible in the PostScript files created by SIMFIT, and not in any direct Windows quality hardcopy.

Appendix E

Auxiliary programs
E.1 Recommended software
SIMFIT can be used as a self-contained free-standing package. However, it is assumed that users will want to integrate SIMFIT with other software, and the driver w_simfit.exe has been constructed with the Windows calculator, the Windows Notepad text editor, the GSview/Ghostscript PostScript interpreter, and the Adobe Acrobat pdf reader as defaults. Users can, of course, easily replace these by their own choices, by using the Configuration option on the main menu. The clipboard can be used to integrate SIMFIT with any other Windows programs, and it is assumed that some users will want to interface SIMFIT with LaTeX while others would use the Microsoft Office suite.

E.1.1 The interface between SIMFIT and GSview/Ghostscript


You must install Ghostscript to get the most out of SIMFIT graphics, and the user-friendly GSview is strongly recommended to view PostScript files, or drive non-PostScript printers. Note that, although it is not essential to have GSview installed, you must have Ghostscript installed and SIMFIT must be configured to use it in order for SIMFIT to make graphics files such as .png from SIMFIT .eps files. Visit the home pages at http://www.cs.wisc.edu/ghost/
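For reference, such conversions can also be performed outside SIMFIT from the command line. A typical Ghostscript invocation (using standard Ghostscript options, independent of SIMFIT's own configuration; the file names are examples) would be:

gswin32c -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 -sOutputFile=plot.png plot.eps

Here -sDEVICE=png16m selects 24-bit colour PNG output and -r300 sets the resolution to 300 dots per inch.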
E.1.2 The interface between SIMFIT, LaTeX and Dvips

The .eps files generated by SIMFIT have correct BoundingBox dimensions, so that LaTeX can use packages such as Dvips, Wrapfig, PSfrag and so on, as will be clear, for instance, from the LaTeX code for this manual.

E.2 SIMFIT, Microsoft Office, and OpenOffice


SIMFIT can import tables with or without row and column labels in any of these formats.
a) Tables in SIMFIT format
b) Tables copied to the clipboard from any program
c) Tables written to file using any program
d) Spreadsheet files saved in tab, space, comma, or semicolon delimited format
e) Spreadsheet files saved in XML or HTML format
f) Spreadsheet tables saved in Unicode format

Also, SIMFIT writes ASCII text results files, and outputs graphics files in .eps, .png, .pdf, .emf, .pcx, .jpg, .tif, or .bmp formats. So, as many SIMFIT users are familiar with Notepad, Microsoft Office, or OpenOffice, this document explains how to interface to these programs.


E.3 Definitions

Here we summarize the definitions to be used in this section.

E.3.1 Data tables


A data table is an n by m rectangular set of cells which contain only numerical values, as in the following example of a 2 by 3 data matrix.

1.1  1.2  1.3
2.1  2.2  2.3

Here the columns are space-delimited, i.e., separated by one or more spaces, but we can also have comma-delimited tables, with zero or more spaces in addition to the commas, as shown next.

1.1, 1.2, 1.3
2.1, 2.2, 2.3

In spreadsheet and word-processing programs, data tables are stored in tab-delimited format, which results in tables being displayed with cells outlined by lines, as follows.

1.1  1.2  1.3
2.1  2.2  2.3

E.3.2 Labeled data tables


A labeled data table employs row 1 for column labels and column 1 for row labels, with a dummy label in cell(1,1), as illustrated below.

R/C    Col-1  Col-2  Col-3
Row-1  1.1    1.2    1.3
Row-2  2.1    2.2    2.3

Here labels are shown as character strings, and the main table contains only numerical data. Note that there can be no spaces in labels of space-delimited tables, and underscores, hyphens, or similar must be used to avoid ambiguities. Spaces within labels also cause problems with importing SIMFIT results log files into Word and Excel.

E.3.3 Missing values


Missing values arise because data values are misplaced, not recorded, or discarded, and this leads to a deficient table. The next example illustrates how a missing value can be indicated by an empty cell or a character variable in a data cell.

1.1        1.3
2.1  2.2   X

Any non-numerical characters can be used to indicate missing values, and pre-processing is required to replace empty cells by estimates derived in some way from the rest of the data set, before SIMFIT can process the data.


E.3.4 SIMFIT data les


The SIMFIT data file format requires a title on line 1, then the number of rows and columns on line 2, followed by the table of numerical values, with no missing values, as in the next example.

Title for data
2 3
1.1 1.2 1.3
2.1 2.2 2.3

E.3.5 SIMFIT data les with labels


For some purposes row and column labels may be appended to SIMFIT data files using a special symbolism as follows.

Title for data
2, 3
1.1, 1.2, 1.3
2.1, 2.2, 2.3
7
begin{labels}
Row-1
Row-2
Col-1
Col-2
Col-3
end{labels}

The number 7 after the data table indicates that there are 7 extra lines appended, but this extra counter is optional. Note that the label in cell(1,1) is not recorded, and the labels are in sequential order with rows followed by columns. Note that SIMFIT data files are created automatically by all of the various methods described in the following sections, so that the user need only create a data table or labeled data table before using a particular method.

E.3.6 Clipboard data


The only type of clipboard data that can be used by SIMFIT is where a data table or labeled data table has been selected and copied to the clipboard. If a space-delimited labeled table has been copied to the clipboard from a text editor like Notepad, there must be no spaces within labels, so use underscores.

E.3.7 Files exported from spreadsheet programs


The only types of files exported from spreadsheet programs that can be used by SIMFIT are data tables or labeled data tables exported in text, XML, or HTML format. Text files can be delimited by spaces, tabs, commas, or semicolons, but files exported in space-delimited text format must have no spaces within labels, so use underscores.


E.4 Spreadsheet tables


Data for analysis by SIMFIT would usually be held in a spreadsheet program, like Microsoft Office Excel or OpenOffice Calc, as in this example of an unselected table of multivariate statistical data from K.R. Gabriel in Biometrika 1971, 58, 453-67.
Percent     Toilet  Kitchen  Bath  Electricity  Water  Radio  TV set  Refrigerator
Christian   98.2    78.8     14.4  86.2         32.9   73     4.6     29.2
Armenian    97.2    81       17.6  82.1         30.3   70.4   6       26.3
Jewish      97.3    65.6     6     54.5         21.1   53     1.5     4.3
Moslem      96.9    73.3     9.6   74.7         26.9   60.5   3.4     10.5
American    97.6    91.4     56.2  87.2         80.1   81.2   12.7    52.8
Shaafat     94.4    88.7     69.5  80.4         74.3   78     23      49.7
A-Tur       90.2    82.2     31.8  68.6         46.3   67.9   5.6     21.7
Silwan      94      84.2     19.5  65.5         36.2   64.8   2.7     9.5
Sur-Bahar   70.5    55.1     10.7  26.1         9.8    57.1   1.3     1.2

Such tables must be rectangular, with optional row and column labels, and all other cells filled with numerical data (missing data will be discussed later). Often it would only be necessary to select cells containing numerical values from such a table by highlighting as follows.
[The same table with only the cells containing numerical values highlighted.]

However, sometimes row and column labels could also be needed, when a labeled table with cells containing either labels or numerical values would be selected, as follows.
[The same table with the row labels, column labels, and numerical cells all highlighted.]

As an example, consider the next figures, showing these data from the SIMFIT test file houses.tf1 when displayed as biplots. The dummy label in cell(1,1) is not used.


[Figures: a two-dimensional plot titled Multivariate Biplot and a plot titled Three Dimensional Multivariate Biplot, both displaying the houses.tf1 variables (Toilet, Kitchen, Bath, Electricity, Water, Radio, TV set, Refrigerator) and the cases as labeled points.]

This document explains how to transfer data from such spreadsheet tables into SIMFIT.


E.5 Using the clipboard to transfer data into SIMFIT


There are two formats that can be used to transfer a rectangular tabular data set into SIMFIT for analysis.

The cells contain only numerical data

The cells also have associated row and column labels

If labels are present then the column labels (usually variables) must occupy the top row of cells, the row labels (usually cases) must occupy the first column of cells, and there must be a dummy row/column label in cell(1,1). Purely numerical tables are used for curve fitting and statistical analysis, where the identity of the observations is immaterial. Row and column labels can be added just to help identify the data retrospectively, but they are also used in multivariate statistics to identify plotting symbols or label dendrograms. As long as there are no empty cells, such tables can be imported directly into SIMFIT by a [Paste] button, or be transformed into SIMFIT data files using program maksim. Cells can be delimited by spaces, commas, semicolons, or tabs, and there must be no empty cells. Further, if a space-delimited table has been copied to the clipboard from a text editor like Notepad, there must be no spaces within labels.

E.5.1 Pasting data from the clipboard directly into SIMFIT


In Microsoft Excel, OpenOffice Calc, or almost any text editor (such as Notepad) or word processing program, the rectangular table required is selected then copied to the clipboard. The appropriate SIMFIT program is opened and, when a data set is requested, the [Paste] button on the file opening control is pressed. If the table is rectangular, with no missing values, one of the following options will be available.

Data: use as a SIMFIT data file (no labels)

Data: use as a SIMFIT data file (with labels)

If the option to use as a SIMFIT data file is chosen from this SIMFIT clipboard control, the data will be written to a temporary file then analyzed. The names of these temporary files are displayed so they can be saved retrospectively. The options will only be available if the data are consistent with the SIMFIT format, otherwise error messages will be displayed.

E.5.2 Converting data from the clipboard into a SIMFIT le


This involves the use of the SIMFIT program maksim. The clipboard table is input into maksim exactly as just described for any SIMFIT program, but then there are many options.

Rows or columns selected from a check list can be suppressed or restored

Sub-tables can be selected by properties

Sub-tables can be Saved As ... SIMFIT data files

The advantage of this technique is that, from a single table copied to the clipboard, numerous selected sub-tables can be archived, so that retrospective analysis of data sub-sets can be performed.


E.6 Using spreadsheet output files to transfer data into SIMFIT


SIMFIT will accept application files (created by any of the data preparation applications) containing tables as well as SIMFIT data files, and the application files will be transformed into temporary SIMFIT data files. Otherwise, the application files can be transformed into SIMFIT data files using program maksim, in which case the options just described for clipboard data will be available.

E.6.1 Space-delimited les (.txt)


These files must not have spaces to represent empty cells, or spaces within labels. For example, use TV_set instead of TV set.

E.6.2 Comma-delimited files (.csv)


If it is wished to use commas to separate triples of figures before the decimal point, then the cell must be quoted. That is, writing 1,234.56 as "1,234.56" to represent 1234.56. This has to be done explicitly in Notepad, but Excel and Calc will add the quotes. When the SIMFIT file is output, it will have 1234.56, not 1,234.56. Note that using a comma for a decimal point, e.g., 1,5 for 1.5, is just not acceptable. Any comma pairs ,, will be taken to represent blank cells and be expanded to ,X,.
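To illustrate (a hypothetical two by three comma-delimited table), the lines

"1,234.56", 1.2, 1.3
2.1, 2.2, 2.3

would be interpreted as a 2 by 3 data table with 1234.56 in cell(1,1).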

E.6.3 Semicolon-delimited files (.csv)


Commas in numeric cells will still have to be quoted and semicolon pairs ;; will be taken to represent blank cells and be expanded to ;X;.

E.6.4 Tab-delimited files (.txt)


Commas in numeric cells will still have to be quoted and adjacent tabs will be interpreted as empty cells and be expanded to (tab)X(tab).

E.6.5 Unicode (.txt)


If SIMFIT program maksim fails to read Unicode files, then switch Notepad into ASCII text mode, read in the Unicode file, then export it again as an ASCII text file, which will then be acceptable.

E.6.6 Web documents (.xml, .html, .htm, .mht, .mhtml)


Such files must have the file extension .xml, .html, .htm, .mht, or .mhtml, and XML files must also follow this case-sensitive convention.

The tag <Table indicates the start of a table and </Table> indicates the end.
The tag <Row indicates the start of a row and </Row> indicates the end.
The tag <Cell indicates the start of a new cell and </Cell> indicates the end.

The following tokens are used: ss:Index="2" to indicate an empty cell in column 1, i.e., the next cell is cell 2; ss:Type="String" for a label; and ss:Type="Number" for a numeric cell. HTML files must use standard case-insensitive tags. Files exported from Microsoft Office Excel and OpenOffice Calc use these conventions.
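As a sketch of these conventions, here is a minimal hypothetical fragment in the style of the Excel XML Spreadsheet format, for a labeled 1 by 1 table with the dummy cell(1,1) left empty; real exported files carry additional namespace declarations and attributes.

<Table>
 <Row>
  <Cell ss:Index="2"><Data ss:Type="String">Col-1</Data></Cell>
 </Row>
 <Row>
  <Cell><Data ss:Type="String">Row-1</Data></Cell>
  <Cell><Data ss:Type="Number">1.1</Data></Cell>
 </Row>
</Table>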


E.7 Using simfit6.xls with Excel to create SIMFIT data files


The simfit4.xls macro remains useful for exporting curve fitting files from Excel, as it can use non-adjacent columns as well as checking for nondecreasing values in column 1 and positive values in column 3. However, it cannot deal with labels, so the more versatile simfit6.xls macro is now the preferred tool for creating export files in SIMFIT data file format from Excel. The macro also validates the data, checking for basic errors which would cause SIMFIT to reject the data table.

E.7.1 The functionality of simfit6.xls


The SIMFIT table can be located anywhere within the spreadsheet, i.e. it does not have to occupy Row 1 and Column 1 of the spreadsheet. The user informs the macro of the location of the SIMFIT table by selecting it with the mouse before the macro is invoked. The SIMFIT table can be either a data table or a labeled table. The macro notifies the user of the data table's dimensions and obtains confirmation that a SIMFIT export file is to be created. The macro validates the data table, reporting on any cells which contain invalid entries, i.e. cells which are either blank or contain non-numeric values. If the validation is successful, the user is prompted for a distinctive title for the table and for a filename for the export file. The export file is created with the SIMFIT-defined structure. Within the export file:

Cell values are held in the General Number format (for details see later);

If the data table contains more than 50 columns, the rows in the export file will be split into lines containing a maximum of 50 values. This enables SIMFIT to handle very wide data tables which could otherwise cause overflow problems;

Any blank labels are replaced by X and all labels are truncated to a maximum length of 20 characters.

E.7.2 The General Number format


The General Number format mirrors the way in which a value is displayed in the spreadsheet, and its use for creating the export file thereby places complete control of the output format in the hands of the user. In this connection the user should, however, note that the displayed value can depend on the width of the Excel column. If the width of a column containing a long entry is reduced, Excel may display a rounded version of the number. Alternatively the user can control the displayed value by applying a specific Excel number format to any cells or cell ranges in the spreadsheet. If the user specifies a number format containing commas to separate triples of digits before the decimal point, these will be suppressed in the export file and hence will not present any problems for SIMFIT.

E.7.3 Using the simfit6.xls macro


The simfit6.xls workbook will be located in the SIMFIT documents folder, e.g., C:\Program Files\Simfit\doc. In order to use the macro you may need to change the Excel macro security setting: navigate Tools>Macro>Security from the Menu Bar and set the security level to Medium. The simfit6.xls macro has been written and tested with the Microsoft Office XP Excel Version 2002, but should work satisfactorily with other versions from Excel 97 to 2008.


E.7.3.1 Step 1: Open the simfit6.xls workbook

The simfit6.xls workbook acts purely as a carrier for the simfit6 macro and contains no worksheets accessible to the user. The workbook must, however, be opened in the normal manner, in order to make the macro available. With the macro security level set to Medium, as described above, you will be warned that the workbook contains macros. Click the Enable Macros button to complete the opening procedure. After the workbook has been opened, there is no visible sign of the presence of simfit6.xls, because it operates in the background.
E.7.3.2 Step 2: Select the data table within the user's workbook

Open the user's workbook containing the SIMFIT table and activate the worksheet which holds the table. Next select the SIMFIT table with the mouse in order to inform the macro of the position and dimensions of the table. Selection is carried out by left-clicking the mouse on the top left cell of the table and dragging (moving the mouse with the left key held down) to the bottom right cell of the SIMFIT table. This is single-area selection, and simfit6.xls does not support multiple-area selections. So, if you have data in separate columns, join them up to form a contiguous block before selecting the table.
E.7.3.3 Step 3: Invoke the macro

The macro is invoked either by navigating Tools>Macro>Macros from the Menu Bar, or by Alt+F8 from the keyboard (meaning hold the Alt key down whilst pressing the F8 key). Click on Simfit in the list of macros, then click on the Run button, and respond to the various dialogue boxes which the macro uses for communicating with the user. If SIMFIT does not appear in the list of macros, check that Step 1 has been carried out successfully.
E.7.3.4 A tip on entering the pathname for the export file

If the macro progresses successfully to the point of creating the export file, you will be asked to provide a filename for the file and to indicate the folder in your file structure where you wish to place the export file. This is done by entering a filename in the form of a full pathname, such as C:\.......\.......\<filename>, where the entries between the reverse oblique (\) characters are the names of the intervening folders in your file structure. It can be difficult to enter this accurately if the folder is located at a deep level in the structure, so a way around this is to create a top-level folder to hold the export file. Your pathname will then be much shorter, e.g. C:\MySimfitExports\<filename>. If you wish to maintain the export files at a lower level of the file structure, you can use the facilities of Windows Explorer to move the files to your chosen location at the end of the SIMFIT session.


E.8 Using transformsim.xls with Excel to create SIMFIT data files


The transformsim.xls macro replicates the functionality of the simfit6.xls macro, but offers additional features for filling empty cells in the data table, for performing transformations of the data table, and for transposing the data table before the SIMFIT export file is created. Consequently the user information for simfit6.xls in the previous section applies generally to the transformsim.xls macro, and will therefore not be repeated here. The differences between the two products, and the additional features of the transformsim.xls macro, are described below.

E.8.1 The functionality of transformsim.xls


E.8.1.1 A significant difference between simfit6.xls and transformsim.xls

In contrast with simfit6.xls, which does not modify the data table held in the worksheet, transformsim.xls applies its editing and transformations to the actual worksheet, and leaves this modified worksheet in place after the macro has terminated. The user can save the modified worksheet at this point, but is strongly advised to save the worksheet with a different filename (using the SaveAs method), otherwise the original file will be overwritten by the modified version.
E.8.1.2 Filling empty cells found in the data table

If the initial validation of the data table finds empty cells, the user is informed of the number of empty cells, and can choose from the following options for processing these cells:

1. Report references of empty cells individually until either the user halts the process or all empty cells have been reported
2. Replace empty cells in rows by row average
3. Replace empty cells in rows by row-wise interpolation
4. Replace empty cells in columns by column average
5. Replace empty cells in columns by column-wise interpolation

Row-wise interpolation acts as follows: an empty cell range in the middle of a row will be filled by the average of the neighboring cells adjacent to the range; and an empty cell range which extends to the beginning or end of the row will be filled by the neighboring cell adjacent to the range. Column-wise interpolation applies the same algorithm to columns. Note that options 2 to 5 are applied to all rows (or columns) which contain empty cells, i.e. the user is not able to select a different option for each row or column. Note also that whereas options 2 to 5 will fill all empty cells in a data table, option 1 merely reports the empty cells and leaves the data table unchanged.
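For example, applying the row-wise interpolation rule just described to the hypothetical row

1.0  _  _  1.6  1.8

(where _ denotes an empty cell) fills both empty cells with the average of the adjacent neighbors, (1.0 + 1.6)/2 = 1.3, giving

1.0  1.3  1.3  1.6  1.8

whereas an empty range extending to the beginning or end of the row would simply be filled with the value of the single adjacent cell.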
E.8.1.3 Performing transformations of the data table

Transformations can only be carried out on a data table which contains no non-numeric cells and no empty cells. Unless these conditions are satisfied, the macro will terminate. The transformation menu offers the following options:

1. Normalize columns to length 1
2. Normalize columns to standard deviation 1
3. Centralize columns to mean 0
4. Centralize columns to mean 0 and normalize columns to standard deviation 1
5. Normalize rows to length 1


6. Normalize rows to standard deviation 1
7. Centralize rows to mean 0
8. Centralize rows to mean 0 and normalize rows to standard deviation 1

Multiple transformations can be performed in sequence: each transformation changes the data table, and a subsequent transformation will be applied to the modified data table. Option 4 is an example of a sequence of Option 3 followed by Option 2, but is offered as a separate option because this combination is frequently used. It is the user's responsibility to ensure that any chosen sequence of transformations is viable from a statistical analysis standpoint. The macro will report any normalization transformation for which the divisor evaluates to zero, as this would result in a division by zero error. However this occurrence is not treated as a catastrophic failure, as the transformation is declined and the data table is left unaltered.
E.8.1.4 Transposing the SIMFIT table

An option is provided to transpose the SIMFIT table when the SIMFIT export file is created. Note that this option operates only on the exported file and does not alter the table which is contained within the worksheet.
E.8.1.5 Inspecting and saving the modied worksheet

Because transformsim.xls makes successive changes to the actual worksheet, it can be quite useful to be able to inspect and/or save the modified worksheet at intermediate stages of processing, such as after filling empty cells, or after any transformation in a sequence, or before creating the final SIMFIT file. In order to view or save the worksheet the user must exit from the macro at an intermediate stage, and additional user exit points have been provided in the macro for this purpose. In order to resume processing, the user must either:

Select the table in the modified worksheet and re-run the macro; or

Re-load the original worksheet and re-run the macro, repeating the previous processing steps.
E.8.1.6 The History Log

Every time the transformsim macro is run, a new worksheet is inserted into the user's workbook to hold the History Log. Any History Log worksheet which is already present in the workbook (created by a previous macro run) is deleted. [Tip: if you wish to retain History Logs from previous macro runs, rename their worksheets (History Log 1, History Log 2, etc.) before running the macro.] The History Log records:

The sequence of processes which have been applied by the macro. This is important as an audit trail of how the final data table was derived from the original version, especially if the process sequence has to be repeated at some future time.

The processing time spent on the main iterative processes (to assist in estimating overall processing times for very large data tables).

Any errors which were detected during processing.

In order to ensure that the History Log travels with the SIMFIT file, its contents are appended to the SIMFIT file as additional lines between the delimiters begin transformationhistory and end transformationhistory.


E.9 Importing SIMFIT results tables into documents and spreadsheets


Every time SIMFIT performs an analysis, tables of the results are written to a current log file in strict ASCII text format, using cyclical default names f$result.0nm, and the ten most recent of these are always preserved for retrospective use.

E.9.1 Log files


Each time a new program is run, the existing log files are first renamed, e.g., f$result.010 is deleted, f$result.009 is renamed f$result.010, and so on, while f$result.txt is renamed f$result.001 and a new current log file called f$result.txt is created. If some results are likely to be useful, then you should save the current log file with a distinctive name before it is renamed or deleted, but note that you can also configure SIMFIT to request individual names from you for every new log file.

E.9.2 Extracting tables


It is likely that you will only need to use limited sections, such as tables of results, from the SIMFIT log files. So, open the log file in Notepad, copy the table required to the clipboard, open a new file, then paste in the table from the clipboard to create a table file.

E.9.3 Printing results


Results log files written by SIMFIT are text files, so tables and results formats will only be preserved if you browse or print the results file using a text editor, such as Notepad, which has been set up to use a monospaced font, like Courier New. So, to print SIMFIT results tables so that formatting is preserved, open the previously described results log file in Notepad, then select the option to print.

E.9.4 Importing tables into documents


You can import the whole of a SIMFIT log file, or just an arbitrary table copied from it, into a word processor such as Microsoft Word or OpenOffice Writer. Unfortunately, the tables will only display as ordered columns when using a fixed width font, such as Courier New, and you will have to edit the table in order to use a proportionally spaced font, like Arial or Times New Roman. The following sections explain how to import results tables into Excel and Word using standard features of these applications. In addition, a macro for automating the operations for a Word import is also described.

E.9.5 Importing a SIMFIT results log file table into Excel


Open the results log file in Notepad, Copy the required table and Paste it into Excel at a destination cell in Column A. Select the table, then navigate Data>Text to Columns to launch the Conversion Wizard, and proceed as follows.

a) Select Fixed Width as the original data type, then click Next.
b) Adjust the column break lines created by the Wizard to the required column boundaries, then click Next.
c) Select the format for the columns; choose General.
d) Click Finish to exit the Wizard.

If you want to change the general format on some column, then format the columns to the required number format by selecting the columns and navigating Format>Cells>Number>Number.


E.9.6 Converting a SIMFIT results log le table into a Word table (manual method)
The method explained below is suitable for converting a SIMFIT results log file table in which the column values are separated by one or more space characters. The Table feature in Word is the preferred method for handling tabular data. Existing tabular data which is held in text form can be converted into a Word table via the Convert>Text to Table option on the Table Menu, but this option relies on the presence of a single separator character between the values on each row. Hence it is necessary to edit the imported SIMFIT log file to replace the space separators by a single separator before using the Word Convert>Text to Table feature. The steps of the process are as follows.

Open the results log file in Notepad, Copy the required table and Paste it into a Word document.

Click the Show/Hide button on the Standard toolbar. This will make the non-printing characters visible on the screen. You should see a paragraph mark at each point where an end-of-line code has been inserted, and in particular at the end of the rows of tabular data.

Select the lines of text which are to be converted into a Word table. If the first row or column of the table contains labels, these row and column labels can be included in the selection, but see the note below. The starting and ending points of the selection must be precise, otherwise the process will produce an incorrect result: the selection must start with, and include, the paragraph mark immediately before the first row of the table; and must end with, and include, the paragraph mark which terminates the last row of the table.

Carry out these Search & Replace operations (Ctrl+H from the keyboard) using the Replace All option:
a) Replace two spaces by one space
b) Repeat the above until no multiple spaces remain in the selection
c) Replace <space><paragraph mark> by <paragraph mark>
d) Replace <paragraph mark><space> by <paragraph mark>
e) Replace <space> by <tab>

Re-select the table, but this time start the selection at the first character on the first row and end the selection after the paragraph mark which terminates the last row of the table.

Select the Word Convert>Text to Table option from the Table Menu. In the dialogue box which opens, Click the Tabs option in the Separate text at section, and the AutoFit to contents option in the Autofit behaviour section. Finally Click OK.

[Note: Row and Column labels must not contain internal spaces. Any spaces within labels should be converted to underscores (or other characters) before submitting the original file to SIMFIT.]

E.9.7 Converting a SIMFIT results log file table into a Word table using the ConvertToTable macro

The manual method described in the preceding section can be automated by using the ConvertToTable macro, which is contained in a Word document called ConvertToTable.doc. This document will be located in the SIMFIT folder, e.g., C:\Program Files\Simfit\doc. The ConvertToTable.doc macro has been written


and tested with the Microsoft Office XP Word Version 2002, but should work satisfactorily with other versions from Word 97 to 2008. In order to make the macro available for use with any Word document, ConvertToTable.doc must first be saved as a global template as follows:

a) Open the file ConvertToTable.doc;
b) To save the macro you may also need to change the Word macro security setting by navigating Tools>Macro>Security from the Menu Bar and setting the security level to Medium;
c) Open the Save As dialog box by File > Save As;
d) Select the option Document template (.dot extension) from the Save as type drop-down list;
e) Click the Save button;
f) Select Tools>Templates and Add-Ins from the Menu Bar. If ConvertToTable.dot does not appear in the list of Global templates and add-ins, click the Add button, select ConvertToTable.dot from the list of templates and click OK. Make sure that this entry is checked before clicking OK to close the Templates and Add-Ins dialogue box;
g) Quit Word. The template is already stored and you will only have to load and run it as explained below.

The following steps explain the use of the macro.
E.9.7.1 Step 1: Import the SIMFIT results log file material into Word

The import can be either an entire results log file or a selected section of the file. The import can be carried out either by opening the results log file directly in Word, or by copying the material from another editor, such as Notepad, and pasting the selection into a Word document.
E.9.7.2 Step 2: Click the Show/Hide button on the Standard toolbar

This will make the non-printing characters visible on the screen. You should see a new paragraph mark at each point where an end-of-line code has been inserted, and in particular at the end of the rows of tabular data.
E.9.7.3 Step 3: Select the lines of text which are to be converted into a Word table

If the first row or column of the table contains labels, these can be included in the selection, but see Section E.9.7.5 below. The starting and ending points of the selection must be precise, otherwise the macro will produce an incorrect result: the selection must start with, and include, the paragraph mark immediately before the first row of the table; and must end with, and include, the paragraph mark which terminates the last row of the table.
E.9.7.4 Step 4: Invoke the macro

To load the macro from the template list, proceed as follows: navigate Tools>Templates and Add-ins, mark the box for ConvertToTable, then Accept. To execute the macro, navigate Tools>Macro>Macros from the Menu Bar, or press the Alt+F8 combination from the keyboard, select ConvertToTable from the list of available macros, and Click the Run button. (If ConvertToTable does not appear in the macros list, check that the above instructions for saving ConvertToTable.dot as a global template have been carried out correctly.)


E.9.7.5 Space characters in Row and Column labels

Any space characters in these labels will cause incorrect results because the macro will interpret them as column separators. The best way around this problem is to use underscore characters instead of spaces when the labels are first created. Alternatively any spaces which are present in the labels can be replaced by underscore characters by manually editing the imported material before the macro is invoked. In either case the spaces can be reinstated by performing the reverse operation (replace underscore by space) after Step 4. Although this requires another manual edit, this is relatively fast because an entire top row or first column can be selected for the Replace operation (Replace on the Edit Menu, or Ctrl+H via the keyboard).
E.9.7.6 Deactivating or Removing the ConvertToTable.dot Global Template

If you do not wish to have this Global Template available the next time you start Word, it can be either deactivated or completely removed as follows: Select Tools > Templates and Add-Ins, then either deactivate ConvertToTable.dot by turning off the check box against this item, or remove ConvertToTable.dot by selecting the item and clicking the Remove button. Finally click OK.


E.10 Printing and importing SIMFIT graphs into documents


The primary high level graphics format created by SIMFIT is the Encapsulated PostScript Format, i.e., .eps files but, as not all applications can use these directly, you may have to create graphic image files in other formats. However, if you cannot benefit directly from the superior quality of such .eps files, you should never import SIMFIT graphics files into documents in .bmp, .pcx, .jpg, or .tif formats; only .emf, or better .png (from .eps), should be used.

E.10.1 Windows print quality


Any graph displayed by SIMFIT can be used directly to drive the printer using a high resolution bitmap. The resolution and size can be adjusted but this is only useful for a quick record.

E.10.2 Windows bitmaps and compressed bitmaps


You should never choose to save .bmp, .jpg, .tif, .pcx from the display unless you have a good reason for being satisfied with such large, poor resolution files.

E.10.3 Windows Enhanced metafiles (.emf)


The easiest way to use SIMFIT graphs in Windows is to save Enhanced Metafiles, i.e., .emf files, directly from SIMFIT. The quality of .emf files can be improved somewhat by configuring SIMFIT to use slightly thicker lines and bolder fonts than the default settings, which you can investigate. If you do not have a PostScript printer, and do not have SIMFIT configured to use GSview and Ghostscript, then this is the only course of action open to you, and you will miss out on a great many sophisticated SIMFIT plotting techniques. A further undesirable feature of using .emf files is that they can too easily have their aspect ratio changed within Windows programs, leading to ugly fonts.

E.10.4 PostScript print quality


If you have a PostScript printer, then hardcopy driven directly from the display will be of very high quality. If you do not have a PostScript printer, then install Ghostscript and GSview and similar print quality can be obtained on any printer.

E.10.5 PostScript graphics files (.eps)


The advantage of storing SIMFIT encapsulated PostScript graphics files is that they are very compact, can be printed at any resolution with any number of colors, and SIMFIT has many facilities for editing, re-sizing, rotating, overlaying, or making collages from such .eps files. A unique feature is that SIMFIT .eps files have a structured format, so they can easily be edited in a text editor, e.g., Notepad, to change sizes, fonts, line types, symbols, colors, labels, etc., retrospectively.

E.10.6 Ghostscript generated files


If you have Ghostscript installed, then SIMFIT can supply .eps files for transformation into other graphics formats such as .bmp, .jpg, .tif, or especially .pdf or .png.

Portable Document Format graphics files (.pdf)


If you use Adobe Acrobat, or can import Portable Document Format files, i.e., .pdf files generated from SIMFIT .eps files by Ghostscript, into your Windows application, then such true .pdf files are an excellent choice. However, beware of the fact that many applications simply embed bitmaps into PostScript or .pdf files, whereupon all the advantages are lost.


Portable Network Graphics files (.png)


Increasingly, the most versatile format for importing graphic image files into Windows programs, such as Microsoft Word and PowerPoint, or OpenOffice Writer or Impress, is the Portable Network Graphics format, as the compression used in .png files results in smaller files than .bmp or .jpg, and edge resolution is far superior. So the best way for users without PostScript printers to use SIMFIT graphs in Windows is to store graphs in .eps format, then create .png files at 72 dpi for small applications, like the web, where resolution is not important, but at 300 dpi or 600 dpi if you wish the graph to be printed or displayed at high resolution. The industry standard for scientific graphs is no longer .gif, it is .png, as these are free from patent problems and are increasingly being accepted by all applications and all operating systems.
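This conversion can also be scripted. The following is a minimal sketch, assuming the Ghostscript command-line executable (gswin64c on Windows, gs on Linux) is on the path; the file names are illustrative.

  # Minimal sketch: convert a SIMFIT .eps file to a .png at a chosen
  # resolution using Ghostscript. File names are illustrative.
  import subprocess

  def eps_to_png(eps_file, png_file, dpi=300, gs="gswin64c"):
      # png16m is Ghostscript's 24-bit colour PNG device; -dEPSCrop
      # clips the page to the EPS BoundingBox.
      subprocess.run(
          [gs, "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dEPSCrop",
           "-sDEVICE=png16m", f"-r{dpi}",
           f"-sOutputFile={png_file}", eps_file],
          check=True)

  eps_to_png("plot.eps", "plot_web.png", dpi=72)      # small, for the web
  eps_to_png("plot.eps", "plot_print.png", dpi=300)   # for printed documents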

E.10.7 Using Encapsulated PostScript (.eps) files directly


If you have access to a true PostScript printer, you can import SIMFIT .eps files directly into Word, but Word will then add a low resolution preview, so that what you see in the display may not be exactly what you get on printing. A Word document containing .eps files will print SIMFIT graphs at high resolution only on a PostScript printer; on non-PostScript printers the resolution may be poor and the graph may not be printed correctly. For this reason, the recommended way is to save .eps files then create .png files at a suitable resolution, either at the same time that the .eps file is created or retrospectively. To do this, SIMFIT must be configured to use GSview and Ghostscript, and these free packages can be downloaded from the SIMFIT or GSview websites. Note that, for some applications, GSview can add a preview to .eps files in the expectation that this preview will just be used for display and that the PostScript graph will be printed, but not all applications do this correctly. Another advantage of having SIMFIT configured to use the GSview package is that your archived .eps files can then be printed as professional quality stand alone graphs at high resolution on any printer. So, for maximum reliability on non-PostScript printers, you should import .png files.

Appendix F

The SIMFIT package


F.1 SIMFIT program files
F.1.1 Dynamic Link Libraries
The DLL files must be consistent, that is they must all be compiled and linked together at the same release. If they are upgraded you must replace the whole set, not just one or two of them. If they are not consistent (i.e. not all created at the same time) bizarre effects can result from inconsistent export tables.
numbers.dll

The academic version is a stand-alone library containing the public domain numerical analysis codes required to replace the NAG library routines used by SIMFIT. The software included is a selection from: BLAS, Linpack, Lapack, Minpack, Quadpack, Curfit (splines), Dvode and L-BFGS-B. However, the NAG library versions do not contain these numerical analysis codes, but call the NAG library DLLs indirectly through the NAG library version of the maths DLL.
maths.dll

The academic version contains in-line replacement code for the NAG library, and calls to the numbers library to satisfy dependencies. The methods described in the NAG library handbook are used for most of the routines, but some exploit alternative methods described in sources such as AS or ACM TOMS. The NAG library maths versions are mainly front ends to the NAG library DLLs, but they do contain some extra code.
menus.dll

This is the GUI interface to the Windows Win32 API that is responsible for creating the SIMFIT menus and all input/output. SIMFIT does not use resource scripts and all menus and tables are created on the fly. This DLL consists of a set of subroutines that transform data into a format that can be recognized by the arguments of the winio@(.) integer function of the Salford Software Clearwin Plus Windows Interface. When the calls have been processed, the resulting arguments are passed to the Clearwin interface. So menus is dependent on the clearwin DLL.
graphics.dll

This contains the graphics codes, and is dependent on the menus and clearwin DLLs.
simfit.dll

This consists of all the numerical analysis routines used by SIMFIT that are not in numbers.dll or maths.dll. It also contains numerous special routines to check data and advise users of ill-conditioned calculations, unsatisfactory data and so on. It depends on the maths, menus and graphics DLLs.
models.dll

This contains the model subroutines used in simulation and curve fitting. The basic model (Version 2.0) is rather limited in scope and there are many variations with models dedicated to specific uses. Most of these are consistent with maths.dll and numbers.dll but some use special functions that need enhanced versions of maths.dll. It is possible to upgrade the library of equations and use an enlarged set of user defined functions by upgrading this file alone. It depends on the menus, maths, and numbers DLLs.
clearwin.dll

This contains all the codes that require the Salford-Silverfrost runtime library salibc.dll. This, and the help DLL, are the only source codes in SIMFIT that must be compiled using the FTN95 compiler. Note that some versions of the clearwin DLL are also linked to the graphics DLL for reverse communication.
help.dll

This contains compiled HTML scripts. It depends on the menus DLL and must be compiled using the FTN95 compiler.
salibc.dll

This contains the Salford-Silverfrost runtime library to interface with the Windows API.

F.1.2 Executables
adderr

This takes in exact data for functions of one, two or three independent variables and adds random error to simulate experimental data. Replicates and outliers can be generated and there is a wide range of choice in the probability density functions used to add noise and the methods to create weighting factors.
average

This takes in x, y data points, calculates means from replicates if required, and generates a trapezoidal model, i.e. a sectional model for straight lines joining adjacent means. This model can then be used to calculate areas or fractions of the data above a threshold level, using extrapolation/interpolation, for any sub-section of the data range.
binomial

This is dedicated to the binomial, trinomial and Poisson distributions. It generates point mass functions, cumulative distributions, critical points, binomial coefficients and their sums, and tests if numbers supplied are consistent with a binomial distribution. Estimates of binomial probability values with 95% confidence levels can be calculated by the exact F method or approximate quadratic method, analysis of proportions is carried out and confidence contours for the trinomial distribution parameters can be plotted.
calcurve

This reads in curve-fitting data and creates a cubic spline calibration curve. This can then be used to predict x given y or y given x with 95% confidence levels. There is a wide range of procedures and weighting options that can be used for controlling the data smoothing.
chisqd

This is dedicated to the chi-square distribution. It calculates density and cumulative distribution functions as well as critical points, tests if numbers are consistent with a chi-square distribution, does a chi-square test on paired observed and expected values or on contingency tables and calculates the Fisher exact statistics and chi-square statistics with the Yates correction for 2 by 2 tables.
compare

This fits a weighted least squares spline with user-chosen smoothing factor to data sets. From these best-fit splines the areas, derivatives, absolute curvature and arc length can be estimated, and pairs of data sets can be compared for significant differences.


csafit

This is dedicated to estimating the changes in location and dispersion in flow cytometry data so as to express changes in ligand binding or gene expression in terms of estimated parameters.
deqsol

This simulates systems of differential equations. The user can select the method used, range of integration, tolerance parameters, etc. and can plot profiles and phase portraits. The equations, or specified linear combinations of the components, can be fitted to data sets.
editfl

This editor is dedicated to editing SIMFIT curve fitting files. It has numerous options for fusing and rearranging data sets, changing units of measurement and weighting factors, plotting data and checking for inconsistencies.
editmt

This is a general purpose numerical editor designed to edit SIMFIT statistical and plotting data files. It has a large number of functions for cutting and pasting, rearranging and performing arithmetical calculations with selected rows and columns. This program and EDITFL are linked into all executables for interactive editing.
editps

This editor is specifically designed to edit PostScript files. It can change dimensions, rotation, titles, text, etc., as well as overprinting files to form insets or overlays, and can group PostScript files together to form collages.
eoqsol

This item is for users who wish to study the effect of spacing and distribution of data points for optimal design in model discrimination.
exfit

This fits sequences of exponential functions and calculates best-fit parameters and areas under curves. It is most useful in the field of pharmacokinetics.
ftest

This is dedicated to the F distribution. It calculates test statistics, performs tests for consistency with the F distribution and does the F test for excess variance.
gcfit

This can be run in three modes. In mode 1 it fits sequences of growth curves and calculates best-fit parameters such as maximal growth rates. In mode 2 it fits survival models to survival data. In mode 3 it analyzes censored survival data by generating a Kaplan-Meier nonparametric survival estimate, finding maximum likelihood Weibull models and performing Cox analysis.
help

This item provides on-line help to SIMFIT users.


hlfit

This is dedicated to analyzing ligand binding data due to mixtures of high and low affinity binding sites where the response is proportional to the percentage of sites occupied plus a background constant level. It is most useful with dose response data.
inrate

This finds initial rates, lag times, horizontal or inclined asymptotes using a selection of models. It is most useful in enzyme kinetics and transport studies.


linfit

This does multi-linear regression and provides a variety of linear regression techniques such as overdetermined L1 fitting, generalized linear interactive modelling, orthogonal fitting, robust regression, principal components, etc.
makcsa

This simulates flow cytometry data for testing program CSAFIT.


makdat

This can generate exact data for functions of one, two or three independent variables, differential equations or user-defined equations. It can also create two and three dimensional plots.
makfil

This is designed to facilitate the preparation of data sets for curve fitting. It has many features to make sure that the user prepares a sensible well-scaled and consistent data file, and is also a very useful simple plotting program.
maklib

This collects SIMFIT data files into sets, called library files, to facilitate supplying large data sets for fitting, statistical analysis or plotting.
makmat

This facilitates the preparation of data files for statistical analysis and plotting.
maksim

This takes in tables with columns of data from data base and spread sheet programs and allows the user to create SIMFIT files with selected sub-sets of data, e.g. blood pressure for all males aged between forty and seventy.
mmfit

This fits sequences of Michaelis-Menten functions. It is most useful in enzyme kinetics, especially if two or more isoenzymes are suspected.
normal

This is dedicated to the normal distribution. It calculates all the usual normal statistics and tests if numbers are consistent with the normal distribution.
polnom

This fits all polynomials up to degree six and gives the user all the necessary statistics for choosing the best-fit curve for use in predicting x given y and y given x with 95% confidence limits.
qnfit

This is a very advanced curve-fitting program where the models can be supplied by the user or taken from a library, and the optimization procedures and parameter limits are under the user's control. The best-fit curves can be used for calibration, or estimating derivatives and areas. Best-fit surfaces can be plotted as sections through the surface and the objective function can be visualized as a function of any two estimated parameters.
rannum

This generates pseudo random numbers and random walks from chosen distributions.
rffit

This performs a random search, a constrained overdetermined L1 norm fit, then a quasi-Newton optimization to find the best-fit positive rational function. It is most useful in enzyme kinetics to explore deviations from Michaelis-Menten kinetics.
rstest

This does runs and signs tests for randomness plus a number of nonparametric tests.
run5

This program-manager runs the SIMFIT package. The executable is called w_simfit.exe.
sffit

This is used for tting saturation curves when cooperative ligand binding is encountered. It gives binding constants according to all the alternative conventions and estimates Hill plot extremes and zeros of the binding polynomial and its Hessian.
simplot

This takes in ASCII coordinate files and creates plots, bar charts, pie charts, surfaces and space curves. The user has a wealth of editing options to create publication quality hardcopy.
simstat

This describes the SIMFIT statistics options and does all the usual tests. In addition it does numerous statistical calculations such as zeros of polynomials, determinants, eigenvalues, singular value decompositions, time series, power function estimations, etc.
spline

This utility takes in spline coefficients from best-fit splines generated by CALCURVE and COMPARE and uses them for plotting and calculating areas, derivatives, arc lengths and curvatures.
ttest

This is dedicated to the t statistic. It calculates densities and critical values, tests if numbers are consistent with the t distribution, and does t and paired t tests after testing for normality and doing a variance ratio test.
usermod

This utility is used to develop user-defined models. It also plots functions, estimates areas by adaptive quadrature and locates zeros of user-defined functions and systems of simultaneous nonlinear equations.
change_simfit_version

This program can be used at any time to transform the current version of SIMFIT into one of the alternative versions, for instance transforming the academic version into a NAG library version. It does this by overwriting the current copies of the maths and numbers DLLs, so it is important that SIMFIT is not in use when this utility is executed.


F.2 SIMFIT data files


To use SIMFIT you must first collect your data into data files. These are ASCII text files with a two-line header section, then a table consisting of your m by n matrix of data values, followed, if required, by an optional trailer section, as follows.

Title                        :Header line 1
m n                          :Header line 2 (no. of rows and columns)
a(1,1) a(1,2) ... a(1,n)     :Data row 1
a(2,1) a(2,2) ... a(2,n)     :Data row 2
...
a(m,1) a(m,2) ... a(m,n)     :Data row m
k                            :Trailer line 1 (no. of appended lines)
appended line 1              :Trailer line 2
appended line 2              :Trailer line 3
...
appended line k              :Trailer line k + 1

There are special SIMFIT programs to prepare and edit such files: makfil or editfl for curve fitting data, and makmat or editmt for statistics. These editors have a very extensive range of functions. However, if you already have data in a spread sheet or data base program like Excel, you can copy selected tables to the clipboard and paste them directly into SIMFIT or into the special program maksim which allows you to select rows and columns to be written to a data file in the SIMFIT format. The next examples are SIMFIT test files chosen to illustrate some features of the SIMFIT data file format. Although all data files have an identical format, the meaning of the rows and columns must be consistent with the use intended. For instance, column 1 must be in nondecreasing order for curve fitting or MANOVA data sets, and this is best appreciated by viewing the test files designated as defaults for each procedure. Note that SIMFIT does not differentiate between floating point numbers and integers in the matrix table section of data files, and either spaces or commas can be used as separators. The title is quite arbitrary, but the row and column dimensions must be exact. The trailer can be omitted altogether, or may be used to store further items, such as labels, indicators, values, or starting estimates and limits. A particularly valuable feature of the trailer section is that there is a begin{. . . } . . . end{. . . } format which can be used to append labels, indicators, starting values, or parameter limits to the end of data files.
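As an illustration of this layout, here is a minimal Python sketch (the file name, data values and labels are invented for the example) that writes a small matrix in the SIMFIT data file format, with a begin{labels} . . . end{labels} block in the trailer.

  # Minimal sketch: write a SIMFIT-format data file (title, dimensions,
  # matrix, then a trailer using begin{labels} ... end{labels}).
  # The file name, values and labels are invented for the example.

  data = [[1.0, 2.5],
          [2.0, 3.7],
          [3.0, 5.1]]
  labels = ["A", "B", "C"]

  trailer = ["begin{labels}"] + labels + ["end{labels}"]

  with open("example.dat", "w") as f:
      f.write("Example 3 by 2 matrix\n")            # header line 1: title
      f.write(f"{len(data)} {len(data[0])}\n")      # header line 2: m n
      for row in data:                              # the m by n data table
          f.write(" ".join(f"{x:.4E}" for x in row) + "\n")
      f.write(f"{len(trailer)}\n")                  # no. of appended lines
      f.write("\n".join(trailer) + "\n")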

F.2.1 Example 1: a vector


Table F.1 illustrates a typical vector file (vector.tf1), with title, row and column dimensions, 1 column of data, then an unused trailer section.

Vector with components 1, 2, 3, 4, 5
5 1
1.0000E+00
2.0000E+00
3.0000E+00
4.0000E+00
5.0000E+00
1
Default line

Table F.1: Test file vector.tf1

F.2.2 Example 2: a matrix


Table F.2 illustrates a typical matrix file (matrix.tf1), with 5 rows and 5 columns.


Arbitrary 5 by 5 matrix
5 5
1.2000E+00, 4.5000E+00, 6.1000E+00, 7.2000E+00, 8.0000E+00
3.0000E+00, 5.6000E+00, 3.7000E+00, 9.1000E+00, 1.2500E+01
1.7100E+01, 2.3400E+01, 5.5000E+00, 9.2000E+00, 3.3000E+00
7.1500E+00, 5.8700E+00, 9.9400E+00, 8.8200E+00, 1.0800E+01
1.2400E+01, 4.3000E+00, 7.7000E+00, 8.9500E+00, 1.6000E+00
1
Default line

Table F.2: Test file matrix.tf1

F.2.3 Example 3: an integer matrix


Table F.3 has integer entries, and the trailer is being used to describe the meaning of the columns of observations in the test file binomial.tf3.

y,N,x for analysis of proportions
5, 3
23, 84, 1
12, 78, 2
31, 111, 3
65, 92, 4
71, 93, 5
3
Column 1 = y, no. successes
Column 2 = N, no. Bernoulli trials, N >= y >= 0
Column 3 = x, x-coordinates for plotting

Table F.3: Test file binomial.tf3

F.2.4 Example 4: appending labels


Table F.4 originally had the first 12 lines of the trailer for row labels to be plotted corresponding to the 12 rows (i.e., cases in the test file cluster.tf1). To use this method, the labels must be the first lines in the trailer section, and there must be at least as many labels as there are columns. Where both row and column labels are required, e.g. with biplots, column labels follow on from row labels. The current version of cluster.tf1 now uses a more versatile method to add labels, which will be described next.

F.2.5 Example 5: using begin ... end to add labels


Table F.5 illustrates the alternative method for adding labels to the trailer section, in this case for the rows in test file piechart.tf1.

F.2.6 Example 6: various uses of begin ... end


Table F.6 illustrates how to append starting clusters, indicator variables to include or suppress covariates, as well as labels to a data file. When such begin{. . . } . . . end{. . . } techniques are used, the appended data can be placed anywhere in the trailer section. Note that 17 rows and 17 labels have been suppressed to make the table compact, but the complete test file is kmeans.tf1.


Cluster analysis data, e.g. dendrogram
12 8
 1  4  2 11  6  4  3  9
 8  5  1 14 19  7 13 21
...
15 21  8  7 17 12  4 22
18
A-1
B-2
C-3
D-4
E-5
F-6
G-7
H-8
I-9
J-10
K-11
L-12
...

Table F.4: Test file cluster.tf1 (original version)

Advanced pie chart 1: fill styles
10 4
1.0, 1.0, 0.0, 15
1.0, 2.0, 0.0, 15
...
1.0, 10.0, 0.0, 15
12
begin{labels}
Style 1
Style 2
...
Style 10
end{labels}

Table F.5: Test file piechart.tf1

F.2.7 Example 7: starting estimates and parameter limits


The advanced curve-fitting program qnfit must have starting estimates and parameter limits appended to the data file in order to use expert mode model fitting. This example shows how this is done using begin{limits} . . . end{limits} with the test file gauss3.tf1, where the model has 10 parameters. Note that each line representing a limit triple must satisfy

[bottom limit] [starting estimate] [upper limit]

i.e., be in nondecreasing order, otherwise the parameter limits will not be acceptable. Test file gauss3.tf2 illustrates the alternative way to supply these 10 sets of lower limits, starting estimates, and upper limits for the 10 parameters to be estimated.


Data for 5 variables on 20 soils (G03EFF, Kendall and Stuart)
20 5
77.3 13.0  9.7 1.5 6.4
82.5 10.0  7.5 1.5 6.5
...
69.7 20.7  9.6 3.1 5.9
...
The next line defines the starting clusters for k = 3
begin{values}        <-- token to flag start of appended values
82.5 10.0  7.5 1.5 6.5
47.8 36.5 15.7 2.3 7.2
67.2 22.7 10.1 3.3 6.2
end{values}
The next line defines the variables as 1 = include, 0 = suppress
begin{indicators}    <-- token to flag start of indicators
1 1 1 1
end{indicators}
The next line defines the row labels for plotting
begin{labels}        <-- token to flag start of row labels
A
B
...
T
end{labels}

Table F.6: Test file kmeans.tf1

QNFIT EXPERT mode file: 3 Gaussians plus 7.5% relative error
150 3
-3.0000E+00, 4.1947E-03, 8.5276E-04
-3.0000E+00, 5.8990E-03, 8.5276E-04
...
1.5000E+01, 2.9515E-02, 2.4596E-03
27
...
begin{limits}
  0,  1,  2
  0,  1,  2
  0,  1,  2
 -2,  0,  2
  2,  4,  6
  8, 10, 12
0.1,  1,  2
  1,  2,  3
  2,  3,  4
  0,  0,  0
end{limits}

Table F.7: Test file gauss3.tf1
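Because a single out-of-order triple makes the whole limits set unacceptable, it can be worth checking a file before fitting. The following minimal Python sketch is a hypothetical helper, not part of SIMFIT; it verifies that every line of a begin{limits} . . . end{limits} block is nondecreasing.

  # Minimal sketch: check that every line in a begin{limits} ... end{limits}
  # block is a nondecreasing triple [bottom limit] [starting estimate]
  # [upper limit]. Hypothetical helper, not part of SIMFIT.

  def check_limits(filename: str) -> bool:
      ok, inside = True, False
      with open(filename) as f:
          for lineno, line in enumerate(f, start=1):
              text = line.strip()
              if text == "begin{limits}":
                  inside = True
              elif text == "end{limits}":
                  inside = False
              elif inside:
                  lo, start, hi = (float(x) for x in text.replace(",", " ").split())
                  if not lo <= start <= hi:
                      print(f"line {lineno}: {lo} <= {start} <= {hi} fails")
                      ok = False
      return ok

  print("acceptable" if check_limits("gauss3.tf1") else "not acceptable")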


F.3 SIMFIT auxiliary files


The test files consist of data sets that can be used to understand how SIMFIT works. You can use a test file with a program, then view it to appreciate the format before running your own data. Library files are just collections of names of test files so you can enter many files at the same time. This is very useful with statistics (e.g., ANOVA, multiple comparisons with simstat) and plotting (e.g., supplying ASCII coordinate files to simplot). Configuration and default files are used by SIMFIT to store certain parameter values that are relevant to some particular functions. Some files are created automatically and upgraded whenever you make significant changes, and some are created only on demand. All such configuration and default files are ASCII text files that can be browsed in the SIMFIT viewer. In general, the idea is that when a particular configuration proves satisfactory you make the file read-only to fix the current defaults and prevent SIMFIT from altering the settings. SIMFIT generates many temporary files and, if you exit from a program in an abnormal fashion (e.g., by Ctrl+Alt+Del), these are left in an unfinished state. Usually these would be automatically deleted, but expert users will sometimes want the facility to save temporary files on exit from SIMFIT, so this possibility is provided. You should not attempt to edit such files in a text editor but note that, if you suspect a fault may be due to a faulty configuration or default file, you can just delete them and SIMFIT will create new versions.

F.3.1 Test files (Data)


adderr.tf1      Data for adding random numbers using adderr
adderr.tf2      Data for adding random numbers using adderr
anova1.tf1      Matrix for 1 way analysis of variance in ftest or simstat
anova2.tf1      Matrix for 2 way analysis of variance in ftest or simstat
anova2.tf2      Matrix for 2 way analysis of variance in ftest or simstat
anova3.tf1      Matrix for 3 way analysis of variance in ftest or simstat
anova4.tf1      Matrix for groups/subgroups analysis of variance in ftest or simstat
anova5.tf1      Matrix for factorial ANOVA (2 factors, 1 block)
anova5.tf2      Matrix for factorial ANOVA (2 factors, 3 blocks)
anova5.tf3      Matrix for factorial ANOVA (3 factors, 1 block)
anova5.tf4      Matrix for factorial ANOVA (3 factors, 3 blocks)
anova6.tf1      Matrix for repeated measures ANOVA (5 subjects, 4 treatments)
average.tf1     Data for program average
barchart.tf1    Creates a barchart in simplot
barchart.tf2    Creates a barchart in simplot
barchart.tf3    Creates a barchart in simplot
barchart.tf4    Creates a barchart in simplot
barchart.tf5    Creates a barchart in simplot
barchart.tf6    Creates a barchart in simplot
barchart.tf7    Adds a curve to the barchart created from barchart.tf6
barcht3d.tf1    Creates a 3 dimensional barchart in simplot
barcht3d.tf2    Creates a 3 dimensional barchart in simplot
barcht3d.tf3    Creates a 3 dimensional barchart in simplot
binomial.tf1    Fifty numbers from a binomial distribution with N = 50, p = 0.5
binomial.tf2    Analysis of proportions with no effector values, i.e. X, N
binomial.tf3    Analysis of proportions with effector values, i.e. X, N, t
calcurve.tf1    Prepares a calibration curve in EXPERT mode using calcurve
calcurve.tf2    Predicts x given y with calcurve.tf1
calcurve.tf3    Predicts y given x with calcurve.tf1
chisqd.tf1      Fifty numbers from a chi-square distribution with ν = 10
chisqd.tf2      Vector of observed values to be used with chisqd.tf3
chisqd.tf3      Vector of expected values to be used with chisqd.tf2
chisqd.tf4      Matrix for Fisher exact test in chisqd or simstat
chisqd.tf5      Contingency table for chi-square test in chisqd or simstat
cluster.tf1     Data for multivariate cluster analysis in simstat


cluster.tf2     Data for multivariate cluster analysis in simstat
cochranq.tf1    Matrix for Cochran Q test
column1.tf1     Vector for 1 way ANOVA in ftest or simstat
column1.tf2     Vector for 1 way ANOVA in ftest or simstat
column1.tf3     Vector for 1 way ANOVA in ftest or simstat
column1.tf4     Vector for 1 way ANOVA in ftest or simstat
column1.tf5     Vector for 1 way ANOVA in ftest or simstat
column2.tf1     Vector for nonparametric correlation in rstest or simstat
column2.tf2     Vector for nonparametric correlation in rstest or simstat
column2.tf3     Vector for nonparametric correlation in rstest or simstat
compare.tf1     Use with compare to compare with compare.tf2
compare.tf2     Use with compare to compare with compare.tf1
cox.tf1         Survival data for Cox proportional hazards model
cox.tf2         Survival data for Cox proportional hazards model
cox.tf3         Survival data for Cox proportional hazards model
cox.tf4         Survival data for Cox proportional hazards model
csadat.tf1      Example of the preliminary flow cytometry format for csadat
csadat.tf2      Example of the preliminary flow cytometry format for csadat
csafit.tf1      Geometric type data with 15% stretch for csafit
csafit.tf2      Arithmetic type data with 5% translation for csafit
csafit.tf3      Mixed type data for csafit
deqsol.tf1      Library data for fitting LV1.tf1 and LV2.tf1 by deqsol
deqsol.tf2      Library data for fitting LV1.tf1 by deqsol
deqsol.tf3      Library data for fitting LV2.tf1 by deqsol
editfl.tf1      Data for editing by editfl
editfl.tf2      Data for editing by editfl
editfl.tf3      Data for editing by editfl
editfl.tf4      Data for editing by editfl
editmt.tf1      Data for editing by editmt
editmt.tf2      Data for editing by editmt
editmt.tf3      Data for editing by editmt
errorbar.tf1    Normal error bars (4 columns)
errorbar.tf2    Advanced error bars (6 columns)
exfit.tf1       Exact data for 1 exponential for fitting by exfit
exfit.tf2       Random error added to exfit.tf1 by adderr
exfit.tf3       Exact data for 2 exponentials for fitting by exfit
exfit.tf4       Random error added to exfit.tf3 by adderr
exfit.tf5       Exact data for Model 5 in exfit
exfit.tf6       Exact data for Model 6 in exfit
exfit.tf7       Exact data for concave down exponentials in exfit
ftest.tf1       Fifty numbers from the F distribution with m = 2, n = 5
gauss3.tf1      3 Gaussians: starting estimates by begin{limits}...end{limits}
gauss3.tf2      3 Gaussians: starting estimates from start of trailer section
gcfit.tf1       Exact data for model 3 in gcfit
gcfit.tf2       Random error added to gcfit.tf1 by adderr
glm.tf1         Normal errors, reciprocal link
glm.tf2         Binomial errors, logistic link
glm.tf3         Poisson errors, log link
glm.tf4         Gamma errors, reciprocal link
gompertz.tf1    Data for gcfit in survival mode 2
hlfit.tf1       Exact data for 1 site for fitting by hlfit
hlfit.tf2       Random error added to hlfit.tf1 by adderr
hlfit.tf3       Exact data for 2 sites for fitting by hlfit
hlfit.tf4       Random error added to hlfit.tf3


hotcold.tf1     Data for mmfit/hlfit/qnfit in isotope displacement mode
hotel.tf1       Data for Hotelling 1-sample T-square test
houses.tf1      Data for constructing a biplot
inhibit.tf1     Data for fitting mixed inhibition as v = f(S,I)
inrate.tf1      Data for models 1 and 2 in inrate
inrate.tf2      Data for model 3 in inrate
inrate.tf3      Data for model 4 in inrate
inrate.tf4      Data for model 5 in inrate
latinsq.tf1     Latin square data for 3 way ANOVA in ftest or simstat
iris.tf1        Iris data for K-means clustering (see manova1.tf5)
iris.tf2        Starting K-means clusters for iris.tf2
kmeans.tf1      Data for K-means cluster analysis
kmeans.tf2      Starting clusters for kmeans.tf1
ld50.tf1        Dose-response data for LD50 by GLM
line.tf1        Straight line data
linfit.tf1      Multilinear regression data for linfit
linfit.tf2      Multilinear regression data for linfit
logistic.tf1    Data for binary logistic regression
loglin.tf1      Data for log-linear contingency table analysis
lv1.tf1         Exact data for y(1) in the Lotka-Volterra differential equations
lv2.tf1         Exact data for y(2) in the Lotka-Volterra differential equations
maksim.tf1      Matrix for editing by maksim
maksim.tf2      Matrix for editing by maksim
manova1.tf1     MANOVA data: 3 groups, 2 variables
manova1.tf2     MANOVA data: 3 groups, 2 variables
manova1.tf3     MANOVA data: 2 groups, 5 variables
matrix.tf1      5 by 5 matrix for simstat in calculation mode
matrix.tf2      7 by 5 matrix for simstat in calculation mode
matrix.tf3      Positive-definite symmetric 4 by 4 matrix for simstat in calculation mode
matrix.tf4      Symmetric 4 by 4 matrix for simstat in calculation mode
matrix.tf5      25 by 4 matrix for simstat in correlation mode
meta.tf1        Data for Cochran-Mantel-Haenszel Meta Analysis test
meta.tf2        Data for Cochran-Mantel-Haenszel Meta Analysis test
meta.tf3        Data for Cochran-Mantel-Haenszel Meta Analysis test
mcnemar.tf1     Data for McNemar test
mmfit.tf1       Exact data for 1 Michaelis-Menten isoenzyme in mmfit
mmfit.tf2       Random error added to mmfit.tf1 by adderr
mmfit.tf3       Exact data for 2 Michaelis-Menten isoenzymes in mmfit
mmfit.tf4       Random error added to mmfit.tf3 by adderr
normal.tf1      Fifty numbers from a normal distribution with μ = 0, σ = 1
npcorr.tf1      Matrix for nonparametric correlation in rstest or simstat
pacorr.tf1      Correlation matrix for partial correlation in simstat
piechart.tf1    Creates a piechart in simplot
piechart.tf2    Creates a piechart in simplot
piechart.tf3    Creates a piechart in simplot
plot2.tf1       LHS axis data for double plot in simplot
plot2.tf2       LHS axis data for double plot in simplot
plot2.tf3       RHS axis data for double plot in simplot
polnom.tf1      Data for a quadratic in polnom
polnom.tf2      Predict x given y from polnom.tf1
polnom.tf3      Predict y given x from polnom.tf1
polnom.tf4      Fit after transforming to x = log(x), y = log(y/(1 − y))
qnfit.tf1       Quadratic in EXPERT mode for qnfit
qnfit.tf2       Reversible Michaelis-Menten data in EXPERT mode for qnfit


qnfit.tf3       Linear function of 3 variables in EXPERT mode for qnfit
rffit.tf1       2:2 Rational function data for rffit
rffit.tf2       1:2 Rational function data for rffit
rffit.tf3       2:2 Rational function data for rffit
rffit.tf4       2:3 Rational function data for rffit
rffit.tf5       4:4 Rational function data for rffit
robust.tf1      Normal.tf1 with 5 outliers
rstest.tf1      Residuals for runs test in rstest
sffit.tf1       Exact data for 1 site in sffit
sffit.tf2       Random error added to sffit.tf1 by adderr
sffit.tf3       Exact data for 2 sites in sffit
sffit.tf4       Random error added to sffit.tf3 by adderr
simplot.tf1     Error-bar data for simplot
simplot.tf2     Best-fit 1:1 to simplot.tf1 for simplot
simplot.tf3     Best-fit 2:2 to simplot.tf1 for simplot
spiral.tf1      Creates a 3 dimensional curve in simplot
spiral.tf2      Creates a 3 dimensional curve in simplot
spline.tf1      Spline coefficients for spline
strata.tf1      Data for stratified binomial logistic regression
surface.tf1     Creates a surface in simplot
surface.tf2     Creates a surface in simplot
surface.tf3     Creates a surface in simplot
surface.tf4     Creates a surface in simplot
survive.tf1     Survival data for gcfit in mode 3
survive.tf2     Survival data to pair with survive.tf1
survive.tf3     Survival data for gcfit in mode 3
survive.tf4     Survival data to pair with survive.tf3
survive.tf5     Survival data for gcfit in mode 3
survive.tf6     Survival data to pair with survive.tf5
times.tf1       Data for time series analysis in simstat
trinom.tf1      Trinomial contour plots in binomial
trinom.tf2      Trinomial contour plots in binomial
trinom.tf3      Trinomial contour plots in binomial
ttest.tf1       Fifty numbers from a t distribution with ν = 10
ttest.tf2       t test data for ttest or simstat
ttest.tf3       Data paired with ttest.tf2
ttest.tf4       t test data for ttest or simstat
ttest.tf5       Data paired with ttest.tf4
ttest.tf6       Data for t test on rows of a matrix
tukeyq.tf1      Matrix for ANOVA then Tukey Q test
ukmap.tf1       Coordinates for K-means clustering
ukmap.tf2       Starting centroids for ukmap.tf2
ukmap.tf3       UK coastal outline coordinates
vector.tf1      Vector (5 by 1) consistent with matrix.tf1
vector.tf2      Vector (7 by 1) consistent with matrix.tf2
vector.tf3      Vector (4 by 1) consistent with matrix.tf3
vfield.tf1      Vector field file (4 columns)
vfield.tf2      Vector field file (9 columns, i.e. a biplot)
weibull.tf1     Survival data for gcfit in mode 2
zigzag.tf1      Zig-zag data to illustrate clipping to boundaries

F.3.2 Library files (Data)


anova1.tfl      1-way ANOVA in ftest or simstat


convolv3.tfl    Data for fitting by qnfit using convolv3.mod
deqsol.tfl      Curve fitting data for deqsol (identical to deqsol.tf1)
editps.tfl      PostScript files for EDITPS
epidemic.tfl    Data for fitting epidemic differential equations
inhibit.tfl     Data for plotting mixed inhibition results
npcorr.tfl      Nonparametric correlation data for rstest or simstat
simfig1.tfl     Creates figure 1 in simplot
simfig2.tfl     Creates figure 2 in simplot
simfig3.tfl     Creates figure 3 in simplot
simfig4.tfl     Creates figure 4 in simplot
simplot.tfl     Identical to simfig1.tfl
spiral.tfl      Creates a spiral in simplot
qnfit.tfl       Parameter limits library file for qnfit
line3.tfl       Data for fitting three lines simultaneously by qnfit

F.3.3 Test files (Models)


camalot.mod     Model for Logarithmic Spiral as used in Camalots
cheby.mod       Model for Chebyshev expansion
convolve.mod    Model for a convolution between an exponential and gamma function
convolv3.mod    Version of convolve.mod for all components
dble_exp.mod    Chemical kinetic double exponential model
d01fcf.mod      Model with four variables for integration
ellipse.mod     Model for an ellipse in makdat/simplot/usermod
family2d.mod    Two dimensional family of diffusion equations
family3d.mod    Three dimensional family of diffusion equations
helix.mod       Model for a helix in makdat/simplot/usermod
if.mod          Model illustrating logical commands
impulse.mod     Model illustrating 5 single impulse functions
line3.mod       Model for 3 lines in qnfit
optimum.mod     Model for optimizing Rosenbrock's 2-dimensional test function in usermod
periodic.mod    Model illustrating 7 periodic impulse functions
rose.mod        Model for a rose in makdat/simplot/usermod
tangent.mod     Tangent to logarithmic spiral defined in camalot.mod
twister.mod     Projection of a space curve onto coordinate planes
updown.mod      Model that swaps definition at a cross-over point
updownup.mod    Model that swaps definition at two cross-over points
user1.mod       Model illustrating arbitrary models
deqmat.tf1      How to transform a system of differential equations
deqmat.tf2      How to transform a system of differential equations
deqmod1.tf1     Model for 1 differential equation
deqmod1.tf2     Model for 1 differential equation
deqmod1.tf3     Model for 1 differential equation
deqmod1.tf4     Model for 1 differential equation
deqmod1.tf5     Model for 1 differential equation
deqmod1.tf6     Model for 1 differential equation
deqmod2.tf1     Model for 2 differential equations
deqmod2.tf2     Model for 2 differential equations
deqmod2.tf3     Model for 2 differential equations
deqmod4.tf1     Model for 4 differential equations
deqpar1.tf1     Parameters for deqmod1.tf1
deqpar1.tf2     Parameters for deqmod1.tf2
deqpar1.tf3     Parameters for deqmod1.tf3
deqpar1.tf4     Parameters for deqmod1.tf4


deqpar1.tf5     Parameters for deqmod1.tf5
deqpar1.tf6     Parameters for deqmod1.tf6
deqpar2.tf1     Parameters for deqmod2.tf1
deqpar2.tf2     Parameters for deqmod2.tf2
deqpar2.tf3     Parameters for deqmod2.tf3
deqpar4.tf1     Parameters for deqmod4.tf1
usermod1.tf1    Function of 1 variable for usermod
usermod1.tf2    Function of 1 variable for usermod
usermod1.tf3    Function of 1 variable for usermod
usermod1.tf4    Function of 1 variable for usermod
usermod1.tf5    Function of 1 variable for usermod
usermod1.tf6    Function of 1 variable for usermod
usermod1.tf7    Function of 1 variable for usermod
usermod1.tf8    Function of 1 variable for usermod
usermod1.tf9    Function of 1 variable for usermod
usermod2.tf1    Function of 2 variables for usermod
usermod3.tf1    Function of 3 variables for usermod
usermodd.tf1    Differential equation for usermod
usermodn.tf1    Four functions for plotting by usermod
usermodn.tf2    Two functions of 2 variables for usermod
usermodn.tf3    Three functions of 3 variables for usermod
usermodn.tf4    Nine functions of 9 variables for usermod
usermods.tf1    Special functions with one argument
usermods.tf2    Special functions with two arguments
usermods.tf3    Special functions with three arguments
usermodx.tf1    Using a sub-model for function evaluation
usermodx.tf2    Using a sub-model for quadrature
usermodx.tf3    Using a sub-model for root-finding
usermodx.tf4    Using three sub-models for root-finding of an integral
usermodx.tf5    Using a sub-model to evaluate a multiple integral

F.3.4 Miscellaneous data files


cheby.data       Data required by cheby.mod
convolv3.data    Data for convolv3.mod
inhibit?.data    Data for inhibit.tfl
line?.data       line1.data, line2.data and line3.data for line3.tfl
simfig3?.data    Data for simfig3.tfl
simfig4?.data    Data for simfig4.tfl
y?.data          y1.data, y2.data and y3.data for epidemic.tfl

F.3.5 Parameter limits files


These files consist of lowest possible values, starting estimates and highest possible values for parameters used by qnfit and deqsol for constraining parameters during curve fitting. They are usually referenced by library files such as qnfit.tfl. See, for example, positive.plf, negative.plf and unconstrained.plf.

F.3.6 Error message files


When programs like deqsol, makdat and qnfit start to execute they open special files like w_deqsol.txt and w_qnfit.txt to receive all messages generated during the current curve fitting and solving of differential equations. Advanced SIMFIT users can inspect these files, and other files like iterate.txt, to get more details about any singularities encountered during iterations. If any serious problems are encountered using deqsol or qnfit, you can consult the appropriate *.txt file for more information.


F.3.7 PostScript example files


pscodes.ps      PostScript octal codes
psgfragx.ps     Illustrating psfragex.tex/psfragex.ps1
simfig1.ps      Example
simfig2.ps      Example
simfig3.ps      Example
simfig4.ps      Example
simfonts.ps     Standard PostScript fonts
ms_office.ps    Using MS Excel and Word
pspecial.i      Example PS specials (i = 1 to 10)

F.3.8 SIMFIT configuration files


These files are created automatically by SIMFIT and should not be edited manually unless you know exactly what you are doing, e.g., setting the PostScript color palette.

w_simfit.cfg    This stores all the important details needed to run SIMFIT from the program manager w_simfit.exe
w_ps.cfg        This stores all the PostScript configuration details
w_filter.cfg    This contains the current search patterns used to configure the file selection and creation controls
w_input.cfg     This holds the last filenames used for data input
w_output.cfg    This holds the last filenames used for data output
w_clpbrd.cfg    This holds the last file number x as in clipboard_x.txt
w_ftests.cfg    This holds the last NPTS, NPAR, WSSQ values used for F tests
w_result.cfg    This holds the filename of the latest results file
a_recent.cfg    Recently selected project files (all types)
c_recent.cfg    Recently selected project files (covariance matrices)
f_recent.cfg    Recently selected project files (curve fitting)
g_recent.cfg    Recently selected project files (graphics)
m_recent.cfg    Recently selected project files (matrix)
p_recent.cfg    Recently selected project files (PostScript)
v_recent.cfg    Recently selected project files (vector)
pspecial.cfg    Configuration file for PostScript specials

F.3.9 Graphics configuration files


These files can be created on demand from program simplot in order to save plotting parameters from the current plot for subsequent re-use.

w_simfig1.cfg    Configures simplot to use simfig1.tfl
w_simfig2.cfg    Configures simplot to use simfig2.tfl
w_simfig3.cfg    Configures simplot to use simfig3.tfl
w_simfig4.cfg    Configures simplot to use simfig4.tfl

F.3.10 Default files


These files save details of changes made to the SIMFIT defaults from several programs.

w_labels.cfg    Stores default plotting labels
w_module.cfg    Stores file names of executable modules


w_params.cfg    Stores default editing parameters
w_symbol.cfg    Stores default plotting symbols

F.3.11 Temporary files


These next two files are deleted then re-written during each SIMFIT session. You may wish to save them to disk after a session as a permanent record of files analyzed and created.

w_in.tmp     Stores the list of files accessed during the latest SIMFIT session
w_out.tmp    Stores the list of files created during the latest SIMFIT session

The results log file f$result.tmp is created anew each time a program is started that performs calculations, so it overwrites any previous results. You can save results retrospectively either by renaming this file, or else you can configure SIMFIT to ask you for a file name instead of creating this particular results file. SIMFIT also creates a number of temporary files with names like f$000008.tmp which should be deleted. If you have an abnormal exit from SIMFIT, the current results file may be such a file and, in such circumstances, you may wish to save it to disk. SIMFIT sometimes makes other temporary files, such as f$simfit.tmp with the name of the current program, but you can always presume that it is safe to delete any such files.

F.3.12 NAG library files (contents of list.nag)


Models

c05adf.mod    1 function of 1 variable
c05nbf.mod    9 functions of 9 variables
d01ajf.mod    1 function of 1 variable
d01eaf.mod    1 function of 4 variables
d01fcf.mod    1 function of 4 variables

Data

c02agf.tf1    Zeros of a polynomial
e02adf.tf1    Polynomial data
e02baf.tf1    Data for fixed knot spline fitting
e02baf.tf2    Spline knots and coefficients
e02bef.tf1    Data for automatic knot spline fitting
f01abf.tf1    Inverse: symposdef matrix
f02fdf.tf1    A for Ax = (lambda)Bx
f02fdf.tf2    B for Ax = (lambda)Bx
f02wef.tf1    Singular value decomposition
f02wef.tf2    Singular value decomposition
f03aaf.tf1    Determinant by LU
f03aef.tf1    Determinant by Cholesky
f07fdf.tf1    Cholesky factorisation
f08kff.tf1    Singular value decomposition
f08kff.tf2    Singular value decomposition
g02baf.tf1    Correlation: Pearson
g02bnf.tf1    Correlation: Kendall/Spearman
g02bny.tf1    Partial correlation matrix
g02daf.tf1    Multiple linear regression
g02gaf.tf1    GLM normal errors
g02gbf.tf1    GLM binomial errors
g02gcf.tf1    GLM Poisson errors
g02gdf.tf1    GLM gamma errors
g02haf.tf1    Robust regression (M-estimates)
g02laf.tf1    Partial Least Squares X-predictor data
g02laf.tf2    Partial Least Squares Y-response data


g02laf.tf3    Partial Least Squares Z-predictor data
g02wef.tf1    Singular value decomposition
g02wef.tf2    Singular value decomposition
g03aaf.tf1    Principal components
g03acf.tf1    Canonical variates
g03adf.tf1    Canonical correlation
g03baf.tf1    Matrix for Orthomax/Varimax rotation
g03bcf.tf1    X-matrix for procrustes analysis
g03bcf.tf2    Y-matrix for procrustes analysis
g03caf.tf1    Correlation matrix for factor analysis
g03ccf.tf1    Correlation matrix for factor analysis
g03daf.tf1    Discriminant analysis
g03dbf.tf1    Discriminant analysis
g03dcf.tf1    Discriminant analysis
g03eaf.tf1    Data for distance matrix: calculation
g03ecf.tf1    Data for distance matrix: clustering
g03eff.tf1    K-means clustering
g03eff.tf2    K-means clustering
g03faf.tf1    Distance matrix for classical metric scaling
g03ehf.tf1    Data for distance matrix: dendrogram plot
g03ejf.tf1    Data for distance matrix: cluster indicators
g04adf.tf1    ANOVA
g04aef.tfl    ANOVA library file
g04caf.tf1    ANOVA (factorial)
g07bef.tf1    Weibull fitting
g08aef.tf1    ANOVA (Friedman)
g08aff.tfl    ANOVA (Kruskall-Wallis)
g08agf.tf1    Wilcoxon signed ranks test
g08agf.tf2    Wilcoxon signed ranks test
g08ahf.tf1    Mann-Whitney U test
g08ahf.tf2    Mann-Whitney U test
g08cbf.tf1    Kolmogorov-Smirnov 1-sample test
g08daf.tf1    Kendall coefficient of concordance
g08raf.tf1    Regression on ranks
g08rbf.tf1    Regression on ranks
g10abf.tf1    Data for cross validation spline fitting
g11caf.tf1    Stratified logistic regression
g12aaf.tf1    Survival analysis
g12aaf.tf2    Survival analysis
g12baf.tf1    Cox regression
j06sbf.tf1    Time series


F.4 Acknowledgements
History of SIMFIT
Early on Julian Shindler used analogue computing to simulate enzyme kinetics but SIMFIT really started with mainframes, cards and paper tape; jobs were submitted to be picked up later, and visual display was still a dream. James Crabbe and John Kavanagh became interested in computing and several of my colleagues, notably Dennis Waight, used MINUITS for curve fitting. This was an excellent program allowing random searches, Simplex and quasi-Newton optimization, imposing constraints by mapping sections of the real line to (−∞, ∞). Bob Foster and Tom Sharpe helped us to make this program fit models like rational functions and exponentials, and Adrian Bowman gave valuable statistical advice. By this time we had a mainframe terminal and Richard Woolfson and Jean Pierre Mazat used the NAG library to find zeros of polynomials as part of our collaboration on cooperativity algebra, while Francisco Solano, Paul Leff and Jean Wardell used the NAG library random number generators for simulation. Andrew Wright advanced matters when we started to use the NAG library differential equation solvers and optimization routines to fit differential equations to enzyme kinetic problems, and Mike Pettipher and Ian Gladwell provided hints on how to do this. Phil McGinlay took on the task of developing pharmacokinetic and diffusion models, Manuel Roig joined us to create the optimal design programs, while Naveed Buhkari spent time developing the growth curve fitting models. When the PC came along, Elina Melikhova worked on the flow cytometry analysis programs, Jesus Cachaza helped with improving the goodness of fit and plotting functions, Ralph Ackerman explained the need for a number of extra features, while Robert Burrows and Igor Plesner provided feedback on the development of the user supplied model routines, which were tested by Naveed Prasad.

The Windows and Linux versions of SIMFIT


SIMFIT has been subjected to much development on numerous platforms, and the latest project has been to write substitutes for NAG routines using the public domain code that is now available for linear algebra, optimization, and differential equation solving, so that there are now two SIMFIT versions.

The academic version: this is a completely free version, designed for student use.

The professional version: this has more features than the academic version, but it requires the NAG library DLLs.

Geoff Morgan, Sven Hammarling, David Sayers, and John Holden from NAG have been very helpful here. Also, the move into the Windows environment, guided by Steve Bagley and Abdul Sattar, was facilitated by the Clearwin Plus interface provided by Salford Software, helped by their excellent support team, particularly Richard Putman, Ivan Lucas, and Paul Laidler. I thank Mark Ferguson for his interest and support during this phase of the project, and John Clegg for developing the Excel macros. Mikael Widersten was very helpful in the development of the Linux/Wine version, Stephen Langdell suggested numerous improvements which were incorporated at Version 6, and Martin Wills helped develop the website.

Important collaborators
Although I am very grateful for the help that all these people have given I wish to draw attention to a number of people whose continued influence on the project has been of exceptional value. Reg Wood has been an unfailing source of help with technical mathematics, Eos Kyprianou has provided constant advice and criticism on statistical matters, Len Freeman has patiently answered endless questions about optimization software, Keith Indge has been a great source of assistance with computational techniques, David Carlisle showed me how to develop the SIMFIT PostScript interface and Francisco Burguillo has patiently tested and given valuable feedback on each revision. Above all, I thank Robert Childs who first persuaded me that mathematics, statistics and computing make a valuable contribution to scientific research and who, in so doing, rescued me from having to do any more laboratory experiments.


Public domain code


SIMFIT has a computational base of reliable numerical methods constructed in part from public domain code that is contained in w_maths.dll and w_numbers.dll. The main source of information when developing these libraries was the comprehensive NAG Fortran library documentation, and the incomparable Handbook of Mathematical Functions (Dover, by M.Abramowitz and I.A.Stegun). Numerical Recipes (Cambridge University Press, by W.H.Press, B.P.Flannery, S.A.Teukolsky, and W.T.Vetterling) was also consulted, and some codes were used from Numerical methods of Statistics (Cambridge University Press, by J.F.Monahan). Editing was necessary for consistency with the rest of SIMFIT but I am extremely grateful for the work of the numerical analysts mentioned below because, without their contributions, the SIMFIT package could not have been created.

BLAS, LINPACK, LAPACK [Linear algebra]


T.Chen, J.Dongarra, J. Du Croz, I.Duff, S.Hammarling, R.Hanson, R.J.Kincaid, F.T.Krogh, C.L.Lawson, C.Moler, G.W.Stewart, and others.

MINPACK [Unconstrained optimization]


K.E.Hillstrom, B.S.Garbow, J.J.More.

LBFGSB [Constrained optimization]


R.H.Byrd, P.Lu-Chen, J.Nocedal, C.Zhu.

DVODE [Differential equation solving]


P.N.Brown, G.D.Byrne, A.C.Hindmarsh.

SLATEC [Special function evaluation]


D.E.Amos, B.C.Carlson, S.L.Daniel, P.A.Fox, W.Fullerton, A.D.Hall, R.E.Jones, D.K.Kahaner, E.M.Notis, R.L.Pexton, N.L.Schryer, M.K.Weston.

CURFIT [Spline fitting]


P.Dierckx

QUADPACK [Numerical integration]


E.De Doncker-Kapenga, D.K.Kahaner, R.Piessens, C.W.Uberhuber

ACM [Collected algorithms]


487 (J.Durbin), 493 (M.A.Jenkins), 495 (I.Barrodale, C.Philips), 516 (J.W.McKean, T.A.Ryan), 563 (R.H.Bartels, A.R.Conn), 698 (J.Berntsen, A.Genz), 707 (A.Bhalla, W.F.Perger), 723 (W.V.Snyder), 745 (M.Goano), 757 (A.J.McLeod).

AS [Applied statistics]
6 and 7 (M.J.R.Healey), 63 (G.P.Bhattacharjee, K.L.Majumder), 66 (I.D.Hill), 91 (D.J.Best, D.E.Roberts), 94, 177 and 181 (J.P.Royston), 109 (G.W.Cran, K.J.Martin, G.E.Thomas), 111 (J.D.Beasley, S.G.Springer), 136 (J.A.Hartigan, M.A.Wong), 171 (E.L.Frome), 190 (R.E.Lund), 196 (M.D.Krailko, M.C.Pike), 226 and 243 (R.V.Lenth), 275 (C.G.Ding), 280 (M.Conlon, R.G.Thomas).


Index
LATEX, 232, 234, 355

Abramovitz functions, 295, 315
Adair constants, 331
Adams method, 303
Adaptive quadrature, 310, 311
Adderr (program), 373
    adding random error, 32
    simulating experimental error, 33
Affinity constants, 331
Airy functions and derivatives, 296, 315
Akaike AIC, 62
Aliasing, 36, 56
Allosterism, 333
Analysis of categorical data, 125
Analysis of proportions, 125
Angular transformation, 292
ANOVA, 112
    1-way and Kruskal-Wallis, 112
    2-way and Friedman, 115
    3-way and Latin squares, 117
    factorial, 118
    groups and subgroups, 117
    introduction, 112
    power and sample size, 187, 259
    repeated-measurements, 121, 158
    table, 42
Arbitrary graphical objects, 225
Arc length, 218
Archives, 11, 40
Arcsine transformation, 292
Areas, 59, 69, 208, 212, 299
ARIMA, 172, 174
Aspect ratio, 225, 229
Assigning observations to groups, 161
Association constants, 29, 63, 331
Asymmetrical error bars, 240
Asymptotes, 208
Asymptotic steady states, 209
AUC, 59, 73, 77, 208, 212
Autocorrelation functions, 172
Autoregressive integrated moving average, 174
Average (program), 373
    AUC by the trapezoidal method, 212
Bar charts, 23, 81, 236

Bar charts with error bars, 115
Bayesian techniques, 161
Bernoulli distribution, 283
Bessel functions, 296, 315
Beta distribution, 290, 294
Binary data, 51
Binary logistic regression, 51, 56
Binding constants, 29, 64, 195, 197, 278, 331
Binding polynomial, 195, 278
Binomial (program), 373
    analysis of proportions, 125
    Cochran-Mantel-Haenszel test, 127
    error bars, 242
Binomial coefficient, 293
Binomial distribution, 190, 283
Binomial test, 103, 185
Bioassay, 56, 72, 131, 279
Biplots, 165, 359
Bivariate confidence ellipses 1: basic theory, 245
Bivariate confidence ellipses 2: regions, 246
Bivariate normal distribution, 132, 287
Bonferroni correction, 78
Bound-constrained quasi-Newton optimization, 328
BoundingBox, 228
Box and whisker plots, 81, 115, 235
Bray-Curtis distance measure, 141
Bray-Curtis similarity, 247
Brillouin diversity index, 194
Burst times, 208
Calcurve (program), 373
    constructing a calibration curve, 73
Calibration, 69, 72
Canberra distance measure, 141
Canonical correlation, 138
Canonical variates, 158
Categorical variables, 35, 36, 53
Cauchy distribution, 288
Censoring, 28, 180
Central limit theorem, 287
Centroids, 251
change_simfit_version (program), 376
Chebyshev approximation, 320
Chebyshev inequality, 183


Chi-square distribution, 91, 289
Chi-square test, 39, 98, 150
Chisqd (program), 373
Cholesky factorization of a matrix, 204
Classical metric scaling, 142
Clausen integral, 294, 315
Clipboard, 11, 360
Clipping graphs, 228, 243, 248-250, 353
Cluster analysis, 140, 145, 247, 266
Cochran Q test, 103
Cochran-Mantel-Haenszel test, 127
Cochran's theorem, 289
Coefficient of kurtosis, 79
Coefficient of skewness, 79
Coefficient of variation, 183
Communalities, 164
Compare (program), 373
    model-free fitting, 211
Composition of functions, 265
Concentration at half maximum response, 63
Condition number, 25, 26, 69, 201, 261
Confidence limits, 14, 21, 77, 131, 240, 242, 257
    binomial parameter, 190
    correlation coefficient, 191
    normal distribution, 191
    Poisson parameter, 190
    trinomial distribution, 191
Confidence region, 150, 245, 246
Confluent hypergeometric functions, 295
Contingency tables, 53, 56, 99
Continuous distributions, 285
Contours, 24, 69, 260, 261
Contrasts, 121
Convolution integral, 272, 311
Cooperativity, 29, 64, 195, 278
Correlation, 278
    95% confidence ellipse, 246
    canonical, 138
    coefficient, 188
    coefficient confidence limits, 191
    Kendall-tau and Spearman-rank (nonparametric), 135
    matrix, 39, 133, 149
    partial, 136
    Pearson product moment (parametric), 132
    residuals, 40
    scattergrams, 244
Cosine integral, 294, 315
Covariance matrix, 51, 60, 77, 82, 133, 204, 205
    inverse, eigenvalues and determinant, 83
    parameter, 40
    principal components analysis, 149
    singular, 43
    symmetry and sphericity, 84

testing for equality, 156 zero off-diagonal elements, 69 Covariates, 36 Cox regression, 56, 176, 181 Cross validation, 216 Cross-over points, 274 Csat (program), 374 analysing ow cytometry data, 277 Cumulative distribution functions, 315 Curvature, 218 Curve tting advanced programs, 69 cooperative ligand binding, 64 differential equations, 69 exponentials, 27, 59 growth curves, 28, 66 high/low afnity binding sites, 63 Lotka-Volterra predator-prey equations, 33 model free, 211 multi Michaelis-Menten model, 64 multifunction mode, 69 positive rational functions, 64 summary, 25, 59 surfaces, 262 survival curves, 67 user friendly programs, 26 Cylinder plots, 239 Data base interface, 15 Data le format, 377 Data mining, 79, 140, 145 Data smoothing, 171 Dawson integral, 295, 315 Debye functions, 294, 315 Deconvolution, 65 by curve tting, 311 graphical, 40, 271 numerical, 272 Degrees of freedom, 39 Dendrograms, 140, 247249 Deqsol (program), 374 orbits, 270 phase portraits, 269 simulating differential equations, 33 Derivatives, 69, 73, 218 Design of experiments, 43, 259 Determinants, 200 Deviances, 52 Deviations from Michaelis-Menten kinetics, 30 Differences between parameter estimates, 40 Differencing, 172 Differential equations compiled models, 330 tting, 69

Index

395

orbits, 270 phase portraits, 269 transformation, 304 user dened models, 302 Diffusion from a plane source, 267, 268 Diffusion into a capillary, 334 Digamma function, 295, 315 Dirac delta function, 297, 315 Discrete distribution functions, 283 Discriminant analysis, 161 Discriminant functions, 158 Dispersion, 91, 109 Dissimilarity matrix, 140 Dissociation constants, 29 Distance matrix, 140 Distribution Bernoulli, 283 beta, 89, 290, 294, 334 binomial, 50, 89, 125, 242, 254, 283, 373 bivariate normal, 287, 337 Cauchy, 33, 288, 334 chi-square, 38, 91, 99, 289, 373 Erlang, 290 exponential, 56, 289, 334 extreme value, 56 F, 38, 259, 289, 374 gamma, 50, 89, 290, 293, 334 Gaussian, 286 geometric, 284 hypergeometric, 99, 284 log logistic, 291 logistic, 291, 334 lognormal, 89, 287, 334 Maxwell, 334 multinomial, 284 multivariate normal, 287 negative binomial, 284 non-central, 195, 291 noncentral F in power calculations, 259 normal, 50, 89, 238, 286, 334, 375 plotting pdfs and cdfs, 254, 256 Poisson, 50, 89, 91, 238, 256, 285, 373 Rayleigh, 334 t, 38, 288, 376 trinomial, 131, 257, 373 uniform, 89, 286 Weibull, 28, 56, 89, 176, 290, 334 Diversity indices, 194 Dose response curves, 74, 131, 279 Dot product, 322 Doubling dilution, 50, 279 Dummy indicator variables, 53, 56, 101 Dummy variables, 36 Dunn-Sidak correction, 78

Durbin-Watson test, 40, 62
Dvips, 355
Dynamic link libraries, 372
EC50, 74
ED50, 74
Editfl (program), 374
   editing a curve fitting file, 14
   recommended ways to use, 13
Editing
   curve fitting files, 14
   matrix/vector files, 15
   PostScript files, 338
Editmt (program), 374
   editing a matrix/vector file, 15
Editps (program), 374
   aspect ratios and shearing, 229
   composing graphs, 234
   rotating and scaling graphs, 228
   text formatting commands, 350
Eigenvalues, 69, 148, 200, 261
Eigenvectors, 200
Elliptic integrals, 296, 315
Entropy, 194
Enzyme kinetics
   burst phase, 209
   competitive inhibition, 336
   coupled assay, 209, 333
   deviations from Michaelis-Menten, 30, 64
   fitting inhibition data, 273
   fitting rational functions, 64
   fitting the Michaelis-Menten equation, 29, 64
   inhibition, 262
   inhibition by competing substrate, 336
   isoenzymes, 29, 65
   isotope displacement, 65
   lag phase, 209
   Michaelis-Menten pH dependence, 336
   Michaelis-Menten progress curve, 333
   mixed inhibition, 273, 336
   MWC activator/inhibitor, 337
   MWC allosteric model, 333
   noncompetitive inhibition, 336
   ordered bi bi, 336
   ping pong bi bi, 336
   progress curves, 209
   reversible Michaelis-Menten, 336
   substrate activation, 64
   substrate inhibition, 64
   time dependent inhibition, 336
   transients, 209
   uncompetitive inhibition, 336
Eoqsol (program), 374
Epidemic differential equations, 69

Erlang distribution, 290
Error bars, 14, 81, 115
   asymmetrical, 240
   barcharts, 238
   binomial parameter, 242
   calculated interactively, 21, 26, 241
   end caps, 220
   log odds, 242
   log odds ratios plots, 243
   multiple, 240
   plotting, 21, 26
   skyscraper and cylinder plots, 239
   slanting, 240
Error message files, 386
Error messages, 26
Error tolerances, 309
Estimable parameters, 43, 51, 200
Estimating percentiles, 56
Euclidean norm, 321
Euler's gamma, 322
Evidence ratio, 62
Excel, 362
Exfit (program), 374
   fitting exponentials, 27, 59
Expected frequencies, 98
Experimental design, 43
Exponential distribution, 289
Exponential functions, 27, 59, 331
Exponential growth, 332
Exponential integral, 294, 315
Exponential survival, 56, 180
Extrapolation, 273
Extreme value survival, 56, 181
F distribution, 289
F test, 39, 62, 93, 107
Factor analysis, 163
Factor levels, 56
Factorial ANOVA, 118
Families of curves, 267, 268
Fast Fourier transform (FFT), 199
Fermi-Dirac integrals, 294, 315
Files
   analyzed, 11
   archive, 11
   ASCII plotting coordinates, 12, 21, 225
   ASCII text, 15
   created, 11
   curve fitting, 14
   data, 377
   editing curve fitting files, 14
   editing matrix/vector files, 15
   error, 386
   format, 13, 15
   graphics configuration, 21
   library, 14, 384
   matrix/vector, 15
   model, 385
   multiple selection, 11
   names, 11, 13
   parameter limits, 69, 386
   pdf, 8
   polygon, 225
   PostScript, 387
   project, 11
   results, 9, 13
   temporary, 388
   test, 10, 15, 381
   view, 10
Fisher exact Poisson test, 91, 256
Fisher exact test, 99, 186
Fitting models
   basic principles, 35
   generalized linear models, 37
   limitations, 36
   linear models, 37
   nonlinear models, 38
   survival analysis, 38
Fitting several models simultaneously, 69
Flow cytometry, 277, 374
Fonts, 223
   Greek alphabet, 19
   Helvetica-Bold, 23
   ZapfDingbats, 19
Fresnel integrals, 295, 315
Freundlich isotherm, 333
Friedman test, 116
Ftest (program), 374
Gamma distribution, 290, 293
Gamma function, 293, 315
Gauss pdf function, 297, 315
Gaussian distribution, 286, 332
Gcfit (program), 374
   fitting growth curves, 28, 66, 278
   fitting survival curves, 67
   survival analysis, 255
Gear's method, 302, 303
Generalized linear models, 25, 50
Geometric distribution, 284
GLM, 25, 50
Gompertz growth, 332
Gompertz survival, 332
Goodness of fit, 25, 38-40, 60
Graphics
   SIMFIT character display codes, 349
   2D families of curves, 267
   3D families of curves, 268
   adding extra text, 342
   adding logos, 353
   advanced bar charts, 236
   advanced interface, 17
   arbitrary diagrams, 224
   arbitrary objects, 225
   arrows, 223
   aspect ratios and shearing, 229
   bar charts, 23, 81, 115, 236
   binomial parameter error bars, 242
   biplots, 165
   bitmaps and chemical equations, 234
   box and whisker plots, 81, 115, 235
   changing line and symbol types, 341
   changing line thickness and plot size, 339
   changing PS fonts, 339
   changing title and legends, 340
   characters outside the keyboard set, 344
   clipping, 243, 248-250
   contours, 263
   correlations and scattergrams, 244
   cylinder plots, 81
   deconvolution, 40, 65, 271
   decorative fonts, 343
   deleting graphical objects, 341
   dendrograms, 140, 247
   dilution curves, 279
   double plots, 22
   editing SIMFIT PS files, 338
   error bars, 81, 220, 240
   extending lines, 221
   extrapolation, 273
   filled polygons, 225
   first time user's guide, 16
   flow cytometry, 277
   font size, 230
   fonts, 222
   generating error bars, 241
   growth curves, 278
   half normal plot, 80
   histograms, pdfs and cdfs, 22, 89
   ISOLatin1Encoding vector, 346
   K-means clustering, 145, 250, 251
   labelling, 253
   letter size, 223
   line thickness, 223, 230
   line types, 220
   Log-Odds plot, 242
   mathematical equations, 232
   models with cross-over points, 274
   moving axes, 226
   moving labels, 226
   multivariate normal plot, 81
   normal plot, 80
   normal scores, 90
   objects, 223
   parameter confidence regions, 257
   perspective effects, 235
   phase portraits, 269
   pie charts, 24
   plotting sections of 3D surfaces, 262
   plotting the objective function, 261
   plotting user defined models, 309
   principal components, 149, 252
   probability distributions, 254, 256
   projecting onto planes, 265
   random walks, 258
   rotation and scaling, 228
   saving configuration details, 21
   scattergrams, 81, 115, 150, 252
   scree plot, 151, 252
   simple interface, 16
   size, shape and clipping, 228
   skyscraper plots, 24, 81, 237
   special effects, 353
   species fractions, 278
   splitting axes, 231
   standard fonts, 343
   StandardEncoding vector, 345
   stretch-clip-slide, 339
   stretching, 243, 248-250
   subsidiary figures as insets, 277
   surfaces and contours, 260
   surfaces, contours and 3D bar charts, 24
   survival analysis, 255
   symbol types, 219
   SymbolEncoding vector, 347
   text, 222
   three dimensional bar charts, 237
   three dimensional scatter diagrams, 266
   three dimensional space curves, 264
   time series plot, 80
   transforming data, 226
   warning about editing PS files, 338
   ZapfDingbatEncoding vector, 348
   zero centered rod plot, 80
Greek alphabet, 19
Greenhouse-Geisser epsilon, 123
Growth curves, 28, 278
GSview/Ghostscript, 19, 355
Half normal plot, 80
Half saturation points, 74
Hanning filter, 171
Hazard function, 176, 285, 290
Heaviside unit function, 297, 315
Helmert matrix, 121
Help (program), 374

Hessian, 25, 26, 60, 69, 261
   binding polynomial, 195, 278
Hill equation, 74
Hinges, 79
Histograms, 22, 89, 91
Hlfit (program), 374
   fitting a dilution curve, 279
   fitting High/Low affinity sites, 29, 63
Hodges-Lehmann location estimator, 192
Hotelling's T² test, 83, 121, 124, 149, 156
Hotelling's generalized T₀² statistic, 155
Huynh-Feldt epsilon, 123
Hyperbolic and inverse hyperbolic functions, 315
Hypergeometric distribution, 284
Hypergeometric function, 295, 296
IC50, 65, 74
IFAIL, 26
Ill-conditioned problems, 261
Immunoassay, 279
Impulse functions
   periodic, 276, 297
   single, 275, 297
Incomplete beta function, 294, 315
Incomplete gamma function, 293, 315
Independent variables, 36
Indicator variables, 56, 137, 138
Indices of diversity, 194
Initial conditions, 304
Initial rates, 208
Inrate (program), 374
   rates, lags and asymptotes, 208
Insets, 277
Integrated hazard function, 285
Integrating 1 function of 1 variable, 310, 311, 326
Integrating n functions of m variables, 310, 327
Inverse functions, 315
Inverse prediction, 74
IOSTAT, 26
Isoenzymes, 29, 65
Isotope displacement curve, 65, 331
Jacobi elliptic integrals, 296
Jacobian, 299, 302, 303
K-means cluster analysis, 145, 250, 251
Kaplan-Meier estimate, 176, 255
Kelvin functions, 296, 315
Kendall coefficient of concordance, 110
Kendall's tau, 135
Kernel density estimation, 198
Kinetic isotope effect, 65
Kolmogorov-Smirnov 1-sample test, 78, 88, 256
Kolmogorov-Smirnov 2-sample test, 78, 86, 87, 94

Kronecker delta function, 297, 315
Kruskal-Wallis test, 113
Kummer functions, 295, 315
Kurtosis, 79
Labelling statistical graphs, 253
Lag times, 208, 209, 333
Last in first out, 299
Latent variables, 163
Latin squares, 117, 197
LD50, 56, 67, 74, 131
Legendre polynomials, 296, 315
Levenberg-Marquardt, 25
Leverages, 42, 52, 62
Library files, 14, 384
Likelihood ratio test statistic, 100
Limits of integration, 309
Line thickness, 223
Linear regression, 25
Linfit (program), 375
   constructing a calibration curve, 73
   multilinear regression, 42
Link functions for GLM, 51
Loadings, 149
Log files, 366
Log logistic distribution, 291
Log rank test, 255
Log transform, 293
Log-linear model, 54, 101
Log-Odds plot, 125, 240, 242
Log-Odds-Ratios plot, 56, 127, 130
Logistic distribution, 291
Logistic equation, 67
Logistic growth, 28, 332, 336, 337
Logistic polynomial regression, 56
Logistic regression, 51, 56, 125, 131
Logit model, 334, 337
Lognormal distribution, 287
Lotka-Volterra, 33, 302
LU factorization of a matrix, 201
M-estimates, 45
Mahalanobis distance, 40, 158, 161, 205
Makcsa (program), 375
Makdat (program), 375
   simulating exact data, 31
Makfil (program), 375
   making a curve fitting file, 14
   recommended ways to use, 13
Maklib (program), 375
   making a library file, 14
Makmat (program), 375
   making a matrix/vector file, 15
Maksim (program), 375
   transforming data into SIMFIT format, 15
Mallows Cp, 42, 62
Manifest variables, 163
Mann-Whitney U test, 78, 86, 87, 96, 193
MANOVA, 81, 124, 153, 287
Mantel-Haenszel log rank test, 176
Matched case control studies, 58
Mathematical constants, 322
Matrix
   Ax = b full rank case, 205
   Ax = b in L1, L2 and L∞ norms, 206
   Cholesky factorization, 204
   determinant, inverse, eigenvalues, eigenvectors, 200
   evaluation of quadratic forms, 205
   hat, 42
   LU factorization, 201
   multiplication, 205
   norms and condition numbers, 201
   pseudo inverse, 205
   QR factorization, 203
   singular value decomposition, 200
Mauchly sphericity test, 84, 121
Maximum growth rate, 66, 74
Maximum likelihood, 28, 36, 45, 89, 176
Maximum size, 66, 74
McNemar test, 102
Means, 14
   an important principle, 21
   warning about fitting means, 25, 26
Median of a sample, 192
Median test, 109
Meta Analysis, 127, 243
Method of moments, 89
Michaelis pH functions, 333
Michaelis-Menten equation, 75, 331
   fitting, 29
   pH dependence, 336
Microsoft Office, 355
Minimizing a function, 328
Minimum growth rate, 66, 74
Missing data, 364
Mmfit (program), 375
   fitting isotope displacement kinetics, 65
   fitting the Michaelis-Menten equation, 29
   fitting the multi Michaelis-Menten model, 29, 64
Model discrimination, 39, 62, 69
Model free fitting, 211
Models
   log10 law, 335
   Adair constants isotope displacement, 331
   Adair constants saturation function, 331
   arctangent, 335
   Arrhenius rate constant, 333
   beta cdf, 334
   beta pdf, 334
   binding constants isotope displacement, 331
   binding constants saturation function, 331
   binding to one site, 333
   bivariate normal, 337
   Briggs-Haldane, 33
   Cauchy cdf, 334
   Cauchy pdf, 334
   competitive inhibition, 336
   convolution integral, 272
   cooperative ligand binding, 64
   cross-over points, 274
   damped simple harmonic motion, 300, 335
   differential equations, 330
   diffusion into a capillary, 301, 334, 336
   double exponential plus quadratic, 335
   double logistic, 335
   epidemic differential equations, 69
   error tolerances, 309
   exponential cdf, 334
   exponential growth, 332
   exponential pdf, 334
   exponentials, 27, 59
   Freundlich isotherm, 333
   from a dynamic link library, 299
   gamma pdf, 334
   gamma type, 335
   Gaussian plus exponential, 335
   Gaussian times exponential, 335
   general P-accumulation DE, 330
   general S-depletion DE, 330
   generalized inhibition, 333
   generalized linear, 50
   GLM, 50
   Gompertz, 28, 66
   Gompertz growth, 332
   Gompertz survival, 332
   growth curves, 28, 66
   H/L sites isotope displacement, 331
   high/low affinity sites, 63, 331
   Hill, 209, 335
   inhibition by competing substrate, 336
   irreversible MM P-accumulation DE, 330
   irreversible MM progress curve, 333
   irreversible MM S-depletion DE, 330
   isotope displacement, 65
   lag phase to steady state, 208, 209, 333
   limits of integration, 309
   linear plus reciprocal, 335
   logistic, 28, 66, 67
   logistic cdf, 334
   logistic growth (1 variable), 332
   logistic growth (2 variables), 336
   logistic growth (3 variables), 337
   logit, 334, 337
   lognormal cdf, 334
   lognormal pdf, 334
   Lotka-Volterra, 33, 269, 302
   Maxwell pdf, 334
   membrane transport DE, 330
   Michaelis pH functions, 333
   Michaelis-Menten, 29, 209
   Michaelis-Menten pH dependence, 336
   Michaelis-Menten plus diffusion, 333
   mixed inhibition, 273, 336
   Monod-Wyman-Changeux allosterism, 333
   monomolecular, 28, 66, 332
   Mualen equation, 334
   multi Michaelis-Menten, 64, 331
   multi MM isotope displacement, 331
   multilinear, 42
   MWC activator/inhibitor, 337
   noncompetitive inhibition, 336
   nonparametric, 211
   normal cdf, 334
   normal pdf, 334
   order n : n rational function, 331
   ordered bi bi, 336
   overdetermined, 102
   parametric, 280, 281
   ping pong bi bi, 336
   polynomial in one variable, 73, 331
   polynomial in three variables, 337
   polynomial in two variables, 335
   power law, 335
   Preece and Baines, 332
   probit, 76, 334, 337
   progress curve, 209
   proportional hazards, 178, 181
   quadratic binding, 333
   rational function, 30
   rational function in one variable, 64
   rational function in two variables, 336
   Rayleigh pdf, 334
   reversible Michaelis-Menten, 336
   Richards, 28, 66, 332
   saturated, 102
   segmented, 274
   sine/cosine, 335
   sinh/cosh, 335
   splines, 211, 214
   sum of exponentials, 331
   sum of Gaussians, 271, 332
   sum of trigonometric functions, 331
   survival, 67, 176
   tanh, 335
   three lines, 302
   time dependent inhibition, 336
   transition state rate constant, 333
   uncompetitive inhibition, 336
   up/down exponential, 335
   up/down logistic, 335
   upper or lower semicircle, 335
   upper or lower semiellipse, 335
   user defined, 34, 280, 281, 299
   Von Bertalanffy, 28, 66
   Von Bertalanffy DE, 28, 32, 330
   Weibull cdf, 334
   Weibull pdf, 334
   Weibull survival, 67, 176, 332
Monod-Wyman-Changeux allosteric model, 333
Monomolecular growth, 332
Mood-David equal dispersion tests, 109
Morse dot wave function, 298, 315
Moving averages, 171
Mualen equation, 334
Multinomial distribution, 284
Multiple error bars, 240
Multiple file selection, 11
Multiple statistical tests, 78
Multivariate analysis of variance, 153
Multivariate biplots, 359
Multivariate normal distribution, 81, 287
Multivariate normal plot, 81
NAG library, 1, 11, 26, 372, 388, 390
Negative binomial distribution, 284
Non-central distributions, 195, 291
Non-metric scaling, 144
Non-seasonal differencing, 172
Nonlinear regression, 59
Nonparametric tests, 108
   chi-square, 99
   Cochran Q, 103
   correlation, 135
   Friedman, 116
   goodness of fit, 102
   Kolmogorov-Smirnov 1-sample, 88
   Kolmogorov-Smirnov 2-sample, 94
   Kruskal-Wallis, 113
   Mann-Whitney U, 96
   sign, 104
   Wilcoxon signed-ranks, 97
Normal (program), 375
Normal distribution, 191, 286
Normal plot, 80
Normal scores, 90
Norms of a vector, 45

Number needed to treat, 129
Objective function, 25, 26, 261
Observed frequencies, 98
Odds, 125, 242
Odds ratios, 56, 127, 243
Offsets, 180
OpenOffice, 355
Operating characteristic, 184
Optimization, 263, 328
Orbits, 270
Order statistics, 90
Ordinal scaling, 144
Orthonormal contrasts, 121
Outliers, 25, 26, 33, 192
   in regression, 45, 80
Over-dispersion, 91
Overdetermined model, 56, 102
Paired t test, 93
Parameters
   confidence contours, 131, 257
   confidence limits, 60, 190, 242
   correlation matrix, 39
   estimable, 43, 51
   limits files, 386
   redundancy, 25, 107
   significant differences between, 40, 257
   standard errors, 43, 51, 257
   starting estimates, 25, 26, 69, 304
   t test and p values, 39
Parametric equations, 280, 281
Partial autocorrelation functions, 172
Partial clustering, 142
Partial correlation, 136
Paste, 11
Pearson product moment correlation, 132
Percentiles, 56, 131
pH
   Michaelis functions, 333
   Michaelis-Menten kinetics, 336
Pharmacokinetics, 27, 59, 212
Phase portraits, 269
Pie charts, 24
Pielou evenness, 194
Plotting transformed data, 226
Plotting user defined models, 309
Poisson distribution, 91, 99, 190, 285
Polnom (program), 375
   constructing a calibration curve, 73
Polygamma function, 295
Polynomial, 73
   Horner's method, 319
Portable document format, 370

Portable network graphics, 371
Positive-definite symmetric matrix, 204
Postfix notation, 299
PostScript
   SIMFIT character display codes, 349
   adding extra text, 342
   changing line and symbol types, 341
   changing line thickness and plot size, 339
   changing PS fonts, 339
   changing title and legends, 340
   characters outside the keyboard set, 344
   creating PostScript text files, 350
   decorative fonts, 343
   deleting graphical objects, 341
   driver interface, 19
   editing SIMFIT PS files, 338
   editps text formatting commands, 350
   example files, 387
   GSview and Ghostscript, 355
   ISOLatin1Encoding vector, 346
   specials, 235, 353
   standard fonts, 343
   StandardEncoding vector, 345
   summary, 19
   SymbolEncoding vector, 347
   user defined dictionary, 223
   warning about editing PS files, 338
   ZapfDingbatEncoding vector, 348
Power and sample size, 183, 259
   1 binomial sample, 185
   1 correlation, 188
   1 normal sample, 186
   1 variance, 188
   2 binomial samples, 185
   2 correlations, 188
   2 normal samples, 187
   2 variances, 188
   chi-square test, 189
   Fisher exact test, 186
   k normal samples (ANOVA), 187
Predator-prey equations, 302
Preece and Baines, 28, 332
Presentation graphics, 235
Principal components analysis, 148, 252, 266
Principal coordinates, 142
Probit analysis, 131
Probit model, 334, 337
Procrustes analysis, 151
Profile analysis, 157, 158
Progress curve, 209, 333
Project archives, 11, 40
Projecting space curves onto planes, 265
Proportional hazards model, 56, 178, 181
Pseudo inverse, 205

PSfrag, 222, 232, 355
Psi function, 295, 315
Qnfit (program), 375
   advanced curve fitting, 69
   calculating error bars, 241
   calibration, 73
   estimating AUC, 73
   estimating derivatives, 73
   graphical deconvolution, 271
   numerical deconvolution, 272
QR factorization of a matrix, 203
Quadratic binding model, 333
Quadratic forms, 205
Quadrature, 309-311, 326
Qualitative variables, 35, 36, 56
Quantal data, 51
Quantitative variables, 35
Quartiles, 79
Quartimax rotation, 152
Quasi-Newton, 25, 69, 328
R-squared test, 40, 42
Random walks, 258
Randomized block, 103
Rank deficiency, 51
Rannum (program), 375
   random permutations and Latin squares, 197
   random walks, 258
Rate constants, 333
Rational functions, 30, 64
Rectified sine half-wave function, 315
Rectified sine wave function, 298, 315
Rectified triangular wave function, 298, 315
Reduced major axis line, 134
Regression
   L1 norm, 45
   L2 norm, 45
   L∞ norm, 45
   binary logistic, 56
   comparing parameter estimates, 40
   Cox, 176, 181
   generalized linear, 25, 50, 125
   linear, 25
   logistic, 51, 56, 125
   logistic polynomial, 56
   multilinear, 42
   nonlinear, 25, 26, 38
   on ranks, 49
   orthogonal, 44
   reduced major and major axis, 44, 134
   robust, 45
Relaxation times, 209
Repeated-measurements design, 103, 121, 158

Replicates, 14, 21
   warning about fitting means, 21, 25, 26
Residuals, 25, 26, 39, 40, 52, 60, 106
   deviance, 62
   studentized, 42, 62
Results, 366
Reverse Polish, 299
Rffit (program), 375
   fitting positive rational functions, 30, 64
Richards growth model, 332
Robust parameter estimates, 192, 193
Robust regression, 45
Roots of a polynomial of degree n - 1, 200
Roots of equations, 32, 309, 310, 313, 325
Rosenbrock's function, 263
Rotating graphs, 228
Rstest (program), 376
   nonparametric tests, 108
Run test, 40, 105
Run5 (program), 376
Running medians, 171
Runs up and down test for randomness, 108
Sample size, 259
Saturated model, 102
Saturation function, 29
Sawtooth graph, 225
Sawtooth wave function, 298, 315
Scalar product, 322
Scaling
   classical metric, 142
   non-metric (ordinal), 144
Scatchard plot, 21
   warning about uncritical use, 30
Scattergrams, 81, 115, 150, 244, 252
Schwarz Bayesian criterion, 62
Scores, 149
Scree plot, 138, 151, 252
Seasonal differencing, 172
Segmented models, 274
Sensitivity analysis, 31
Sffit (program), 376
   cooperativity analysis, 195
   fitting a saturation function, 29
   fitting cooperative ligand binding, 29, 64
Shannon diversity index, 194
Shapiro-Wilks test, 40, 79, 90, 93
Sign test, 39, 40, 104
Signal-to-noise ratio, 183
Simfit
   character display codes, 349
   configuration files, 387
   default files, 387
   dynamic link libraries, 372
   error message files, 386
   error messages, 26
   file format, 13, 15
   goodness of fit statistics, 38
   library files, 384
   model files, 385
   Open..., 10
   parameter limits files, 386
   Save As..., 10
   saving results, 9
   starting estimates, 25, 26
   temporary files, 388
   test files, 10, 381
   the main menu, 7
Similarity matrix, 140
Simplot (program), 376
   creating a simple graph, 21
Simpson's rule, 310, 311
Simstat (program), 376
   1-sample t test, 88
   1-way ANOVA and Kruskal-Wallis, 112
   2-way ANOVA and Friedman, 115
   3-way ANOVA and Latin squares, 117
   all possible pairwise tests, 87
   analysis of proportions, 125
   binomial test, 103
   chi-square and Fisher exact tests, 99
   Cochran Q test, 103
   Cochran-Mantel-Haenszel test, 127
   constructing a calibration curve, 73
   cooperativity analysis, 195
   data exploration, 79
   determinant, inverse, eigenvalues, eigenvectors, 200
   exhaustive analysis of a multivariate normal matrix, 81
   exhaustive analysis of an arbitrary matrix, 81
   exhaustive analysis of an arbitrary vector, 79
   F test, 107
   factorial ANOVA, 118
   Fisher exact Poisson test, 91
   groups and subgroups ANOVA, 117
   Kolmogorov-Smirnov 1-sample test, 88
   Kolmogorov-Smirnov 2-sample test, 94
   lags and autocorrelations, 172
   Mann-Whitney U test, 96
   McNemar test, 102
   non-central distributions, 195
   nonparametric correlation, 135
   paired t test, 93
   parameter confidence limits, 190
   Pearson correlation, 132
   power and sample size, 183, 259
   random permutations and Latin squares, 197
   run test, 105
   Shapiro-Wilks test, 90
   sign test, 104
   singular value decomposition, 200
   solving Ax = b, 205
   statistical tests, 88
   t and variance ratio tests, 91
   trinomial confidence regions, 131
   Tukey Q test, 114
   Wilcoxon signed-ranks test, 97
   zeros of a polynomial, 200
Simulation
   2-dimensional families of curves, 267
   3-dimensional families of curves, 268
   adding error, 33
   differential equations, 33
   experimental error, 33
   plotting parametric equations, 280, 281
   plotting user defined models, 309
   summary, 31
Sine integral, 294, 315
Singular value decomposition, 43, 200
Skewness, 79
Skyscraper plots, 24, 81, 239
Slanting error bars, 240
Slopes, 208
Space curves, 264
Spearman's rank, 135
Special functions, 293, 315
Species fractions, 195, 278
Spence integral, 294, 315
Sphericity test, 84, 121
Spline (program), 376
Splines, 73, 211, 214
Spreadsheet, 15
Spreadsheet files, 361
Spreadsheet tables, 358
Square root transformation, 292
Square wave function, 298, 315
Standard distributions, 195
Starting estimates, 69, 304
Statistics
   analysis of proportions, 125
   ANOVA 1-way, 112
   ANOVA 2-way, 115
   ANOVA 3-way, 117
   binomial test, 103
   Bonferroni correction, 78
   canonical variates, 158
   chi-square test, 39, 99, 150
   chi-square test on observed and expected frequencies, 98
   cluster analysis, 140, 145
   Cochran Q test, 103
   Cochran-Mantel-Haenszel test, 127
   correlation (canonical), 138
   correlation (nonparametric), 135
   correlation (parametric), 132
   correlation (partial), 136
   distribution from nonlinear regression, 38
   Dunn-Sidak correction, 78
   Durbin-Watson test, 40
   F test, 39, 107
   Fisher exact Poisson test, 91
   Fisher exact test, 99
   Friedman test, 116
   groups and subgroups ANOVA, 117
   K-means cluster analysis, 145
   Kolmogorov-Smirnov 1-sample test, 78, 88, 256
   Kolmogorov-Smirnov 2-sample test, 78, 94
   Kruskal-Wallis test, 113
   Latin squares, 117
   log rank test, 255
   Mann-Whitney U test, 78, 96
   MANOVA, 153
   Mantel-Haenszel log rank test, 176
   Mantel-Haenszel test, 255
   McNemar test, 102
   Meta Analysis, 127
   multiple tests, 78
   multivariate cluster analysis, 140
   non-central distributions, 195, 291
   nonparametric tests, 102
   performing tests, 78
   plotting cdfs and pdfs, 254, 256
   power and sample size, 183, 259
   principal components analysis, 148, 252
   R-squared test, 40
   run test, 40, 105
   Shapiro-Wilks test, 40
   sign test, 39, 40, 104
   standard distributions, 195
   summary, 78
   t test, 39, 88, 91
   trinomial confidence regions, 131
   Tukey Q test, 78, 114
   variance ratio test, 91, 188
   Wilcoxon rank-sum test, 96
   Wilcoxon signed-ranks test, 97
   Yates's correction to chi-square, 99
Steady states, 29, 208, 333
Strata, 58
Stretching graphs, 243, 248-250
Struve functions, 295, 315
Studentized residuals, 42
Substrate activation, 30, 64

Substrate inhibition, 30, 64
Sum of squares and products matrix, 133
Surfaces, 24, 260
Survival analysis, 176, 255
   fitting survival curves, 28
   general principles, 38
   statistical theory, 285
   using generalized linear models, 56, 180
Survivor function, 176, 285
SVD, 43, 51, 149, 200
Swap-over points, 274
Symmetric eigenvalue problem, 207
t distribution, 288
t test, 39, 86, 87, 91, 186, 187
   1-sample, 88
   2-sample paired, 93
   2-sample unpaired, 91
T4253H smoother, 171
Tables, 358
Temporary files, 388
Test files, 15, 381
Text formatting commands, 350
The law of √n, 183
Three dimensional bar charts, 24, 237
Three dimensional scatter diagrams, 266
Three dimensional space curves, 264
Time at half survival, 176
Time series, 172, 174
   plot, 80
Time to half maximum response, 59, 66, 67
Training sets, 161
Trapezoidal method, 212
Trigamma function, 295, 315
Trigonometric functions, 331
Trimmed mean, 192
Trinomial confidence regions, 131, 191
Ttest (program), 376
Tukey Q test, 78, 114
Type 1 error, 78
Under-dispersion, 91
Uniform distribution, 286
Unit impulse function, 297, 315
Unit impulse wave function, 315
Unit spike function, 297, 315
Unpaired t test, 91
Usermod (program), 376
   calling special functions, 315
   calling sub-models, 310
   checking user defined models, 299
   developing models, 325
   integrating a user defined model, 310, 311, 326
   minimizing a function, 328
   plotting user defined models, 299, 309
   simulating 2D families of curves, 267
   simulating 3D families of curves, 268
   simulating parametric equations, 280, 281
   simulating projections, 265
   zeros of n functions of n variables, 310, 313
   zeros of user defined models, 310, 325
Variables
   categorical, 36, 53
   dummy, 36
   independent, 36
   qualitative, 36
   quantitative, 36
Variance, 33, 39, 40, 259
   stabilizing transformations, 292
Variance ratio test, 91, 188
Varimax rotation, 152
Vector norms, 45, 321
Venn diagrams, 224
Wave functions, 276, 297
Weibull distribution, 290
Weibull survival, 56, 176, 180, 332
Weighting, 25, 42, 67
Welch's approximate t, 288
Wilcoxon rank-sum test, 96
Wilcoxon signed-ranks test, 97, 192
Winsorized mean, 192
Word, 367
WSSQ, 25, 38-40, 261
Yates's correction to chi-square, 99
ZapfDingbats, 19, 348
Zero centered rods plot, 80
Zeros of a polynomial of degree n - 1, 200
Zeros of n functions of n variables, 200, 299, 309, 310, 313, 326
Zeros of nonlinear equations, 32, 310, 313, 325
