IFP - Mathematics and Statistics PDF

International Foundation Programme
Mathematics and Statistics
James Ward and James Abdey
FP0001
This guide was prepared for the University of London by:
J.M. Ward, The London School of Economics and Political Science

J.S. Abdey, The London School of Economics and Political Science
This is one of a series of subject guides published by the University. We regret that due to pressure
of work the authors are unable to enter into any correspondence relating to, or arising from, the
guide. If you have any comments on this subject guide, please use the online form found on the
virtual learning environment.
University of London
Publications Office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk
Published by: University of London

© University of London 2013. Reprinted with minor amends 2018
The University of London asserts copyright over all material in this subject guide except where
otherwise indicated. All rights reserved. No part of this work may be reproduced in any form, or
by any means, without permission in writing from the publisher. We make every effort to respect
copyright. If you think we have inadvertently used your copyright material, please let us know.
Contents
Introduction
  Route map to the guide
    Time management
    Recommendations for working through the units
  Overview of learning resources
    The subject guide and textbooks
    Online study resources
    Virtual Learning Environment (VLE)
    Making use of the Online Library
  Examination advice
    Calculators
Part 1 Mathematics
Introduction to Mathematics
  Syllabus
  Aims of the course
  Learning outcomes for the course (Mathematics)
  Textbook
1 Review I – A review of some basic mathematics
  1.1 Arithmetic
    1.1.1 Basic arithmetic
    1.1.2 Fractions
    1.1.3 Powers
  1.2 Algebra
    1.2.1 Algebraic expressions
    1.2.2 Equations, formulae and inequalities
  Learning outcomes
  Exercises
2 Review II – Linear equations and straight lines
  2.1 Linear equations
    2.1.1 Linear equations in one variable
    2.1.2 Linear equations in two variables
    2.1.3 Visualising the solutions of linear equations in two variables
  2.2 Straight lines
    2.2.1 Drawing straight lines given their equations
    2.2.2 The intercepts of a straight line
    2.2.3 The gradient of a straight line
    2.2.4 Finding the equation of a straight line
    2.2.5 Applications of straight lines
  2.3 Simultaneous equations
    2.3.1 Visualising the solution to a pair of simultaneous equations
    2.3.2 Solving simultaneous equations algebraically
    2.3.3 An application of simultaneous equations in economics
  Exercises
3 Review III – Quadratic equations and parabolae
  3.1 Quadratic equations
    3.1.1 Factorising
    3.1.2 Completing the square
    3.1.3 Using the completed square form to solve quadratic equations
    3.1.4 Warning!
    3.1.5 The quadratic formula
  3.2 Parabolae
    3.2.1 Sketching parabolae
    3.2.2 Where do a parabola and a straight line intersect?
    3.2.3 Where do two parabolae intersect?
  Exercises
4 Functions
  4.1 Functions
    4.1.1 What is a function?
    4.1.2 Some common functions
    4.1.3 Combinations of functions
    4.1.4 Functions in economics
  4.2 Inverse functions
    4.2.1 Finding inverse functions
    4.2.2 Logarithms
  Exercises
5 Calculus I – Differentiation
  5.1 The gradient of a curve at a point
    5.1.1 Tangents to a parabola
    5.1.2 Chords of a parabola
    5.1.3 Tangents to other curves
  5.2 What is differentiation?
    5.2.1 Standard derivatives
    5.2.2 Two rules of differentiation
    5.2.3 Some general points on what we have seen so far
  Exercises
6 Calculus II – More differentiation
  6.1 Three more rules of differentiation
    6.1.1 The product rule
    6.1.2 The quotient rule
    6.1.3 The chain rule
    6.1.4 Further applications of the chain rule
    6.1.5 Using these rules of differentiation together
  6.2 Approximating functions
  Exercises
7 Calculus III – Optimisation
  7.1 What derivatives tell us about functions
    7.1.1 When is a function increasing or decreasing?
    7.1.2 Stationary points
    7.1.3 Second derivatives
    7.1.4 What second derivatives tell us about a function
    7.1.5 A note on the ‘large x’ behaviour of functions
  7.2 Optimisation and curve sketching
    7.2.1 Steps 1 and 2: Finding and classifying stationary points
    7.2.2 Curve sketching
    7.2.3 Step 3: Looking for global maxima and global minima
    7.2.4 An economic application: Profit maximisation
  Exercises
8 Calculus IV – Integration
  8.1 Indefinite integrals
    8.1.1 Finding simple indefinite integrals
    8.1.2 The basic rules of integration
  8.2 Definite integrals and areas
  Exercises
9 Financial Mathematics I – Compound interest and its uses
  9.1 Interest
    9.1.1 A formula for balances under annually compounded interest
    9.1.2 Other compounding intervals
    9.1.3 Continuous compounding
  9.2 Problems involving interest rates
    9.2.1 How much do I need to invest to get...?
    9.2.2 What interest rate do I need to get...?
    9.2.3 How long do I need to invest to get...?
    9.2.4 Annual percentage rates
  9.3 Depreciation
  Exercises
10 Financial Mathematics II – Applications of series
  10.1 Sequences and series
    10.1.1 Arithmetic sequences and series
    10.1.2 Geometric sequences and series
  10.2 Financial applications of geometric series
    10.2.1 Regular saving plans
    10.2.2 Annuities
    10.2.3 Future and present values
  Exercises
Part 2 Statistics
  Syllabus
  Aims of the course
  Learning outcomes for the course (Statistics)
  Textbook
11 Data exploration I – The nature of statistics
  11.1 Introduction
    11.1.1 Data collection
    11.1.2 Data classification
    11.1.3 Data summary
    11.1.4 Data display
    11.1.5 Analysis
    11.1.6 Interpretation
    11.1.7 Uncertainty
    11.1.8 Descriptive and inferential statistics
  11.2 Types of data
  11.3 The role of statistics in the research process
  11.4 Summary
  11.5 Key terms and concepts
  Exercises
12 Data exploration II – Data visualisation
  12.1 Grouping data
  12.2 Histograms
  12.3 Pie charts and bar graphs
  12.4 Line graphs
  12.5 Scatter plots
  12.6 Summary
  Exercises
13 Data exploration III – Descriptive statistics: measures of location, dispersion and skewness
  13.1 Summation notation
  13.2 Measures of location
    13.2.1 Which ‘average’ should be used?
    13.2.2 Frequency tables
  13.3 Measures of dispersion
    13.3.1 Variance and standard deviation
    13.3.2 Variance using frequency distributions
  13.4 Skewness
  13.5 Summary
  Exercises
14 Probability I – Introduction to probability theory
  14.1 Probability theory
    14.1.1 Assigning probabilities
    14.1.2 The classical method
    14.1.3 The relative frequency approach
    14.1.4 Subjective probabilities
  14.2 Terminology
  14.3 Sets
  14.4 Independence
  14.5 Complementary events
  14.6 Additive laws
  14.7 Multiplicative laws
  14.8 Bayes’ theorem
    14.8.1 Version 1
    14.8.2 Version 2
    14.8.3 Version 3
  14.9 Summary – a listing of probability results
  14.10 Summary
  14.11 Key terms and concepts
  Exercises
15 Probability II – Probability distributions
  15.1 Random variables
  15.2 Discrete random variables
  15.3 Continuous random variables
  15.4 Expectation of a discrete random variable
  15.5 Functions of a random variable
  15.6 Variance of a discrete random variable
  15.7 Discrete uniform distribution
  15.8 Bernoulli distribution
  15.9 Binomial distribution
  15.10 Poisson distribution
  15.11 A word on calculators
  15.12 Summary
  15.13 Key terms and concepts
  Exercises
16 Probability III – The normal distribution and sampling distributions
  16.1 The normal distribution
    16.1.1 Probabilities for any normal distribution
    16.1.2 Some probabilities around the mean
  16.2 Sampling distributions
    16.2.1 Estimation
    16.2.2 Sampling from a normal population
  16.3 Central limit theorem
  16.4 Summary
  Exercises
17 Sampling and experimentation I – Sampling techniques and contact methods
  17.1 Sampling
    17.1.1 Non-probability sampling techniques
    17.1.2 Probability sampling techniques
    17.1.3 Contact methods
  17.2 Summary
  Exercises
18 Sampling and experimentation II – Bias and the design of experiments
  18.1 Introduction
  18.2 Types of error
  18.3 Bias
  18.4 Adjusting for non-response
  18.5 Experimental design in the social and medical sciences
    18.5.1 Experimental versus observational studies
    18.5.2 Randomised controlled clinical trials
    18.5.3 Randomised blocks
    18.5.4 Multifactorial experimental designs
    18.5.5 Quasi-experiments
    18.5.6 Cluster randomised trials
    18.5.7 Analysis and interpretation
  18.6 Summary
  Exercises
19 Fundamentals of regression I – Correlation and the simple linear regression model
  19.1 Introduction
  19.2 Correlation
  19.3 Simple linear regression
  19.4 Parameter estimation
  19.5 Prediction
  19.6 Summary
  Exercises
20 Fundamentals of regression II – Interpretation of computer output and assessing model adequacy
  20.1 Introduction
  20.2 Analysis of variance
  20.3 Coefficient of determination, R^2
  20.4 Computer output
  20.5 Several explanatory variables
  20.6 Summary
  Exercises
Part 3 Appendices
A A sample examination paper
B Solutions to the sample examination paper
  Section A: Mathematics
  Section B: Statistics
C Table of cumulative normal probabilities
Introduction
Welcome to the world of Mathematics and Statistics! These are disciplines which are
widely applied in areas such as finance, business, management, economics and other
fields in the social sciences. The following units will provide you with the opportunity to
grasp the fundamentals of these subjects and will equip you with some of the vital
quantitative skills and powers of analysis which are highly sought-after by employers in
many sectors.
As Mathematics and Statistics has so many applications, it should not be surprising
that it forms the compulsory component of the International Foundation Programme.
The analytical skills which you will develop on this course will therefore serve you well
in both your future studies and beyond in the real world of work. The material in this
course is necessary as preparation for other courses you may study later on as part of a
degree programme or diploma; indeed, in many cases a course in Mathematics or
Statistics is a compulsory component of University of London degrees.
Route map to the guide
This subject guide provides you with a framework for covering the syllabus of the
Mathematics and Statistics course in the International Foundation Programme and
directs you to additional resources such as readings and the virtual learning
environment (VLE).
The following 20 units will introduce you to these disciplines and equip you with the
necessary quantitative skills to assist you in further programmes of study. Given the
cumulative nature of Mathematics and Statistics, the units are not a series of
self-contained topics; rather, they build on each other sequentially. As such, you are
strongly advised to follow the subject guide in unit order. There is little point in rushing
past material which you have only partially understood in order to reach the final unit.
Once you have completed your work on all of the units, you will be ready for
examination revision. A good place to start is the sample examination paper which you
will find at the end of the subject guide.
Time management
About one-third of your private study time should be spent reading and the other
two-thirds doing problems. (Note the emphasis on practising problems!)
To help your time management, each unit of this course should take a week to study
and so you should be spending 10 weeks on Mathematics and 10 weeks on Statistics.
Recommendations for working through the units

The following procedure is recommended for each unit.
i. Read the overview and the aims of the unit.

ii. Now work through each section of the unit making sure you can understand the
examples given and try the activities as you encounter them. In parallel, watch the
accompanying video tutorials for each section on the VLE.
iii. At the end of the unit, review the intended learning outcomes carefully, almost as
a checklist. Do you think you have achieved these targets?
iv. Attempt the unit’s self-test quizzes on the VLE. You can treat these as additional
activities. This time, though, you will have to think a little about which part of
the new material you have learnt is appropriate to each question.
v. Attempt the exercises given at the end of the unit. The solutions can be found on
the VLE, but you should only look at these after attempting them yourself!
vi. If you have problems at this point, go back to the subject guide and work through
the area you find difficult again. Don’t worry — you will improve your
understanding to the point where you can work confidently through the problems.
The last few steps are most important. It is easy to think that you have understood the
text after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take most of your study time (refer to the
‘Time management’ section above). Note that we have given worked examples and
activities to cover each substantive topic in the subject guide. The essential reading
examples are added for further consolidation of the whole unit and also to help you
work out exactly what the questions are about! One of the problems students
sometimes have in an examination is that they waste time trying to understand which
part of the syllabus particular questions relate to. These final questions, together with
the further explanations on the VLE, aim to help with this before you tackle the sample
examination questions at the end of each unit.
Try to be disciplined about this: don’t look up the answers until you have done your
best. Some of the ideas you encounter may seem unfamiliar at first, but your attempts
at the questions, however dissatisfied you feel with them, will help you understand the
material far better than reading and re-reading the prepared answers — honest!
So to conclude, perseverance with problem-solving is your passport to a strong
examination performance. Attempting (ideally successfully!) all the cited exercises is of
paramount importance.
Overview of learning resources
The subject guide and textbooks

This subject guide for Mathematics and Statistics has been structured so that it is
tailored to the specific requirements of the examinable material. It is ‘written to the
course’, unlike textbooks which may cover additional material which will not be
examinable or may not cover some material that is! Therefore the subject guide should
act as your principal resource.
However, a textbook may give an alternative explanation of a topic (which is useful if
you have difficulty following something in the subject guide) and so you may want to
consult one for further clarification. Additionally, a textbook will contain further
examples and exercises which can be used to check and consolidate your understanding.
For this course, a useful starting point for background reading is:
Swift, L. and S. Piff Quantitative methods for business, management and finance.
(Palgrave, 2014) fourth edition [ISBN 9781137376558].
However, many books are available covering
the material frequently found in mathematics and statistics courses like this one and so,
if you need a textbook for background reading, you should find one that is appropriate
to your level and tastes.
Online study resources

In addition to the subject guide and the Essential reading, it is crucial that you take
advantage of the study resources that are available online for this course, including the
VLE and the Online Library.
You can access the VLE, the Online Library and your University of London email
account via the Student Portal at http://my.londoninternational.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged in to the Student Portal in order to register. As soon as
you registered, you will automatically have been granted access to the VLE, Online
Library and your fully functional University of London email account.
If you have forgotten these login details, please click on the ‘Forgotten your password’
link on the login page.
Virtual Learning Environment (VLE)

The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. In addition
to making printed materials more accessible, the VLE provides an open space for you to
discuss interests and to seek support from other students, working collaboratively to
solve problems and discuss subject material. In a few cases, such discussions are driven
and moderated by an academic who offers a form of feedback on all discussions. In other
cases, video material, such as audio-visual tutorials, is available. These will typically
focus on taking you through difficult concepts in the subject guide. For quantitative
courses, such as Mathematics and Statistics, fully worked-through solutions of
practice examination questions are available. For some qualitative courses, academic
interviews and debates will provide you with advice on approaching the subject and
examination questions, and will show you how to build an argument effectively.
Past examination papers and Examiners’ commentaries are available for download and
these provide advice on how each examination question might best be answered.
Self-testing activities allow you to test your knowledge and recall of the academic
content of various courses. Finally, a section of the VLE has been dedicated to
providing you with expert advice on practical study skills such as preparing for
examinations and developing digital literacy skills.
Making use of the Online Library

The Online Library contains a huge array of journal articles and other resources to help
you read widely and extensively.
Essential reading journal articles listed on a number of reading lists are available to
download from the Online Library.
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed on the reading list, try:
1. removing any punctuation from the title, such as single quotation marks, question
marks and colons, and/or
2. putting quotation marks around the title, for example “Why the banking system
should be regulated”.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login: http://tinyurl.com/ollathens
Examination advice
Important: the information and advice given in the following section are based on the
examination structure used at the time this subject guide was written. Please note that
subject guides may be used for several years. Because of this, we strongly advise you to
check both the current Regulations for relevant information about the examination,
and the current Examiners’ commentaries, where you should be advised of any
forthcoming changes. You should also carefully check the rubric/instructions on the
paper you actually sit and follow those instructions.
The examination is by a two-hour, unseen, written paper. No books may be taken into
the examination, but you will be provided with extracts of statistical tables (as
reproduced in this subject guide). A calculator may be used when answering questions
on this paper (see below), and it must comply in all respects with the specification
given in the General Regulations.
The examination comprises two sections, each containing three compulsory questions.
Section A covers the mathematics part of the course counting for 50% of the marks, and
Section B covers the statistics part of the course for the remaining 50% of the marks.
You are required to pass both Sections A and B to pass the examination.
In each section, the first question contains four short questions worth 5 marks each,
followed by two longer questions worth 15 marks each. Since the examination will seek
to assess a broad cross-section of the syllabus, we strongly advise you to study the
whole syllabus. A sample examination paper is provided at the end of this subject guide
along with a commentary providing extensive advice on how to answer each question.
Remember, it is important to check the VLE for:
Up-to-date information on examination and assessment arrangements for this course.
Where available, past examination papers and Examiners’ commentaries for the course
which give advice on how each question might best be answered.
Calculators
You will need to provide yourself with a basic calculator. It should not be
programmable, because such machines are not allowed in the examination by the
University. The most important thing is that you should accustom yourself to using
your chosen calculator and feel comfortable with it. Your calculator must comply in all
respects with the specification given in the General Regulations.
Part 1
Mathematics
Introduction to Mathematics
Syllabus
This half of the course introduces some of the basic ideas and methods of Mathematics
with an emphasis on their application. The Mathematics part of this course has the
following syllabus.
Arithmetic and algebra: A review of arithmetic (including the use of fractions
and decimals) and the manipulation of algebraic expressions (including the use of
brackets and the power laws). Solving linear equations and the relationship
between linear expressions and straight lines (including the solution of
simultaneous linear equations). Solving quadratic equations and the relationship
between quadratic expressions and parabolae.
Functions: An introduction to functions. Some common functions (including
polynomials, exponentials, logarithms and trigonometric functions). The existence
of inverse functions and how to find them. The laws of logarithms and their uses.
Calculus: The meaning of the derivative and how to find it (including the
product, quotient and chain rules). Using derivatives to find approximations and
solve simple optimisation problems with economic applications. Curve sketching.
Integration of simple functions and using integrals to find areas.
Financial mathematics: Compound interest over different compounding
intervals. Arithmetic and geometric sequences. The sum of arithmetic and
geometric series. Investment schemes and some ways of assessing the value of an
investment.
Aims of the course

The aims of the Mathematics part of this course are to provide:
a grounding in arithmetic and algebra;
an overview of functions and the fundamentals of calculus;
an introduction to financial mathematics.

Throughout, the treatment is at an elementary mathematical level but, as you progress
through this part of the course, you should develop some quite sophisticated
mathematical skills.
Learning outcomes for the course (Mathematics)

At the end of the Mathematics part of the course, you should be able to:
manipulate algebraic expressions;
graph, differentiate and integrate simple functions;
calculate basic quantities in financial mathematics.
Textbook
As previously mentioned in the main introduction, this subject guide has been designed
to act as your principal resource. The textbook
+ Swift, L., and S. Piff Quantitative methods for business, management and finance.
may be useful as ‘background reading’ but it is not essential. However, you might
benefit from reading parts of it if you find any of the material difficult to follow at first.
Unit 1: Review I
A review of some basic mathematics
Overview
In this unit we revise some material on arithmetic and algebra which you should have
encountered before. Starting with arithmetic, this will involve revising the basic
mathematical operations and how they can be combined with and without the use of
brackets, how we can manipulate fractions and the use of powers. We then look at some
basic algebra and see how to use and manipulate algebraic expressions.
Aims
The aims of this unit are as follows.
To revise the basics of arithmetic, including the use of fractions and powers.
To revise the most basic ideas behind algebra.

Specific learning outcomes can be found near the end of this unit.
1.1 Arithmetic
In this section we revise some material which could be called ‘arithmetic’. The idea
behind this revision is to refresh our memories about how things like brackets, fractions
and powers work so that our revision of ‘algebra’ in the next section will, hopefully, be
easier.
1.1.1 Basic arithmetic

In mathematics we use four basic mathematical operations:
addition denoted by ‘+’ gives us ‘sums’, e.g. 6 + 3 = 9;
subtraction denoted by ‘−’ gives us ‘differences’, e.g. 6 − 3 = 3;
multiplication denoted by ‘×’ or ‘·’ gives us ‘products’, e.g. 6 × 3 = 18 or
6 · 3 = 18;
division denoted by ‘÷’ or a ‘horizontal line’ gives us ‘quotients’, e.g. 6 ÷ 2 = 3 or
$\frac{6}{2} = 3$.
In particular, notice that there are two common notations for multiplication and
division. For multiplication, the reason for this is that a handwritten ‘×’ can be
confused with a handwritten ‘x’ whereas, for division, the reason is that writing
expressions that involve division (i.e. ‘÷’) as fractions enables us to manipulate them
more easily using the laws of fractions.
Combinations of operations
Often, different mathematical operations will occur in the same expression. For
example, we might be asked to work out the values of the expressions
1. 22 − 7 + 12 − 26 + 1,
2. 125 ÷ 25 × 2 × 3 ÷ 15,
3. 22 − 20 × 3 ÷ 4 − 5.
In such cases, we have the following rules.
1. If only addition and subtraction are involved: We work from left to right to get
\[ \underbrace{22 - 7} + 12 - 26 + 1 = \underbrace{15 + 12} - 26 + 1 = \underbrace{27 - 26} + 1 = \underbrace{1 + 1} = 2. \]
2. If only multiplication and division are involved: We work from left to right to get
\[ \underbrace{125 \div 25} \times 2 \times 3 \div 15 = \underbrace{5 \times 2} \times 3 \div 15 = \underbrace{10 \times 3} \div 15 = \underbrace{30 \div 15} = 2. \]
3. When addition/subtraction and multiplication/division are involved: We work out
the multiplications and divisions first (working left to right as necessary) and then
we do the additions and subtractions (working left to right as necessary) to get
\[ 22 - \underbrace{20 \times 3} \div 4 - 5 = 22 - \underbrace{60 \div 4} - 5 = \underbrace{22 - 15} - 5 = \underbrace{7 - 5} = 2. \]
Brackets I: Evaluating expressions that involve brackets
If an expression involves brackets, then the operations within the brackets must be
performed first. As such, brackets can be used to change the order in which operations
are performed. For example, we might be asked to work out the values of the expressions
1. 9 − (4 + 3) as opposed to 9 − 4 + 3,
2. 6 ÷ (2 × 3) as opposed to 6 ÷ 2 × 3,
3. (12 × 3 − 8) × 2 as opposed to 12 × 3 − 8 × 2.
In such cases, we work out the expression in brackets first, i.e. we get
1. working out the expression in brackets first we get
\[ 9 - (\underbrace{4 + 3}) = 9 - 7 = 2, \]
as opposed to
\[ \underbrace{9 - 4} + 3 = 5 + 3 = 8, \]
where we work from left to right.
2. working out the expression in brackets first we get
\[ 6 \div (\underbrace{2 \times 3}) = 6 \div 6 = 1, \]
as opposed to
\[ \underbrace{6 \div 2} \times 3 = 3 \times 3 = 9, \]
where we work from left to right.
3. working out the expression in brackets first, proceeding to the rules above as
necessary, we have
\[ (\underbrace{12 \times 3} - 8) \times 2 = (\underbrace{36 - 8}) \times 2 = 28 \times 2 = 56, \]
as opposed to
\[ \underbrace{12 \times 3} - \underbrace{8 \times 2} = 36 - 16 = 20, \]
where we multiply first and then subtract.
What if we have two or more sets of brackets? Well, if they are not ‘nested’, for example
if we have
(12 × 3 − 8) × (24 − 14),
then we need to work out what is in each of the brackets first, proceeding according to
the rules above, i.e.
\[ (\underbrace{12 \times 3} - 8) \times (\underbrace{24 - 14}) = (\underbrace{36 - 8}) \times 10 = 28 \times 10 = 280. \]
And, if the brackets are ‘nested’, for example
6 + (9 − (4 + 3)),
then we start with the innermost set of brackets and work ‘outwards’, i.e.
\[ 6 + (9 - (\underbrace{4 + 3})) = 6 + (\underbrace{9 - 7}) = 6 + 2 = 8. \]
These rules allow you to work out the values of simple mathematical expressions using
brackets. In a moment we shall see another way of dealing with brackets which will be
more useful to us in this course.
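As an optional check, the rules above can be tried out on a computer. The following is a
minimal sketch in Python (our choice for illustration; this course does not require any
programming), which applies the same ordering rules to `+`, `-`, `*` and `/`. The
variable names are purely illustrative.

```python
# Expressions from this section, typed as they are written.
# Note that '/' in Python always gives a decimal (float) answer.
only_add_sub = 22 - 7 + 12 - 26 + 1     # worked from left to right
only_mul_div = 125 / 25 * 2 * 3 / 15    # worked from left to right
mixed = 22 - 20 * 3 / 4 - 5             # '*' and '/' before '+' and '-'

# Brackets change the order in which the operations are performed.
bracketed = 9 - (4 + 3)
unbracketed = 9 - 4 + 3
two_brackets = (12 * 3 - 8) * (24 - 14)
nested = 6 + (9 - (4 + 3))

print(only_add_sub, only_mul_div, mixed)
print(bracketed, unbracketed, two_brackets, nested)
```

Comparing `bracketed` with `unbracketed` shows how a single pair of brackets changes the
value of an otherwise identical expression.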
Negative numbers
Consider the following three expressions and their values.
1. 6 − 3 = +3,
2. 6 − 6 = 0,
3. 6 − 9 = −3.
In this case, we can see that subtracting larger and larger numbers from six gives us a
positive answer, zero and a negative answer respectively. For simplicity, we usually omit
the ‘+’ sign and write ‘+3’, say, as 3.
When we have expressions involving negative numbers, we have the following handy
rules.
1. Adding a negative number: This has the same effect as subtracting the
corresponding positive number, e.g.
5 + (−3) = 5 − (+3) = 5 − 3 = 2,
and
−5 + (−3) = −5 − (+3) = −5 − 3 = −8.
2. Subtracting a negative number: This has the same effect as adding the
corresponding positive number, e.g.
5 − (−3) = 5 + (+3) = 5 + 3 = 8,
and
−5 − (−3) = −5 + (+3) = −5 + 3 = −2.
3. Multiplying a positive number by a negative number: This gives us a negative
number, e.g.
(+5) × (−3) = −(5 × 3) = −15,
and
(−5) × (+3) = −(5 × 3) = −15.
This is normally remembered as ‘positive times negative is negative’.
4. Multiplying a negative number by a negative number: This gives us a positive
number, e.g.
(−5) × (−3) = +(5 × 3) = +15 = 15.
This is normally remembered as ‘negative times negative is positive’.
5. Dividing a positive number by a negative number (or vice versa): This gives us a
negative number, e.g.
(+6) ÷ (−3) = −(6 ÷ 3) = −2,
and
(−6) ÷ (+3) = −(6 ÷ 3) = −2.
This is normally remembered as ‘positive divided by negative is negative’ (or vice
versa).
6. Dividing a negative number by a negative number: This gives us a positive number,
e.g.
(−6) ÷ (−3) = +(6 ÷ 3) = +2 = 2.
This is normally remembered as ‘negative divided by negative is positive’.
Indeed, notice the similarity between (3) and (5) which can be remembered as
‘multiplying (or dividing) a positive and a negative yields a negative’ and (4) and (6)
which can be remembered as ‘multiplying (or dividing) a negative and a negative yields
a positive’.
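The six rules can also be checked mechanically. The short Python sketch below (an
illustration only, with made-up names) pairs each example above with the value stated
in the text and confirms that they agree.

```python
# Each pair is (expression from the rules above, value stated in the text).
sign_rule_examples = [
    (5 + (-3), 2),        # rule 1: adding a negative number
    (-5 + (-3), -8),
    (5 - (-3), 8),        # rule 2: subtracting a negative number
    (-5 - (-3), -2),
    (5 * (-3), -15),      # rule 3: positive times negative
    ((-5) * 3, -15),
    ((-5) * (-3), 15),    # rule 4: negative times negative
    (6 / (-3), -2),       # rule 5: positive divided by negative (or vice versa)
    ((-6) / 3, -2),
    ((-6) / (-3), 2),     # rule 6: negative divided by negative
]

all_agree = all(value == expected for value, expected in sign_rule_examples)
print(all_agree)
```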
Brackets II: Removing brackets from expressions
A more useful way of thinking about brackets involves being able to ‘remove’ the
brackets from an expression. For example, consider the expression
3 + 2 × (9 − 4).
Using the rules above we could work this out by thinking of it as
\[ 3 + 2 \times (\underbrace{9 - 4}) = 3 + \underbrace{2 \times 5} = 3 + 10 = 13. \]
Alternatively, we can ‘remove’ the brackets by thinking of the ‘2’ in ‘2 × (9 − 4)’ as
multiplying everything in the bracket, i.e.
2 × (9 − 4) = (2 × 9) − (2 × 4).
Using this method we get
\[ 3 + 2 \times (9 - 4) = 3 + (\underbrace{(2 \times 9)} - \underbrace{(2 \times 4)}) = 3 + (\underbrace{18 - 8}) = 3 + 10 = 13, \]
which is the same answer as before.
Activity 1.1 Show that if we worked out 3 + (9 − 4) × 2, we would also get 13.
What if we had to work out 3 − (9 − 4)? We adopt the convention that a minus sign in
front of a bracket is the same as adding something that has been multiplied by −1.
Using this, and what we saw above, gives us
\[ 3 - (9 - 4) = 3 + (-1) \times (9 - 4) = 3 + ((-1 \times 9) - (-1 \times 4)) = 3 + (-9 - (-4)) = 3 + (-9 + 4) = 3 + (-5) = -2. \]
Of course, this is what we should expect as
\[ 3 - (\underbrace{9 - 4}) = 3 - 5 = -2. \]
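To see the two methods side by side, here is a small Python sketch. The helper
`multiply_through` is a hypothetical name introduced only for this illustration: it
removes the brackets from k × (a − b) by multiplying each term in the bracket by k.

```python
def multiply_through(k, a, b):
    """Work out k * (a - b) by multiplying each term in the bracket by k."""
    return k * a - k * b

# 3 + 2 x (9 - 4): brackets first, then with the brackets removed.
brackets_first = 3 + 2 * (9 - 4)
brackets_removed = 3 + multiply_through(2, 9, 4)

# 3 - (9 - 4): a minus sign in front of a bracket acts like multiplying by -1.
minus_first = 3 - (9 - 4)
minus_removed = 3 + multiply_through(-1, 9, 4)

print(brackets_first, brackets_removed)
print(minus_first, minus_removed)
```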
Absolute values
The magnitude (or absolute value) of a number is found by ignoring the minus sign
(if there is one). For example, the magnitude of 6, written |6|, is 6 and the magnitude of
−6, written |−6|, is also 6, i.e. we have
|6| = 6 and |−6| = 6.
In a way, the magnitude operation acts like a bracket as we need to evaluate the
magnitude of the number inside it before we use it in calculations, e.g.
4 − |2 − 3| = 4 − 1 = 3 as |2 − 3| = |−1| = 1, and
|4 − 2| − 3 = 2 − 3 = −1 as |4 − 2| = |2| = 2.
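Most programming languages provide the magnitude as a built-in operation. As a brief
illustration (assuming Python, where it is called `abs`), the two calculations above
become:

```python
# The magnitude is evaluated first, exactly like a bracket.
first = 4 - abs(2 - 3)     # 4 - |2 - 3|
second = abs(4 - 2) - 3    # |4 - 2| - 3

print(abs(6), abs(-6), first, second)
```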
Inequalities
We use the symbols ‘<’ and ‘>’ to show that one number is ‘less than’ or ‘greater than’
another number respectively. So, for example, 2 < 3 and 5 > 1. Zero is less than any
positive number and greater than any negative number, e.g. 0 < 5 and 0 > −5. As such,
any negative number is less than any positive number, e.g. −3 < 2. Negative numbers
are larger when they have smaller magnitudes (i.e. when they are closer to zero), e.g.
−3 < −2 and −1 > −5. As such, we can say that smaller negative numbers (like −100
compared to −1) have larger absolute values (like 100 compared to 1).
1.1.2 Fractions
A fraction such as $\frac{3}{2}$ is, using our ‘horizontal line’ notation for division, the
same as dividing the number above the line (i.e. 3) by the number below the line (i.e. 2).
We call the number above the line the numerator and the number below the line the
denominator. If we have two fractions, say
\[ \frac{3}{5} \quad \text{and} \quad \frac{4}{2}, \]
the number we get by multiplying their denominators together is called the common
denominator of these fractions, and this will be 5 × 2 = 10 in this case.
Manipulating fractions
Sometimes we want to manipulate fractions in order to simplify them or to put them in
a form where their denominator is the common denominator. The two basic procedures
we use to do these two manipulations are as follows.
To simplify a fraction we want to write it in lowest terms (that is, so that the
numerator and denominator have no common divisors other than 1), e.g. $\frac{6}{10}$
can be written as
\[ \frac{6}{10} = \frac{2 \times 3}{2 \times 5} = \frac{3}{5}, \]
by dividing through on top and bottom by the common factor of 2.
Conversely, to write a fraction so that its denominator is a common denominator,
e.g. to write $\frac{3}{5}$ so that its denominator is, as above, the common denominator
of 10, we note that it can be written as
\[ \frac{3}{5} = \frac{2 \times 3}{2 \times 5} = \frac{6}{10}, \]
by multiplying top and bottom by 2.

This second technique is especially useful when we add and subtract fractions as we
shall now see.
Adding and subtracting fractions
To add or subtract fractions, we first put them over a common denominator, e.g.
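\[ \frac{4}{5} + \frac{2}{3} = \frac{4 \times 3}{5 \times 3} + \frac{2 \times 5}{3 \times 5} = \frac{12}{15} + \frac{10}{15} = \frac{12 + 10}{15} = \frac{22}{15}, \]
and
\[ \frac{4}{5} - \frac{2}{3} = \frac{4 \times 3}{5 \times 3} - \frac{2 \times 5}{3 \times 5} = \frac{12}{15} - \frac{10}{15} = \frac{12 - 10}{15} = \frac{2}{15}. \]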
Multiplying fractions
To multiply fractions, we just multiply the numerators and denominators together, e.g.
\[ \frac{4}{5} \times \frac{2}{3} = \frac{4 \times 2}{5 \times 3} = \frac{8}{15}. \]
Reciprocals
The reciprocal of a fraction is what we get when we swap the numerator and
denominator around, e.g. the reciprocal of $\frac{3}{5}$ is $\frac{5}{3}$. The reciprocal
is useful when we come to divide fractions as we shall now see.
Dividing fractions
To divide fractions, we multiply the first fraction by the reciprocal of the second, e.g. if
we want to evaluate
\[ \frac{4}{5} \div \frac{2}{3}, \]
the rule tells us that this is the same as multiplying $\frac{4}{5}$ by the reciprocal of
$\frac{2}{3}$, which is $\frac{3}{2}$, and so we have
\[ \frac{4}{5} \div \frac{2}{3} = \frac{4}{5} \times \frac{3}{2}. \]
This can now be worked out using the multiplication rule, i.e.
\[ \frac{4}{5} \div \frac{2}{3} = \frac{4}{5} \times \frac{3}{2} = \frac{4 \times 3}{5 \times 2} = \frac{12}{10}. \]
Of course, we can simplify this by noting that the numerator and denominator have a
common factor of 2, i.e. the answer is $\frac{6}{5}$ in lowest terms.
It is, perhaps, also interesting to note that the reciprocal of a fraction is just one
divided by that fraction, e.g. as
\[ 1 \div \frac{3}{2} = 1 \times \frac{2}{3} = \frac{2}{3}, \]
we can see that the reciprocal of $\frac{3}{2}$, i.e. $\frac{2}{3}$, is just one divided by
$\frac{3}{2}$.
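If you would like to check fraction calculations such as these, exact fraction arithmetic
is available in many programming languages. The sketch below assumes Python and its
standard `fractions` module; `Fraction(4, 5)` stands for the fraction with numerator 4
and denominator 5, and results are automatically written in lowest terms.

```python
from fractions import Fraction

a = Fraction(4, 5)
b = Fraction(2, 3)

total = a + b            # put over the common denominator 15, then add
difference = a - b
product = a * b          # multiply numerators and denominators
reciprocal_of_b = 1 / b  # one divided by the fraction
quotient = a / b         # same as a multiplied by the reciprocal of b

print(total, difference, product, reciprocal_of_b, quotient)
print(Fraction(6, 10) == Fraction(3, 5))  # lowest terms are used automatically
```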
Improper and proper fractions
An improper fraction is one where the numerator is greater in magnitude than the
denominator and a proper fraction is one where the numerator is less in magnitude
than the denominator, e.g. $\frac{22}{5}$ is an improper fraction and $\frac{4}{5}$ is a
proper fraction.
Sometimes it is convenient to be able to write an improper fraction as a whole number
plus a proper fraction, e.g. we can write
\[ \frac{22}{5} = \frac{20 + 2}{5} = \frac{20}{5} + \frac{2}{5} = 4 + \frac{2}{5}, \]
as 5 goes into 20 four times. This can be written as $4\frac{2}{5}$ and we read it as ‘four
and two fifths’ to indicate that $\frac{22}{5}$ is the same as four ‘wholes’ and two fifths
of a ‘whole’.
However, in this course, we will usually not use this way of writing fractions as, using
our convention of writing $4 \times \frac{2}{5}$ as $4 \cdot \frac{2}{5}$, we can easily get
confused between ‘four and two fifths’ and ‘four times two fifths’. As such, when the need
arises, we will normally stick to improper fractions.
Decimals
Often, you will see fractions written as decimals and vice versa, e.g. the fraction
$\frac{1}{4}$ is exactly the same as the decimal 0.25. But, be aware that some fractions do
not have a nice finite decimal expansion, e.g.
\[ \frac{1}{3} \text{ is the decimal } 0.333333\ldots, \]
i.e. there is an infinite number of threes after the decimal point. The problem with this
is that, in such cases, using decimals instead of fractions can lead to rounding errors, e.g.
\[ 3 \times \frac{1}{3} = 1, \]
exactly. But, just keeping the first four threes of the decimal expansion for $\frac{1}{3}$,
i.e. rounding $\frac{1}{3}$ to four decimal places, written 4dp, we have 0.3333 and this
gives us
\[ 3 \times \frac{1}{3} \simeq 3 \times 0.3333 = 0.9999, \]
where ‘$\simeq$’ means ‘approximately equal to’. That is, using the decimal rounded to four
decimal places gives us an answer which is not exactly one, i.e. there is a rounding error
in our calculation, and this is why we generally use fractions instead of decimals.
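The size of this rounding error can be worked out exactly. The following Python sketch
(an illustration under our own naming) represents the rounded decimal 0.3333 as the
fraction 3333/10000, so that no further rounding creeps into the comparison.

```python
from fractions import Fraction

exact_third = Fraction(1, 3)
rounded_third = Fraction(3333, 10000)   # 1/3 rounded to 4dp, i.e. 0.3333

exact_answer = 3 * exact_third          # exactly 1
rounded_answer = 3 * rounded_third      # 0.9999 as a fraction
rounding_error = exact_answer - rounded_answer

print(exact_answer, rounded_answer, rounding_error)
```

The error here is one ten-thousandth, which is small but not zero; in a longer
calculation such errors can accumulate.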
Percentages
The percentage sign, i.e. ‘%’, means ‘divide by 100’, e.g. 20% is the same as
$\frac{20}{100}$ as a fraction, or 0.2 as a decimal. As such, 20% of 150 is
\[ 150 \times \frac{20}{100} = \frac{3{,}000}{100} = 30. \]
Knowing this, we can see what it means to increase 150 by 20% or decrease 150 by 20%,
i.e.
to increase 150 by 20%, we get
150 + 30 = 180,
as 30 is 20% of 150. Notice that an increase by 20% can also be seen as 120% of the
original, i.e.
\[ 150 \times \frac{120}{100} = \frac{18{,}000}{100} = 180, \]
as before.
to decrease 150 by 20%, we get
150 − 30 = 120,
as 30 is 20% of 150. Notice that a decrease by 20% can also be seen as 80% of the
original, i.e.
\[ 150 \times \frac{80}{100} = \frac{12{,}000}{100} = 120, \]
as before.
These ideas will be particularly useful when we come to consider compound interest in
Unit 9.
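The three percentage calculations above follow the same pattern, which can be captured
in a few lines of code. This is only a sketch, assuming Python; the function names are
hypothetical and chosen for readability.

```python
def percent_of(amount, percent):
    """Return 'percent' per cent of 'amount'."""
    return amount * percent / 100

def increase_by(amount, percent):
    """Increase 'amount' by 'percent' per cent, i.e. take (100 + percent)% of it."""
    return amount * (100 + percent) / 100

def decrease_by(amount, percent):
    """Decrease 'amount' by 'percent' per cent, i.e. take (100 - percent)% of it."""
    return amount * (100 - percent) / 100

print(percent_of(150, 20), increase_by(150, 20), decrease_by(150, 20))
```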
1.1.3 Powers
Another operation that you will have come across before is the idea of ‘raising a number
to a certain power’. The number which represents the power can also be called the
exponent and the number which is being raised to that power is called the base. For
example, we could have $4^2$, $4^{-2}$ or $4^{\frac{1}{2}}$ and, in each case, ‘4’ is the
base and the other number, i.e. ‘2’, ‘−2’ or ‘$\frac{1}{2}$’ respectively, is the exponent
or power. We often refer to expressions of this form as ‘powers’.
Positive integer powers
The simplest powers to work out are those where the power is a positive integer such as
1, 2, 3, . . . . In such cases, the power just means ‘multiply the base by itself that many
times’, e.g.
\[ 4^1 = 4, \quad 4^2 = 4 \times 4 = 16, \quad 4^3 = 4 \times 4 \times 4 = 64, \quad \ldots. \]
One application of this is standard index form (or scientific notation) where we are
able to write large numbers in terms of powers of 10, e.g. we can write three million as
\[ 3{,}000{,}000 = 3 \times 1{,}000{,}000 = 3 \times 10^6, \]
as 1,000,000 is the same as $10^6$.
Powers and other operations
In terms of combinations of operations, evaluating the effect of a power comes before
multiplying and dividing, e.g. we can see that
\[ 2 \times \underbrace{4^2} + 3 = \underbrace{2 \times 16} + 3 = \underbrace{32 + 3} = 35. \]
Of course, as before, we can also use brackets to change the order in which we do the
operations, e.g.
\[ (\underbrace{2 \times 4})^2 + 3 = \underbrace{8^2} + 3 = \underbrace{64 + 3} = 67, \]
and
\[ 2 \times (\underbrace{4^2} + 3) = 2 \times (\underbrace{16 + 3}) = \underbrace{2 \times 19} = 38. \]
In particular, when writing out expressions involving brackets, take care to distinguish
between, e.g. $2^3 + 5$ and $2^{3+5}$, as the former is 13 whilst the latter is 256!
Also, similar to what we saw earlier, it is possible to remove the brackets from
expressions involving powers by applying the power to all of the numbers in the bracket,
provided that these numbers are only multiplied or divided. For example,
\[ (2 \times 3)^4 = 2^4 \times 3^4 = 16 \times 81 = 1{,}296 \]
and
\[ \left( \frac{2}{3} \right)^4 = \frac{2^4}{3^4} = \frac{16}{81}. \]
The power laws
If we have the same base, then the power laws can allow us to simplify expressions
that involve multiplying powers, dividing powers and raising to powers. These laws are
as follows.
Multiplying powers: If we multiply two powers, we add the powers. For example,
if we have $2^4 \times 2^3$, we can write
\[ 2^4 \times 2^3 = 2^{4+3}, \]
as $2^4 \times 2^3 = 16 \times 8 = 128$ and $2^{4+3} = 2^7 = 128$.
Dividing powers: If we divide two powers, we subtract the power in the
denominator from the power in the numerator. For example, if we have $2^4 / 2^3$, we
can write
\[ \frac{2^4}{2^3} = 2^{4-3}, \]
as $\frac{2^4}{2^3} = \frac{16}{8} = 2$ and $2^{4-3} = 2^1 = 2$.
Raising to powers: If we raise a power to another power, we multiply the
powers. For example, if we have $(2^4)^3$, we can write
\[ (2^4)^3 = 2^{4 \times 3}, \]
as $(2^4)^3 = 16^3 = 4{,}096$ and $2^{4 \times 3} = 2^{12} = 4{,}096$.
Notice that, if the bases of the powers are not the same, then we cannot use the power
laws. For example, to calculate
$3^4 \times 2^5$ we could use $3^4 \times 2^5 = 81 \times 32 = 2{,}592$, but we could not use
the power law.
$\frac{3^4}{2^5}$ we could use $\frac{3^4}{2^5} = \frac{81}{32}$, but we could not use the
power law.
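A quick numerical check of the three power laws is sketched below, assuming Python,
where `**` denotes ‘raised to the power’. Each pair holds the two sides of one of the
laws above, so the two entries of every pair should be equal; the names are ours.

```python
# (left-hand side, right-hand side) for each power law with base 2.
multiplying = (2**4 * 2**3, 2**(4 + 3))
dividing = (2**4 / 2**3, 2**(4 - 3))
raising = ((2**4)**3, 2**(4 * 3))

# With different bases the laws do not apply, so we evaluate each power directly.
different_bases = 3**4 * 2**5

print(multiplying, dividing, raising, different_bases)
```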
Negative integer powers
Negative integer powers, such as −1, −2, −3, . . ., mean ‘take the reciprocal of the base
raised to the corresponding positive power’. For example,
\[ 4^{-1} = \frac{1}{4^1} = \frac{1}{4}, \quad 4^{-2} = \frac{1}{4^2} = \frac{1}{16}, \quad 4^{-3} = \frac{1}{4^3} = \frac{1}{64}, \quad \ldots. \]
In particular, note that a power of −1 is the same as the reciprocal, e.g.
$4^{-1} = \frac{1}{4}$ which is the reciprocal of 4. Similarly, this means that
\[ \left( \frac{3}{5} \right)^{-1} = \frac{5}{3}, \]
which is the reciprocal of $\frac{3}{5}$.
Zero powers
We now observe that any non-zero number raised to the power zero is one. For example, as
\[ 4^1 \times 4^{-1} = 4^{1-1} = 4^0, \]
by the power law, and
\[ 4^1 \times 4^{-1} = 4 \times \frac{1}{4} = 1, \]
we can see that $4^0 = 1$.
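Negative and zero powers can be checked exactly by working with fractions rather than
decimals. The sketch below assumes Python's standard `fractions` module, in which a
`Fraction` can be raised to any integer power.

```python
from fractions import Fraction

four = Fraction(4)

negative_powers = [four**-1, four**-2, four**-3]   # reciprocals of 4, 16 and 64
reciprocal = Fraction(3, 5)**-1                    # a power of -1 swaps top and bottom
zero_power = four**0
product = four**1 * four**-1                       # same as 4 to the power 1 - 1

print(negative_powers, reciprocal, zero_power, product)
```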
Fractional powers I: Square roots
A square root of a number, say 64, is a number which, when multiplied by itself, gives
us 64. So, as
8 × 8 = 64,
we can see that 8 is a square root of 64. Indeed, since a negative number times a
negative number is positive, we can see that
(−8) × (−8) = 64,
and so −8 is also a square root of 64. Thus, we can see that the square roots of 64 are 8
and −8. We often express this by saying ‘the square roots of 64 are ±8’ where the ‘±’ is
read ‘plus or minus’. Thus, we can see, by repeating this argument, that every positive
number has two square roots, one positive and one negative, and both of the same
magnitude.
What about other numbers? Well, since 0 × 0 = 0, we can see that the square root of
zero is zero and, moreover, zero is the only square root of zero. And, if we consider
negative numbers, say −64, we can see that there are no square roots since there is no
way of multiplying a number by itself to get −64.
We often denote the positive square root of a number, say 64, by ‘$\sqrt{64}$’ and so,
from the above we can see that $\sqrt{64} = 8$ and $\sqrt{0} = 0$. Of course, as negative
numbers have no square roots, something like $\sqrt{-64}$ does not exist.
Going back to our earlier example, as the square root of 64 is a number which, when
multiplied by itself, gives us 64 we can see that
\[ (\sqrt{64})^2 = \sqrt{64} \times \sqrt{64} = 64, \]
and this is why the square root is so called: squaring the square root gives us the original
number. Now, if we think of raising the number 64 to the power $\frac{1}{2}$, we can see that
\[ (64^{\frac{1}{2}})^2 = 64^{\frac{1}{2} \times 2} = 64^1 = 64, \]
using the power laws. And, comparing these two expressions, it is natural to think of
$64^{\frac{1}{2}}$ as exactly the same thing as $\sqrt{64}$, i.e.
\[ 64^{\frac{1}{2}} = \sqrt{64}, \]
and so we identify square roots with powers of $\frac{1}{2}$.
Activity 1.2 Find the square roots of 4, 9, 16, 25, 36 and 49.
Fractional powers II: nth roots
More generally, if n is a positive integer greater than 2, we say that an nth root of a
number, say 64, is a number which gives us 64 when raised to the power n. We often
denote the nth root of a number, say 64, by $\sqrt[n]{64}$. For example,
the cube root of 64, denoted by $\sqrt[3]{64}$, is 4 as four cubed is 64, i.e.
\[ 4^3 = 64 \quad \text{and so} \quad \sqrt[3]{64} = 4. \]
Notice that 64 has no negative cube root since $(-4)^3 = -64$ and not 64; as such, 64
only has one cube root, i.e. 4. Repeating this argument, we can see that all positive
numbers only have one cube root.
In terms of powers, as
\[ (\sqrt[3]{64})^3 = 4^3 = 64 \quad \text{and} \quad (64^{\frac{1}{3}})^3 = 64^{\frac{1}{3} \times 3} = 64^1 = 64, \]
comparing these two expressions it is natural to think of $64^{\frac{1}{3}}$ as exactly the
same thing as $\sqrt[3]{64}$, i.e.
\[ 64^{\frac{1}{3}} = \sqrt[3]{64}, \]
and so we identify cube roots with powers of $\frac{1}{3}$.
the sixth root of 64, denoted by $\sqrt[6]{64}$, is 2 as two to the power six is 64, i.e.
\[ 2^6 = 64 \quad \text{and so} \quad \sqrt[6]{64} = 2. \]
Notice that 64 also has a negative sixth root since $(-2)^6 = 64$ and so 64 has two
sixth roots, i.e. ±2. Repeating this argument, we can see that all positive numbers
will have two sixth roots.
In terms of powers, as
\[ (\sqrt[6]{64})^6 = 2^6 = 64 \quad \text{and} \quad (64^{\frac{1}{6}})^6 = 64^{\frac{1}{6} \times 6} = 64^1 = 64, \]
comparing these two expressions it is natural to think of $64^{\frac{1}{6}}$ as exactly the
same thing as $\sqrt[6]{64}$, i.e.
\[ 64^{\frac{1}{6}} = \sqrt[6]{64}, \]
and so we identify sixth roots with powers of $\frac{1}{6}$.
And, more generally, we can write the positive nth root of a number a, or $\sqrt[n]{a}$,
as a to the power of $\frac{1}{n}$, i.e. $a^{\frac{1}{n}}$.
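The observations about how many roots the number 64 has can be confirmed by a simple
search. The Python sketch below uses a hypothetical helper, `integer_roots`, which tries
every whole number between −|value| and |value| and keeps those that give the required
value when raised to the power n. It only finds whole-number roots, which is enough for
the examples in this section.

```python
def integer_roots(value, n):
    """Return every whole number r for which r to the power n equals value."""
    bound = abs(value)
    return [r for r in range(-bound, bound + 1) if r**n == value]

square_roots_of_64 = integer_roots(64, 2)     # one positive and one negative root
cube_roots_of_64 = integer_roots(64, 3)       # a single positive root
sixth_roots_of_64 = integer_roots(64, 6)      # one positive and one negative root
square_roots_of_minus_64 = integer_roots(-64, 2)   # no square roots at all

print(square_roots_of_64, cube_roots_of_64, sixth_roots_of_64, square_roots_of_minus_64)
```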
Activity 1.3 Find the cube root of 27 and the fourth roots of 81.
Fractional powers III: powers of nth roots

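Other fractional powers can be evaluated using the rules above, e.g. to evaluate
$8^{\frac{2}{3}}$ we can think of it as
\[ 8^{\frac{2}{3}} = 8^{2 \times \frac{1}{3}} = (8^2)^{\frac{1}{3}} = 64^{\frac{1}{3}} = 4, \]
or as
\[ 8^{\frac{2}{3}} = 8^{\frac{1}{3} \times 2} = (8^{\frac{1}{3}})^2 = 2^2 = 4, \]
using the power laws. Other examples involving fractional powers would be
\[ (3^{\frac{1}{2}})^4 = 3^{\frac{1}{2} \times 4} = 3^2 = 9, \]
and
\[ \frac{4^{\frac{2}{3}}}{4^{\frac{1}{6}}} = 4^{\frac{2}{3} - \frac{1}{6}} = 4^{\frac{4-1}{6}} = 4^{\frac{3}{6}} = 4^{\frac{1}{2}} = 2, \]
using the power laws.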
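A calculator or computer will usually evaluate fractional powers using decimals, so the
answers it gives may differ very slightly from the exact values because of rounding. The
Python sketch below therefore compares each result with the exact answer using
`math.isclose`, which tests whether two numbers agree to within a tiny tolerance, rather
than testing for exact equality.

```python
import math

direct = 8 ** (2 / 3)                     # 8 to the power 2/3
root_then_square = (8 ** (1 / 3)) ** 2    # cube root first, then square
power_of_root = (3 ** (1 / 2)) ** 4       # square root of 3, raised to the power 4
quotient = 4 ** (2 / 3) / 4 ** (1 / 6)    # same base, so the powers subtract

# Each pair is (computed value, exact value from the text).
comparisons = [(direct, 4), (root_then_square, 4), (power_of_root, 9), (quotient, 2)]
for computed, exact in comparisons:
    print(computed, math.isclose(computed, exact))
```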
Fractional powers IV: Warnings
When using the above ideas you should also bear the following in mind.
When using the square root and nth root sign, i.e. $\sqrt{\phantom{x}}$ and
$\sqrt[n]{\phantom{x}}$, always be clear about what parts of the expression are included
in the root. For example,
\[ \sqrt{4 \times 16} \quad \text{and} \quad \sqrt{4} \times 16, \]
are different expressions (the former is equal to 8 whilst the latter is equal to 32).
Generally speaking, you can make your expressions clear by extending the ‘tail’ of
the root sign or using brackets.
Be careful when working with powers of negative numbers since even roots of
negative numbers do not exist. For example,
\[ ((-2)^2)^{\frac{1}{2}} = 4^{\frac{1}{2}} = 2, \]
is fine, but $(-2)^{\frac{1}{2}}$ does not exist and, as such, nor does
$((-2)^{\frac{1}{2}})^2$.
Recap on combinations of operations
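To summarise everything we have seen above about this, operations are done in
‘BEDMAS’ order, i.e.
Brackets, Exponents, Division, Multiplication, Addition, Subtraction.
Here, division and multiplication have the same priority as each other, as do addition
and subtraction. Whenever operations have the same priority, we work from left to right.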
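The examples involving powers from earlier in this section illustrate this ordering. As
a sketch (assuming Python, which follows the same conventions, with `**` for powers):

```python
power_first = 2 * 4**2 + 3        # exponent, then multiplication, then addition
bracket_first = (2 * 4)**2 + 3    # the bracket is worked out before the exponent
bracket_inside = 2 * (4**2 + 3)   # exponent inside the bracket, then the bracket
sum_after_power = 2**3 + 5        # very different from the next expression
sum_in_power = 2**(3 + 5)
left_to_right = 6 / 2 * 3         # division and multiplication, left to right

print(power_first, bracket_first, bracket_inside)
print(sum_after_power, sum_in_power, left_to_right)
```

The last line is a reminder that division does not automatically come before
multiplication: 6 ÷ 2 × 3 is worked from left to right, as in the rules for
multiplication and division given earlier.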
1.2 Algebra
We use algebra to express and manipulate information about unknown quantities.
These unknown quantities are called variables and these are normally represented by
letters such as x, y and z. One way to think of this is that numbers are constants, i.e.
they always have the same value, whereas variables can take different values depending
on the context.
1.2.1 Algebraic expressions

An algebraic expression is a sequence of numbers, variables and operations, e.g.
4x + 3y − 7. In expressions such as this, 4x means 4 × x, i.e. four lots of x. As such, we
can see that, for any value of x, we have things like
4x + 3x = 7x,
as four lots of x plus three lots of x is seven lots of x. Note that all of the mathematical
operations that we have seen so far can be used in algebraic expressions.
Attributing meaning to algebraic expressions
Often, we use mathematical expressions to represent the value of some quantity. For
instance, we can consider the following examples.
1. If you have a job which pays £10 per hour and you work x hours, then your income
is given by the algebraic expression £10x.
2. If a firm has a revenue of £x and costs of £y, then its profit is £(x − y).
3. If a firm prices a product at £x per unit and sells x units of this product, then the
revenue is £$x^2$. If the costs are £x, then its profit is £$(x^2 - x)$.
As the above examples show, some algebraic expressions contain one variable, such as
4x + 3x, some contain two variables, such as 4x + 3y − 7, and some can contain one
variable used several times, such as $x^2 - x$ where x is used twice (i.e. once in an x term
and once in an $x^2$ term). Of course, the quantities represented may be more complicated
than those given in these examples.
Example 1.1 Suppose that you heat your house with gas for d days per year and
on each day you use m cubic metres of gas. This means that you use dm cubic
metres of gas each year.
If gas costs £P per cubic metre, this means that the cost of heating your house for a
year is £dmP .
Suppose that you must also pay a fixed amount of £81 per year to the gas company.
This means that the cost of heating your house for a year is now £(dmP + 81).
Suppose that you pay your gas bill in twelve equal monthly instalments. This means
that you must pay
\[ \pounds \frac{dmP + 81}{12} \]
every month.
Activity 1.4 What will the annual payment be if the gas company raises the price
of gas by £p per cubic metre? What will the corresponding monthly repayments be?
Evaluating algebraic expressions
Given an algebraic expression, we are sometimes given specific values for each of the
variables involved and asked to evaluate it, i.e. find a value for the whole algebraic
expression given the values of the variables. So, for example, using our examples above
we have the following.
1. With x = 5, you have a job which pays £10 per hour and you work 5 hours, then
your income is given by £(10 × 5) = £50.
2. With x = 40 and y = 30, the firm has a revenue of £40 and costs of £30, and so its
profit is £(40 − 30) = £10.
3. With x = 10, the firm prices the product at £10 per unit and sells 10 units, i.e. the
revenue will be £$10^2$. The costs will be £10, and so its profit is
£$(10^2 - 10)$ = £(100 − 10) = £90.
Indeed, we can also look at how this works in our more complicated example.
Example 1.2 Following on from Example 1.1, suppose that when heating your
house, gas costs £0.12 per cubic metre and that you use 13 cubic metres of gas per
day for 125 days. This means that we have to pay
\[ \pounds \frac{13 \times 125 \times 0.12 + 81}{12} = \pounds \frac{195 + 81}{12} = \pounds \frac{276}{12} = \pounds 23 \]
every month.
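Evaluating an algebraic expression in this way is exactly what a function in a
programming language does. As an illustration, the Python sketch below turns the
expression from Example 1.1 into a function; the function and parameter names are our
own, and the standard `decimal` module is used so that an amount such as £0.12 is
stored exactly.

```python
from decimal import Decimal

def monthly_payment(days, metres_per_day, price_per_metre, fixed_charge=Decimal("81")):
    """Monthly instalment in pounds: (d x m x P + fixed charge) divided by 12."""
    annual_cost = days * metres_per_day * price_per_metre + fixed_charge
    return annual_cost / 12

# The values used in Example 1.2: d = 125, m = 13 and P = 0.12.
payment = monthly_payment(125, 13, Decimal("0.12"))
print(payment)
```

Changing the arguments passed to `monthly_payment` corresponds to evaluating the same
algebraic expression at different values of d, m and P.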
Activity 1.5 What is the cost of heating your house for a year?
What will the annual payment be if the gas company raises the price of gas by 8p
per cubic metre? What will the corresponding monthly repayments be?
Simplifying algebraic expressions
As long as we take care to combine ‘like with like’, an algebraic expression can
sometimes be simplified, i.e. it can be changed into a form that is easier to evaluate
without altering what we will get from an evaluation. For example, we saw earlier that
4x + 3x = 7x,
and so we can write 4x + 3x as 7x, which is simpler. In particular, we can often simplify
expressions by removing brackets from an expression and simplifying what remains, e.g.
if we have an algebraic expression like 3(2x) we can think of this as ‘three lots of 2x’
which gives us 6x, i.e.
3(2x) = 6x.
However, if we have an algebraic expression like 3(x + 2), which we can think of as
‘three lots of x + 2’, we can remove the brackets by multiplying everything inside the
brackets by 3, i.e.
3(x + 2) = 3x + 6,
whereas if we have an algebraic expression like −(2x − 1), we can think of the minus as
telling us to multiply everything inside the brackets by −1, i.e.
−(2x − 1) = −2x + 1.
Indeed, we may be able to do some simplifying after we have multiplied out the
brackets, e.g.
2(x + 3) + x = 2x + 6 + x = 3x + 6,
where, here, we have multiplied out the brackets and collected ‘like’ terms to get a
simpler expression. Some other examples of simplifying algebraic expressions are:
4x − 3x = x,
4(2x) − x = 8x − x = 7x,
3(x + y) = 3x + 3y,
3(x + 1) + 4(x − 1) = 3x + 3 + 4x − 4 = 7x − 1, and
3(x + 1) − 4(x − 1) = 3x + 3 − 4x + 4 = −x + 7.
Notice that none of these simplifications changes the outcome of any evaluation which
we may want to perform, i.e. whatever we get if we evaluate the expression at the start
we will also get if we evaluate the expression at the end. In this sense, the expressions
may look different, but algebraically they are the same throughout.
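One way to spot a slip in a simplification is to evaluate the expression before and after at a few values of x. The sketch below does this in Python; the helper name `agree_on_samples` is hypothetical, and agreement on a handful of samples is a sanity check rather than a proof.

```python
def agree_on_samples(before, after, samples=range(-5, 6)):
    """Return True if both expressions give the same value at every sample."""
    return all(before(x) == after(x) for x in samples)

# 3(x + 1) - 4(x - 1) simplifies to -x + 7
print(agree_on_samples(lambda x: 3*(x + 1) - 4*(x - 1), lambda x: -x + 7))  # True

# A deliberate sign slip, -x - 1, is caught by the check
print(agree_on_samples(lambda x: 3*(x + 1) - 4*(x - 1), lambda x: -x - 1))  # False
```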
Multiplying out two pairs of brackets
Sometimes we will want to multiply out the brackets in more complicated expressions.
For example, how would you remove the brackets from (x + 3)(y − 2)? We can think of
this in two ways:
Multiplying out the first bracket, everything in the first bracket needs to be
multiplied by the second bracket, i.e.
(x + 3)(y − 2) = x(y − 2) + 3(y − 2),
and then simplifying this as before we get
(x + 3)(y − 2) = x(y − 2) + 3(y − 2) = xy − 2x + 3y − 6.
Multiplying out the second bracket, everything in the second bracket needs to be
multiplied by the first bracket, i.e.
(x + 3)(y − 2) = (x + 3)y + (x + 3)(−2),
and then simplifying this as before we get
(x + 3)(y − 2) = (x + 3)y + (x + 3)(−2) = xy + 3y − 2x − 6.
But, notice that these are the same expression, and so we can multiply out in either
way as long as we make sure that every term in a bracket is multiplied by every term in
the other bracket.
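The rule that every term in one bracket is multiplied by every term in the other can be sketched as a pair of nested loops. This is only an illustration: the representation of a bracket as a dictionary (variable name to coefficient, with the empty string standing for the constant term) and the function name `multiply_out` are assumptions made for this example.

```python
from collections import Counter

def multiply_out(first, second):
    """Multiply every term in `first` by every term in `second`.

    Each bracket maps a variable name ('' for a constant) to its
    coefficient, e.g. x + 3 is {'x': 1, '': 3}.
    """
    result = Counter()
    for var1, coef1 in first.items():
        for var2, coef2 in second.items():
            name = ''.join(sorted(var1 + var2))  # 'x' and 'y' give 'xy'
            result[name] += coef1 * coef2        # collect like terms
    return dict(result)

# (x + 3)(y - 2) = xy - 2x + 3y - 6
print(multiply_out({'x': 1, '': 3}, {'y': 1, '': -2}))
# {'xy': 1, 'x': -2, 'y': 3, '': -6}
```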
Activity 1.6 We can write $(x + 3)^2$ as (x + 3)(x + 3). Use this to remove the
brackets from the expression $(x + 3)^2$. In a similar manner, remove the brackets from
the expression $(2x + 3)^2$.
Factorising
Sometimes we can simplify expressions even further by putting brackets in, e.g. going
back to an earlier example, we could write
2(x + 3) + x = 2x + 6 + x = 3x + 6 = 3(x + 2),
as 3(x + 2) = 3x + 6 if we multiply out the brackets. The process of putting brackets
into an expression is called factorisation. For our current purposes, we just need to
note that we can factorise when every term in our expression has a common factor, such
as 3 in the example above. Some other examples, which can be verified by multiplying
out the brackets, are:
2x − 6 = 2(x − 3),
−2x − 10 = −2(x + 5), and
3xy − 12y = 3y(x − 4).
We will return to factorisation in Unit 3.
1.2.2 Equations, formulae and inequalities
So far, we have considered how to manipulate algebraic expressions and what they may
be used to express. We now look at the ways in which a pair of algebraic expressions
may be related to one another.
Equations
An equation is a mathematical statement which sets two algebraic expressions equal to
one another. For example, a = b, $x^2 = 4$ and x + 3 = −2x + 4 are all equations.
A solution to an equation is a value for each variable in the equation which is such that,
when we evaluate both expressions with these values substituted for the variables, the
expressions are equal. For example, x = 3 is a solution of the equation $x^2 − 3 = 2x$ as,
substituting x = 3 into both sides we get the same number, i.e. 6. Sometimes, an
equation can have more than one solution. For example, x = −1 is also a solution of
$x^2 − 3 = 2x$ as, substituting x = −1 into both sides we get the same number, i.e. −2.
Generally speaking, as we shall see in Units 2 and 3, a given equation may have no
solutions, one solution or many solutions.
To solve an equation is to find all of its solutions. Sometimes this is easy and sometimes
it is not. In the simplest case, we just have to simplify both sides to
see the solution. For example, to solve the equation 4x − 3x = 2 + 5, we simplify both
sides to see that x = 7.
If this doesn’t work, we can rearrange the equation into a simpler equation that has the
same solution(s). To do this, we proceed by performing some well-chosen mathematical
operation on both sides at the same time so that the equation is unchanged, but
simplified. The mathematical operations that we can use in such cases are:
add (or subtract) an expression from both sides;
multiply (or divide) by a non-zero expression on both sides.

However, raising both sides to a power can cause problems: if we square both sides, say,
we may introduce extra solutions because a positive number has two square roots. For example, the equation
4x − 8 = 2x + 4 has the same solutions as the equations
4x − 8 − (4x + 4) = 2x + 4 − (4x + 4),
and
\[ \frac{4x - 8}{9} = \frac{2x + 4}{9}, \]
but it has different solutions to the equation
\[ (4x - 8)^2 = (2x + 4)^2. \]
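To see concretely how squaring changes the set of solutions, the following Python sketch tests two candidate values in both equations. The candidate x = 2/3 comes from solving 4x − 8 = −(2x + 4), which is the extra possibility that squaring allows; the function names are hypothetical.

```python
from fractions import Fraction

def satisfies_original(x):
    """Does x satisfy 4x - 8 = 2x + 4?"""
    return 4*x - 8 == 2*x + 4

def satisfies_squared(x):
    """Does x satisfy (4x - 8)^2 = (2x + 4)^2?"""
    return (4*x - 8)**2 == (2*x + 4)**2

for x in (Fraction(6), Fraction(2, 3)):
    print(x, satisfies_original(x), satisfies_squared(x))
# 6 True True
# 2/3 False True
```

So x = 2/3 solves the squared equation but not the original one.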
Bearing this in mind, let’s see how we would actually solve this equation.
Example 1.3 Solve the equation 4x − 8 = 2x + 4.
We solve this by rearranging it, i.e. performing some well-chosen mathematical
operations on both sides at the same time:
4x − 8 = 2x + 4               (our equation)
4x − 8 − 2x = 2x + 4 − 2x     (subtracting 2x from both sides)
2x − 8 = 4                    (simplifying)
2x − 8 + 8 = 4 + 8            (adding 8 to both sides)
2x = 12                       (simplifying)
x = 6                         (dividing both sides by 2)
Thus, the solution to our equation is x = 6.
Lastly, always check that any solution you find really is a solution by using it to evaluate both
sides of the original equation.
Activity 1.7 Check that x = 6 is a solution to the original equation.
Example 1.4 Solve the equation 3x + 6 = 5x − 10.
We again proceed by rearranging the equation:
3x + 6 = 5x − 10              (our equation)
3x + 6 − 3x = 5x − 10 − 3x    (subtracting 3x from both sides)
6 = 2x − 10                   (simplifying)
6 + 10 = 2x − 10 + 10         (adding 10 to both sides)
16 = 2x                       (simplifying)
8 = x                         (dividing both sides by 2)
Thus, the solution to our equation is x = 8.
Activity 1.8 Check that x = 8 is a solution to the equation 3x + 6 = 5x − 10.
The equations in the last two examples are linear equations and they will be the
starting point for a more detailed discussion of equations that will start in Unit 2.
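The rearrangements used in Examples 1.3 and 1.4 follow the same pattern, which can be sketched in Python as below. The function `solve_linear` and its argument order are hypothetical; it assumes the equation has the form ax + b = cx + d with a different from c, so that there is exactly one solution.

```python
from fractions import Fraction

def solve_linear(a, b, c, d):
    """Solve a*x + b = c*x + d, assuming a != c.

    Subtracting c*x and b from both sides gives (a - c)*x = d - b,
    and dividing both sides by (a - c) gives x.
    """
    if a == c:
        raise ValueError("the x terms cancel, so there is no unique solution")
    return Fraction(d - b, a - c)

print(solve_linear(4, -8, 2, 4))    # 6, as in Example 1.3
print(solve_linear(3, 6, 5, -10))   # 8, as in Example 1.4
```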
Inequalities
An inequality is a mathematical statement where two algebraic expressions are related
by an inequality sign, such as ‘>’ or ‘<’, so that we know that one of the expressions is
greater than or less than the other. An inequality is solved by finding the range of
values, for each variable, that makes it true. For example, the inequality x < 2 is true
precisely when x < 2.
As with equations, inequalities can be solved by rearranging them into simpler
inequalities that are true for the same range of values. Generally, given an inequality,
this means that we can:
add (or subtract) an expression from both sides, or
multiply (or divide) by a positive expression on both sides,

to simplify, but not change, the inequality. For example,
x + 4 > −1 can be simplified to give x > −5 by subtracting 4 from both sides.
3x > 6 can be simplified to give x > 2 by dividing both sides by 3 (as 3 is positive).
However, if we multiply (or divide) by a negative expression, we must ‘reverse the
direction’ of the inequality. For example,
−3x > 6 can be simplified to give x < −2 by dividing both sides by −3 and
reversing the direction of the inequality (as −3 is negative).
To see why we need to do this, consider the inequality 2 < 3 which is true. If we
multiply by 2 (which is positive) we get 4 < 6 which is still true, but if we multiply by
−2 (which is negative) we get −4 < −6 which is not true. However, if we reverse the
direction of the inequality as well, we get −4 > −6 which is now true.
Example 1.5 Solve the inequality 4x − 6 < 6x − 2.
We solve this by rearranging it, i.e. performing some well-chosen mathematical
operations on both sides at the same time:
4x − 6 < 6x − 2               (our inequality)
4x − 6 − 4x < 6x − 2 − 4x     (subtracting 4x from both sides)
−6 < 2x − 2                   (simplifying)
−6 + 2 < 2x − 2 + 2           (adding 2 to both sides)
−4 < 2x                       (simplifying)
−2 < x                        (dividing both sides by 2)
Thus, the solution to our inequality is −2 < x, or rewriting this, x > −2.
Alternatively, we could have rearranged it by doing some slightly different
operations:
4x − 6 < 6x − 2               (our inequality)
4x − 6 − 6x < 6x − 2 − 6x     (subtracting 6x from both sides)
−2x − 6 < −2                  (simplifying)
−2x − 6 + 6 < −2 + 6          (adding 6 to both sides)
−2x < 4                       (simplifying)
x > −2                        (dividing both sides by −2 and reversing the inequality)
Thus, the solution to our inequality is, again, x > −2.
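As a sanity check on Example 1.5, the Python sketch below compares the original inequality with the claimed solution x > −2 at a range of sample values, including the boundary value x = −2 itself. The function name and the choice of samples are assumptions made for illustration; this is a spot check, not a proof.

```python
from fractions import Fraction

def original_inequality_holds(x):
    """Is 4x - 6 < 6x - 2 true at this value of x?"""
    return 4*x - 6 < 6*x - 2

# Sample values from -5 to 5 in steps of one half
samples = [Fraction(n, 2) for n in range(-10, 11)]

# The inequality should hold exactly when x > -2
print(all(original_inequality_holds(x) == (x > -2) for x in samples))  # True
```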
Formulae
A formula is an algebraic expression where a single variable, the subject, is equated
to an expression involving other variables. For example, the area, A, of a circle is given
in terms of its radius, r, by the well-known formula $A = \pi r^2$. Sometimes we will want to
rearrange a formula so that a different variable is the subject. The procedure for doing
this is the same as the one we used to solve an equation, but the ‘solution’ will be an
algebraic expression rather than a number.
Example 1.6 Following on from Example 1.1, let S denote the amount, in pounds,
of our monthly gas payments so that
\[ S = \frac{dmP + 81}{12}. \]
If our monthly repayment, S, is now given, for how many days, d, can we heat our
house?
We proceed by rearranging the formula:

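\[
\begin{aligned}
S &= \frac{dmP + 81}{12} && \text{our formula} \\
12S &= dmP + 81 && \text{multiplying both sides by 12} \\
12S - 81 &= dmP && \text{subtracting 81 from both sides} \\
\frac{12S - 81}{mP} &= d && \text{dividing both sides by } mP
\end{aligned}
\]
Thus we can see that the number of days is given by $d = \frac{12S - 81}{mP}$.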
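The rearranged formula can be checked against the earlier figures, where a daily usage of 13 cubic metres at £0.12 per cubic metre for 125 days gave a monthly payment of £23. In the Python sketch below the function name is hypothetical, and m is taken to be the daily usage in cubic metres and P the price per cubic metre in pounds.

```python
from fractions import Fraction

def days_of_heating(S, m, P):
    """Number of days d from the rearranged formula d = (12S - 81) / (mP)."""
    return (12 * S - 81) / (m * P)

# Monthly payment S = 23, usage m = 13, price P = 12/100 of a pound
print(days_of_heating(23, 13, Fraction(12, 100)))  # 125
```

Recovering the original 125 days suggests the rearrangement has been carried out correctly.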
Activity 1.9 In a similar manner, find the price, P , per cubic metre of gas.
Identities
An identity is a special kind of mathematical formula that allows us to rewrite one
mathematical expression in another way. For instance,
\[ x(x + 1) = x^2 + x, \]
is an identity because reading it from left to right tells us how to multiply out the
brackets in ‘x(x + 1)’ and reading it from right to left tells us how to factorise the
quadratic $x^2 + x$. In particular, notice that although this looks like an equation, it isn’t
really because it is true for all values of x! In fact, throughout this unit we have been
reviewing how certain mathematical operations work and, as you have probably
realised, many of these can be usefully summarised by using identities. For instance, the
following identities allow us to summarise some of the ideas that we encountered when
we discussed fractions.
Arithmetic with fractions
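To add and subtract fractions, we use the rules
\[ \frac{a}{b} + \frac{c}{d} = \frac{ad + bc}{bd} \quad\text{and}\quad \frac{a}{b} - \frac{c}{d} = \frac{ad - bc}{bd}, \]
where bd is called the common denominator. To multiply fractions we use the rule
\[ \frac{a}{b} \times \frac{c}{d} = \frac{ac}{bd}, \]
and we divide fractions by using the rule
\[ \frac{a}{b} \div \frac{c}{d} = \frac{a}{b} \times \frac{d}{c}, \]
where d/c is called the reciprocal of c/d.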
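These rules can be tried out with Python's standard `fractions` module. The sketch below writes each rule out explicitly, using the usual cross-multiplied numerators ad + bc and ad − bc over the common denominator bd; the four function names are hypothetical, and the numbers 3/4 and 2/5 are chosen only for illustration.

```python
from fractions import Fraction

def add(a, b, c, d):
    """a/b + c/d = (ad + bc) / (bd)"""
    return Fraction(a*d + b*c, b*d)

def subtract(a, b, c, d):
    """a/b - c/d = (ad - bc) / (bd)"""
    return Fraction(a*d - b*c, b*d)

def multiply(a, b, c, d):
    """a/b * c/d = (ac) / (bd)"""
    return Fraction(a*c, b*d)

def divide(a, b, c, d):
    """a/b divided by c/d = a/b times the reciprocal d/c"""
    return multiply(a, b, d, c)

print(add(3, 4, 2, 5), subtract(3, 4, 2, 5), multiply(3, 4, 2, 5), divide(3, 4, 2, 5))
# 23/20 7/20 3/10 15/8
```

`Fraction` reduces each answer to lowest terms automatically, which is why 6/20 is shown as 3/10.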
At this stage, we can also usefully summarise some of the ways in which powers work as
follows.
Power laws
The power laws state that
\[ a^n \cdot a^m = a^{n+m}, \qquad \frac{a^n}{a^m} = a^{n-m}, \qquad (a^n)^m = a^{nm}, \]
provided that both sides of these expressions exist. In particular, we have
\[ a^0 = 1 \quad\text{and}\quad a^{-n} = \frac{1}{a^n}. \]
If it exists, we also define the positive nth root of a, written $\sqrt[n]{a}$, to be $a^{1/n}$.
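The power laws can be checked numerically for particular values. The Python sketch below does this for a non-zero base and integer powers, using exact fractions so that negative powers are handled without rounding; the function name `power_laws_hold` is hypothetical, and fractional powers (roots) are deliberately left out because they are not exact in floating-point arithmetic.

```python
from fractions import Fraction

def power_laws_hold(a, n, m):
    """Check the power laws for a non-zero base a and integer powers n, m."""
    a = Fraction(a)
    return (
        a**n * a**m == a**(n + m)
        and a**n / a**m == a**(n - m)
        and (a**n)**m == a**(n * m)
        and a**0 == 1
        and a**(-n) == 1 / a**n
    )

print(power_laws_hold(3, 5, 2))   # True
print(power_laws_hold(-2, 3, 7))  # True
```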
We can also summarise some of our results concerning brackets by using identities as
you can see in the next activity.
Activity 1.10 Write out the identities that arise when you remove the brackets
from the following algebraic expressions.
i. a(bc), ii. a(b + c), iii. $(a + b)^2$, iv. (a + b)(c + d).
And, just to be sure that we understand what is going on, try the next activity.
Activity 1.11 Use these identities to simplify the following algebraic expressions.
i. $(x + y)^2 - x(x + y) - y(x + y)$, ii. $\frac{(x + y)^2 - (x - y)^2}{4xy}$, iii. $\sqrt{x + y} - (\sqrt{x} + \sqrt{y})$.
Learning outcomes
At the end of this unit, you should be able to:
simplify and evaluate arithmetic expressions including those that involve brackets
and powers;
manipulate algebraic expressions including those that involve brackets and powers;
solve simple equations and inequalities;
model certain situations using formulae and be able to rearrange such formulae;
use identities to manipulate arithmetic and algebraic expressions.
Exercises
Exercise 1.1
Evaluate the expressions $3 \cdot 2 + \frac{6}{2} \cdot 7 + 4$ and $\frac{3 - (-3 - (4 - 5) - 2) - 6}{-(-(-1)) - 1}$.
Exercise 1.2
Evaluate the following expressions.
i. |−3| + |−2|, ii. |−3| − |−2|, iii. −|3| + |−2|, iv. −|3| − |−2|, v. −|3| − |2|.
Exercise 1.3
Write the mixed fractions $4\frac{2}{7}$, $1\frac{2}{3}$ and $2\frac{1}{4}$ as improper fractions.
Exercise 1.4
Evaluate the following expressions, writing your answers in lowest terms.
i. $\frac{1}{3} + \frac{1}{2}$, ii. $\frac{30}{7} - \frac{5}{3}$, iii. $\frac{2}{5} \cdot \frac{25}{4}$, iv. $\frac{13}{8} \div \frac{9}{4}$.
Exercise 1.5
You deposit £1000 in a bank account that pays 10% interest. What will the balance be
after one year? Two years?
After two years, what is the increase in the balance as a percentage of the original
deposit?
Exercise 1.6
Evaluate the following expressions.
i. $9^2 - 9^{1/2}$, ii. $16^{-1/4} + 16^{1/2}$, iii. $7^{1/3} \cdot 7^{2/3}$, iv. $\frac{10^{-2}}{2^{-10}}$.
Exercise 1.7
Express the following in the simplest form possible.
i. $\frac{x^2 y}{2xz} + \frac{x y^3 z}{xy}$, ii. $x(y^2 z^3)^{1/2}(xz)^{-2}$, iii. $x(xy)^{-2}(x + z)^{1/2}$.
Exercise 1.8
Multiply out the brackets in the following expressions simplifying your answers as far as
possible.
i. (x + 1)(x − 1), ii. (2y + 3)(y − 2), iii. (x + 3y)(2x − y), iv. (2x − 3y)(x + z).
Exercise 1.9
Solve the following equations.
i. −3p = 21, ii. 4q − 1 = 15,
iii. 5z + 4(z − 2) = 1, iv. $\frac{5}{6}k - 2k + \frac{1}{3} = \frac{2}{3}$,
v. 5m − 3(m − 2) = 11(m + 2), vi. 83(w − 1,996) + 17(w − 1,996) = 600.
Exercise 1.10
You hire a car for £20 plus the cost of petrol used. Let x be the distance you travel in
miles and p be the price, in pence, of petrol per gallon. If petrol consumption is 30 miles
per gallon, write down expressions, in pence, for the amount you spend on petrol and
the cost per mile.
Exercise 1.11
Rearrange the formula $y = \frac{z}{2 + x} - 3$ to make x the subject.
Exercise 1.12
Solve the inequality 5 − x > 2x − 1.
Unit 2: Review II
Linear equations and straight lines
Overview
In this unit we continue our study of equations by looking at linear equations in one and
two variables. In particular, we will see that linear equations in two variables represent
straight lines. We see how to sketch these lines by finding their intercepts and
investigate how their gradient allows us to measure changes. Lastly, we will see how to
solve problems that involve simultaneous equations.
Aims
To see how to solve simple linear equations in one and two variables.
To see how linear equations in two variables represent straight lines.
To see how to sketch straight lines and find their gradient.
To see how to solve simultaneous equations.

2.1 Linear equations

We start with a brief review of how to solve linear equations in one variable and see how
such equations can have no solutions, one solution or an infinite number of solutions. We
then look at linear equations in two variables, find that they give us an infinite number
of solutions, and see how we can represent these solutions in a straightforward way.
2.1.1 Linear equations in one variable

A linear equation in one variable, let’s call it x, is an equation of the form
ax = c,
where a ≠ 0 and c are constants. As in Unit 1, we can solve this by dividing through on
both sides by a to get
\[ x = \frac{c}{a}, \]
as a ≠ 0 and, in this case, we say that such an equation has a unique solution.[1] Of
course, the variable need not be x, as linear equations in one variable may use a
different variable, for example, the variable
i. y in 4y = −8, which gives the solution y = −2;
ii. z in 3z = 9, which gives the solution z = 3;
iii. q in 3q = 9, which gives the solution q = 3.
Notice, in particular, that examples ii. and iii. are the same equation written in terms of
two different variables.
More generally, a linear equation in one variable can come about through an equation
that only involves multiples of the variable and constants. For example, if we consider
the equations,
1. 6y + 4 = 2y − 4,
2. 2z − 6 = −z + 3,
3. q − 5 = −2q + 4,
we can rearrange them, as in Unit 1, to yield the linear equations that we saw above.
The only exceptions to this are when we have something like
2x − 6 = 2x + 2 which rearranges to 0 = 8,
and this is never true, i.e. an equation like this has no solutions since, whatever value of
x we put into the equation, it is never satisfied. Or we have something like
2x + 2 = 2x + 2 which rearranges to 0 = 0,
and this is always true, i.e. an equation like this has an infinite number of solutions
since, whatever value of x we put into the equation, it is always satisfied. That is, this
equation is actually an identity because it is true for all values of x.
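The three possibilities described here (one solution, no solutions, or infinitely many solutions) can be summarised in a short Python sketch. The function `classify` is hypothetical and assumes the equation has been written in the form ax + b = cx + d.

```python
def classify(a, b, c, d):
    """Classify the solutions of a*x + b = c*x + d."""
    if a != c:
        # The x terms do not cancel, so we can divide by (a - c)
        return "one solution"
    # The x terms cancel, leaving the statement b = d
    return "infinitely many solutions" if b == d else "no solutions"

print(classify(6, 4, 2, -4))   # 6y + 4 = 2y - 4: one solution
print(classify(2, -6, 2, 2))   # 2x - 6 = 2x + 2: no solutions
print(classify(2, 2, 2, 2))    # 2x + 2 = 2x + 2: infinitely many solutions
```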
2.1.2 Linear equations in two variables

In its simplest form, a linear equation in two variables, say x and y, is an equation of
the form
ax + by = c,
where at least one of the constants a and b is non-zero. Unlike the situation with one
variable, this will generally have an infinite number of solutions.
Example 2.1 If we have the linear equation in two variables given by
2x + y = 7,
then we can rearrange this to get
y = 7 − 2x.
[1] That is, it always has a solution and there is only one solution.
Now, if we substitute any value of x into this equation, it will give us a value of y.
For instance, if we take
x = 1 we get y = 5;
x = 2 we get y = 3;
x = 3 we get y = 1;
and so on for any other values of x that we may choose. Furthermore, each of these
pairs of numbers is a solution to the equation as putting the x value and its
corresponding y value into the equation satisfies it.
Example 2.2 Consider the linear equation in two variables given by x = 2. Notice
that this linear equation only contains the variable x, but as we are told that it is a
linear equation in two variables, we have to think about what this means for the
other variable, which we can call y. The way to think about this is to write it as
x + 0y = 2,
and then notice that, for any value of y, the quantity ‘0y’ is always zero and so we
must always get x = 2. That is, among the solutions to this linear equation in two
variables we will find the pairs of numbers
x = 2 and y = 1;
x = 2 and y = 2;
x = 2 and y = 3;
and so on for any other value of y as long as we pair it with x = 2.
Activity 2.1 What are the solutions to the linear equation in two variables given
by y = 3?
2.1.3 Visualising the solutions of linear equations in two variables
Consider again the equation
2x + y = 7
from Example 2.1 and one of the solutions to this equation that we found there, say,
x = 1 and y = 5. We can write such a solution as the ordered pair (1, 5).[2] In such a
pair, we often call the value of x, i.e. 1 in this case, the x-coordinate and the value of y,
i.e. 5 in this case, the y-coordinate. Indeed, such ordered pairs, or coordinates, can be
represented as a point on a diagram such as the one in Figure 2.1(a). This diagram
consists of two axes, the x-axis that runs horizontally and the y-axis that runs
vertically. We take the point where the axes cross, labelled O in this diagram, to be the
point with coordinates (0, 0) and we call this the origin. We often refer to the ‘space’
which contains all the points with (x, y) coordinates as the ‘xy-plane’.[3] Repeating this
for the other two solutions of the equation we found earlier, i.e. those with coordinates
(2, 3) and (3, 1), yields three points on our diagram as shown in Figure 2.1(b). This
procedure, of representing the solutions to an equation in two variables as points on
such a diagram, is known as plotting those points.
[2] As we shall see, this is an ordered pair since the pair (1, 5) is different from the pair (5, 1). That is,
the order in which the numbers appear in the brackets matters.
[Figure 2.1 image: panels (a) and (b), each drawn on x and y axes with origin O.]
Figure 2.1: Plotting points. (a) The point (1, 5) and (b) the points (1, 5), (2, 3) and (3, 1).
All of the points plotted here are solutions to the equation 2x + y = 7.
If we were to repeat this procedure, i.e. if we were to plot all the points which
represented a solution to our linear equation, we would find that they are all on the
straight line shown in Figure 2.2(a). In fact, any linear equation in two variables that
has an infinite number of solutions can be represented as a line on such a diagram.
Indeed, the lines which represent the linear equations in two variables given by x = 2
(from Example 2.2) and y = 3 (from Activity 2.1) are illustrated in Figure 2.2(b). In
particular, notice that points on the vertical line, which represent the solutions to the
equation x = 2, always have coordinates of the form (2, y) where y can take any value.
Activity 2.2 In a similar manner, what can we say about the coordinates of the
points on the horizontal line? What are the coordinates of the point at which this
horizontal line and this vertical line intersect?
Activity 2.3 If we have the vertical line x = k and the horizontal line y = l where
k and l are constants, what can we say about the coordinates of the points on these
lines? What are the coordinates of the point at which these two lines intersect?
[3] Basically, it’s a ‘plane’ because it is ‘flat and level’. It’s the xy-plane because the points have (x, y)
coordinates as determined by the x and y-axes.
[Figure 2.2 image: panel (a) shows the line 2x + y = 7; panel (b) shows the lines x = 2 and y = 3; each panel is drawn on x and y axes with origin O.]
Figure 2.2: Drawing straight lines. (a) Each point on this line has coordinates that satisfy
the equation 2x + y = 7. (b) Each point on the vertical line has coordinates that satisfy
the equation x = 2 and each point on the horizontal line has coordinates that satisfy the
equation y = 3.
Activity 2.4 What are the equations of the lines that we use to represent the x and
y-axes? What are the coordinates of the points on these lines?
2.2 Straight lines

We now turn our attention to straight lines in general. In particular, we want to be able
to draw the straight line that represents the solutions to a given linear equation in two
variables and we want to be able to find the linear equation in two variables whose
solutions are represented by a given straight line.
2.2.1 Drawing straight lines given their equations

From what we have seen above, there are three kinds of straight line depending on the
form of the linear equation in two variables we are dealing with. In particular, if we
have the equation
ax + by = c,
then we find that:
If a = 0 and b ≠ 0, then the equation can be written as y = c/b and so, for any
value of x, a point with coordinates (x, c/b) is on this line. As in Activity 2.1, where
we had y = 3, we see that these equations represent horizontal straight lines as
illustrated in Figure 2.2(b). In particular, the line with equation y = 0 is the x-axis.
If a ≠ 0 and b = 0, then the equation can be written as x = c/a and so, for any
value of y, a point with coordinates (c/a, y) is on this line. As in Example 2.2,
where we had x = 2, we see that these equations represent vertical straight lines as
illustrated in Figure 2.2(b). In particular, the line with equation x = 0 is the y-axis.
If a ≠ 0 and b ≠ 0, then the equation cannot be written so simply and any point
whose coordinates satisfy this equation will be on this line. As in Example 2.1,
where we had 2x + y = 7, we see that these equations represent lines which are
neither horizontal nor vertical and we call them oblique straight lines, as illustrated
in Figure 2.2(a).
Now, given a linear equation in two variables, when it comes to drawing the straight
line that represents its solutions, all we need to do is find at most two points on the
line. In particular, on the one hand, if we can see from its equation that the line is
horizontal or vertical, we need only one point on the line to draw it. On the other hand,
if we can see from its equation that the line is oblique, then we only need to find two
points on the line to draw it. That is, if we find any two points whose coordinates
satisfy the equation, we can plot these two points on our diagram and the line we seek
is the one that goes through these two points.
2.2.2 The intercepts of a straight line

For oblique lines, two extremely easy points to find are the x and y-intercepts. For
instance, if the equation of the line is given by
ax + by = c,
and we have a ≠ 0 and b ≠ 0 so that it is oblique, we can find the:
x-intercept, i.e. the value of x where the line crosses the x-axis. But, the x-axis is
the line y = 0 and so we are looking for the value of x which occurs when y = 0 in
our equation. That is, we want x to be such that
\[ ax = c \implies x = \frac{c}{a}, \]
and so the coordinates of the x-intercept are $\left(\frac{c}{a}, 0\right)$.
y-intercept, i.e. the value of y where the line crosses the y-axis. But, the y-axis is
the line x = 0 and so we are looking for the value of y which occurs when x = 0 in
our equation. That is, we want y to be such that
\[ by = c \implies y = \frac{c}{b}, \]
and so the coordinates of the y-intercept are $\left(0, \frac{c}{b}\right)$.
This general case is illustrated in Figure 2.3(a).
[Figure 2.3 image: panel (a) shows the line ax + by = c with c/a marked on the x-axis and c/b marked on the y-axis; panel (b) shows the line 2x + y = 4; each panel is drawn on x and y axes with origin O.]
Figure 2.3: The x and y-intercepts of an oblique straight line. (a) In general, the oblique
line ax + by = c has a ≠ 0 and b ≠ 0, so the x and y-intercepts are given by the points
(c/a, 0) and (0, c/b) respectively. (b) The line 2x + y = 4 has x and y-intercepts given by the
points (2, 0) and (0, 4) respectively.
Example 2.3 As an example of how this works, consider the linear equation in two
variables given by
2x + y = 4.
This represents an oblique line and so we can find its x and y-intercepts as follows.
For the x-intercept, we set y = 0 to get 2x = 4 and hence x = 2. Thus, the point
with coordinates (2, 0) is the x-intercept of this line.
For the y-intercept, we set x = 0 to get y = 4. Thus, the point with coordinates
(0, 4) is the y-intercept of this line.
Once we have plotted these two points, the line that we seek is the one that goes
through both of them, as illustrated in Figure 2.3(b).
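The intercept calculation in Example 2.3 can be sketched as a small Python function. The name `intercepts` is hypothetical, and the function only covers the oblique case, where both a and b are non-zero.

```python
from fractions import Fraction

def intercepts(a, b, c):
    """Return the x- and y-intercepts of the oblique line a*x + b*y = c."""
    if a == 0 or b == 0:
        raise ValueError("the line is horizontal or vertical, not oblique")
    x_intercept = (Fraction(c, a), 0)   # set y = 0, so a*x = c
    y_intercept = (0, Fraction(c, b))   # set x = 0, so b*y = c
    return x_intercept, y_intercept

x_int, y_int = intercepts(2, 1, 4)      # the line 2x + y = 4
print(x_int == (2, 0), y_int == (0, 4))  # True True
```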
Activity 2.5 Suppose that you are going to spend exactly £3 when buying x
apples and y bananas. If apples cost 50p each and bananas cost 30p each, find a
linear equation in terms of x and y that gives the combinations of apples and
bananas that you can purchase. Draw the straight line that is represented by this
linear equation and comment on its economic significance.
2.2.3 The gradient of a straight line

The gradient, or slope, of a straight line is a measure of how ‘steep’ the line is. That
is, it can be found by taking two points on the line and dividing the change in y by the
change in x as we see in the following definition.
Gradient of a straight line
If $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of two distinct points on a straight line,
then
the change in y is $\Delta y = y_2 - y_1$, and
the change in x is $\Delta x = x_2 - x_1$.
The gradient, m, of this straight line is then given by
\[ m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}. \]
In particular, whichever two points on the straight line we use when finding the
gradient, we will always get the same value.
Example 2.4 Using the line from Example 2.3 which was illustrated in
Figure 2.3(b), we can see that it goes through the points with coordinates (2, 0) and
(0, 4). As such, using these two points, we can see that
the change in y is ∆y = 4 − 0 = 4, and
the change in x is ∆x = 0 − 2 = −2,

which means that the gradient, m, of this line is
\[ m = \frac{\Delta y}{\Delta x} = \frac{4}{-2} = -2. \]
Notice that the gradient of the line is negative and this means, as we can see in the
figure, that along this line the y-coordinate decreases as the x-coordinate increases.
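The definition of the gradient translates directly into a short Python sketch. The function name `gradient` is hypothetical, and it assumes the two points have different x-coordinates so that the change in x is not zero. The second call uses two other points on the line 2x + y = 4, namely (1, 2) and (3, −2), to illustrate that the choice of points does not matter.

```python
from fractions import Fraction

def gradient(point1, point2):
    """Gradient of the line through two points with different x-coordinates."""
    (x1, y1), (x2, y2) = point1, point2
    change_in_y = y2 - y1
    change_in_x = x2 - x1
    return Fraction(change_in_y, change_in_x)

print(gradient((2, 0), (0, 4)))    # -2, as in Example 2.4
print(gradient((1, 2), (3, -2)))   # -2, using two other points on the same line
```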
Activity 2.6 Following on from Example 2.1, find the gradient of the straight line
whose equation is 2x + y = 7.
Following on from Example 2.2 and Activity 2.1, what can you say about the
gradients of the straight lines whose equations are x = 2 and y = 3?
2.2.4 Finding the equation of a straight line

So far, we have seen how the equation of a straight line allows us to draw it and find its
intercepts and gradient. We now consider how we can find the equation of a straight
line if we are given some information about it and there are three common cases which
can occur.
Given the gradient and the y-intercept
If we know that the line has gradient, m, and y-intercept, k, then the equation of the
line is given by
y = mx + k.
For example, if we are told that a line has a gradient of 3 and its y-intercept is the point
with coordinates (0, 7), then the equation of the line is
y = 3x + 7
and, if the gradient of the line is zero and the y-intercept is the point with coordinates
(0, 5), then the equation of the line is y = 5.
Given the gradient and a point on the line
If we are given the gradient of the line, m, and a point on the line other than the
y-intercept, say the point $(x_1, y_1)$, then we know that for any other point, (x, y), on the
line we must have
\[ m = \frac{y - y_1}{x - x_1}, \]
as the gradient of a line is the same regardless of which pair of points on the line we use
to calculate it.
To verify that this formula works, consider again the line which has a gradient of 3 and
whose y-intercept is the point with coordinates (0, 7). Using the formula, this yields the
equation,
\[ 3 = \frac{y - 7}{x - 0} \quad\text{which can be rearranged to give}\quad y = 3x + 7, \]
as before. Similarly, in the case of the line which has a gradient of zero and whose
y-intercept is the point with coordinates (0, 5), we can see that the formula yields the
equation
\[ 0 = \frac{y - 5}{x - 0} \quad\text{which can be rearranged to give}\quad y = 5, \]
as before.
However, the full power of this formula is seen when we have to find the equation of the line
that, for example, has a gradient of 10 and goes through the point with coordinates
(2, 3). Using the formula in this case yields the equation
\[ 10 = \frac{y - 3}{x - 2} \implies y - 3 = 10(x - 2) \implies y - 3 = 10x - 20, \]
or y = 10x − 17. Indeed, we can verify this is correct since the x-coefficient is the
gradient and the point (2, 3) satisfies the equation.
Given two points on the line
If we know that the two distinct points $(x_1, y_1)$ and $(x_2, y_2)$ lie on the line, then its
gradient is given by
\[ m = \frac{y_2 - y_1}{x_2 - x_1}. \]
However, if the point (x, y) is also on the line, then the gradient between it and any
other point on the line, say $(x_1, y_1)$, is given by
\[ m = \frac{y - y_1}{x - x_1}. \]
So, as the line has the same gradient regardless of the pairs of points we take, this
means that the equation of the line is given by
\[ \frac{y - y_1}{x - x_1} = \frac{y_2 - y_1}{x_2 - x_1}. \]
For example, if the points with coordinates (1, −7) and (2, 3) are on the line, then the
equation of the line is given by
\[ \frac{y - (-7)}{x - 1} = \frac{3 - (-7)}{2 - 1} \implies y + 7 = 10(x - 1) \implies y + 7 = 10x - 10, \]
or y = 10x − 17, i.e. this line is the same as the one we saw earlier.
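Finding the line through two points can be sketched in Python as follows. The function `line_through` is hypothetical; it returns the gradient m and the constant k in y = mx + k, and assumes the two points have different x-coordinates.

```python
from fractions import Fraction

def line_through(point1, point2):
    """Return (m, k) such that y = m*x + k passes through both points."""
    (x1, y1), (x2, y2) = point1, point2
    m = Fraction(y2 - y1, x2 - x1)   # gradient from the two points
    k = y1 - m * x1                  # rearranging y1 = m*x1 + k
    return m, k

m, k = line_through((1, -7), (2, 3))
print(m, k)  # 10 -17
```

This agrees with the equation y = 10x − 17 found above.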
2.2.5 Applications of straight lines

If we have a situation where our two variables represent certain quantities and the
relationship between these variables gives us a straight line, we often find that the
gradient of the straight line also has a useful interpretation.
Example 2.5 If y is the distance travelled and x is the time taken, then a linear
equation that relates these two variables would give us a straight line that represents
the distance travelled in terms of the time taken. In this case, the gradient of the
line, i.e.
\[ m = \frac{\Delta y}{\Delta x}, \]
is the speed of the object whose motion we are considering.
If the x variable is time, as in this example, we often call the gradient the rate of change
of y. So, in this example, speed is the rate of change of distance, as one might expect. If
the x variable is something else, then we call the gradient the rate of change of y with
respect to x.
In economics, if x measures the quantity being produced, then gradients are usually
referred to as marginals. So, for instance, the rate of change of profit with respect to
the amount produced would be the marginal profit and so on. To motivate this, let’s
consider another example.
Example 2.6 Suppose that a factory produces a certain product and the profit,
when written in terms of the amount produced, is thought to be linear. One year,
they produce 40 units and lose £4,000, while the next year, production is doubled
to 80 units resulting in a profit of £1,000. What is the equation describing the profit
in this case?
One way to start is to denote the profit by π and the amount produced by x. Then
since we know that we are looking for a linear expression, we can use the
information about change in profit and change in production to calculate the
gradient of this linear function. That is, writing the figures as a change from
(40, −4000) to (80, 1000), we can see that its gradient will be
\[ m = \frac{\Delta\pi}{\Delta x} = \frac{1{,}000 - (-4{,}000)}{80 - 40} = \frac{5{,}000}{40} = 125. \]
This means that our linear relationship between π and x will be given by
π = 125x + k.
However, we can find k as we know that the point (80, 1000) must satisfy this linear
relationship, i.e. we must have
1,000 = 125 × 80 + k ⟹ 1,000 = 10,000 + k ⟹ k = −9,000.
Thus, the linear relationship we seek is
π = 125x − 9,000,
and this can be verified by showing that it is also satisfied by the point (40, −4000).
So, in this example, the gradient of the straight line is the marginal profit of the factory.
That is, when quantities like profit and production are related by a straight line, we say
that the marginal profit is the gradient of that line, i.e. the change in profit divided by
the change in production.
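The calculation in Example 2.6 can also be checked numerically. The following is a minimal Python sketch and not part of the guide's own method: the function name line_through is our own choice, and it simply computes the gradient as the change in y divided by the change in x and then substitutes one of the points to find k.

```python
from fractions import Fraction


def line_through(p1, p2):
    """Return (m, k) such that the line y = m*x + k passes through p1 and p2.

    Hypothetical helper for illustration; p1 and p2 are (x, y) pairs.
    """
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: the gradient is undefined")
    m = Fraction(y2 - y1, x2 - x1)  # gradient = change in y / change in x
    k = y1 - m * x1                 # substitute one point to find k
    return m, k


# The two data points from Example 2.6: (units produced, profit in pounds).
m, k = line_through((40, -4000), (80, 1000))
print(m, k)  # 125 -9000
```

Here m is the marginal profit and k is the profit when nothing is produced, in agreement with π = 125x − 9,000.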
2.3 Simultaneous equations

So far, we have seen how to identify the points that are on a given line, i.e. they will be
the points (x, y) which are solutions to a linear equation in two variables such as
2x + y = 4.
But, what if we want to find the points (x, y) that two lines have in common? That is,
what if we have two linear equations, say,
2x + y = 4 and x − y = −1,
and we want to find the points (x, y) that are solutions to both of them? In such cases,
we say that we are solving the two equations simultaneously, we call the pair of
equations simultaneous equations and we usually denote this by ‘pairing’ them with a
curly bracket, i.e.
\[ \left. \begin{aligned} 2x + y &= 4 \\ x - y &= -1 \end{aligned} \right\} \]
Sometimes, we will refer to a collection of two or more equations, such as the ones
above, as a system of linear equations. We now turn our attention to visualising what
the solution to a pair of simultaneous equations is and how to solve them using algebra.
2.3.1 Visualising the solution to a pair of simultaneous equations

Geometrically, if we draw the two lines represented by this pair of equations, as in
Figure 2.4(a), then we can see that the solution to these simultaneous linear equations,
i.e. the point that the lines they represent have in common, is the point (1, 2) where the
two lines intersect. However, using pictures, no matter how well they are drawn, to find
such points can be inaccurate and so we want to develop an algebraic method for
solving such equations.
Figure 2.4: Finding points of intersection. (a) The lines represented by the linear equations
2x + y = 4 and x − y = −1 intersect at the point (1, 2). (b) The x and y-intercepts of the
line represented by the linear equation 2x + y = 4 are its points of intersection with the
lines y = 0 (i.e. the x-axis) and x = 0 (i.e. the y-axis) respectively.
In fact, we have already seen examples of such an algebraic method, namely when we
found the x and y-intercepts of the line represented by the linear equation
2x + y = 4, as illustrated in Figure 2.3(b). In this case, finding the x-intercept of the
line involved finding the point (x, y) that lies on both the line and the x-axis, i.e. this
involved solving the simultaneous equations
\[ \left. \begin{aligned} 2x + y &= 4 \\ y &= 0 \end{aligned} \right\} \]
whereas finding the y-intercept of the line involved finding the point (x, y) that lies on
both the line and the y-axis, i.e. this involved solving the simultaneous equations
\[ \left. \begin{aligned} 2x + y &= 4 \\ x &= 0 \end{aligned} \right\} \]
and, geometrically, these intersections are illustrated in Figure 2.4(b).
2.3.2 Solving simultaneous equations algebraically

We shall consider two methods for solving simultaneous equations using algebra.
Generally speaking, there is not much difference between the methods and students are
encouraged to use the method that they feel most comfortable with.
Method I: Substitution
This method involves rearranging one of the two equations so that it is in the form
y = mx + k, say, and then using this to substitute for the y in the other equation. This
yields an equation that allows us to solve for x. We can then find y by substituting this
value of x back into our equation of the form y = mx + k.
Example 2.7 To solve the simultaneous equations
\[ \left. \begin{aligned} 2x + y &= 7 \\ x - 2y &= 1 \end{aligned} \right\} \]
by substitution, we make y the subject of the first equation by rearranging. This
yields the equation
y = 7 − 2x,
and then, substituting this into the other equation, we get
x − 2(7 − 2x) = 1 ⟹ x − 14 + 4x = 1 ⟹ 5x = 15,
and so x = 3. Substituting this value into y = 7 − 2x then yields the value of y
which, in this case, is y = 1. Thus, the solution to these simultaneous equations is
x = 3 and y = 1.
Activity 2.7 Verify that the solution to these simultaneous equations is x = 3 and
y = 1 by showing that these values satisfy the two original linear equations.
Activity 2.8 Make x the subject of the second equation in Example 2.7 and use
this to find the solution to the given simultaneous equations.
Activity 2.9 Solve the simultaneous equations y = 2x + 4 and y − 3x = 2 using this
method.
Method II: Elimination
This method involves multiplying each equation by a specially chosen number, namely
the number that makes the coefficients of one of the variables the same in both
equations. Then, by subtracting one equation from the other, we can eliminate that
variable and hence solve what is left for the other.
Example 2.8 To solve the simultaneous equations
\[ \left. \begin{aligned} 2x + y &= 7 \\ x - 2y &= 1 \end{aligned} \right\} \]
by elimination, we want to make the coefficient of x, say, the same in both
equations. So, taking the equations individually we have:
2x + y = 7     multiply by 1 to get     2x + y = 7
x − 2y = 1     multiply by 2 to get     2x − 4y = 2
               and subtracting gives         5y = 5
which tells us that y = 1. Then, using this value of y in either of the original
equations, say the second, we see that x = 3. Thus, the solution is x = 3 and y = 1.
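The elimination steps of Example 2.8 can be written out as a short program. This is only a sketch under our own conventions: the function name solve_by_elimination and the (a, b, c) representation of the equation ax + by = c are assumptions made for illustration. It multiplies each equation by the x-coefficient of the other, subtracts to eliminate x, and then back-substitutes.

```python
from fractions import Fraction


def solve_by_elimination(eq1, eq2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2, each given as (a, b, c).

    Hypothetical helper; raises ValueError when there is no unique solution.
    """
    a1, b1, c1 = (Fraction(v) for v in eq1)
    a2, b2, c2 = (Fraction(v) for v in eq2)
    # Multiply the first equation by a2 and the second by a1 so that both
    # have x-coefficient a1*a2, then subtract to eliminate x:
    #   (a2*b1 - a1*b2) * y = a2*c1 - a1*c2
    y_coeff = a2 * b1 - a1 * b2
    if y_coeff == 0:
        raise ValueError("no unique solution: the lines are parallel")
    y = (a2 * c1 - a1 * c2) / y_coeff
    # Back-substitute into an equation whose x-coefficient is non-zero.
    x = (c1 - b1 * y) / a1 if a1 != 0 else (c2 - b2 * y) / a2
    return x, y


# The system from Example 2.8: 2x + y = 7 and x - 2y = 1.
x, y = solve_by_elimination((2, 1, 7), (1, -2, 1))
print(x, y)  # 3 1
```

For this system the multipliers are 1 and 2 and the subtraction gives 5y = 5, exactly as in the worked example.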
Activity 2.10 Repeat the calculation in this example, but instead of eliminating x
as we did above, use your multiplications to eliminate y.
Activity 2.11 Solve the simultaneous equations y = 2x + 4 and y − 3x = 2 using
this method.
A warning
With either of these methods, we may find that when we eliminate one variable, we
eliminate the other variable as well, ending up with something like 0 = 0 or 2 = 5. In
such cases we conclude that:
If we get the former, i.e. we get something which is always true, this means that
our simultaneous equations have an infinite number of solutions. This occurs when
the two lines that are represented by our simultaneous equations are actually just
the same line and so every point on this line is a solution as every point is a point
of intersection.
If we get the latter, i.e. we get something which is never true, this means that our
simultaneous equations have no solutions. This occurs when the two lines that are
represented by our simultaneous equations are parallel, i.e. they never intersect,
and so no point on either of the lines can be a solution.
Thus, we can see that in such cases, we will always get parallel lines. And, if the parallel
lines are the same we get an infinite number of solutions, and if they are different we get
no solutions.
One way of seeing when such cases occur is to note that parallel lines have the same
gradient. As such, if you find that your simultaneous equations represent lines with the
same gradient, the question is whether they always intersect (e.g. they have the same
y-intercept as well) or whether they never intersect (e.g. they have different
y-intercepts).
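The three possible outcomes described in this warning can be told apart without solving the equations. The sketch below is an illustration only: the name classify_system, the (a, b, c) representation of ax + by = c and the example systems are all our own assumptions, and each equation is assumed to represent a genuine line (a and b not both zero).

```python
def classify_system(eq1, eq2):
    """Classify a1*x + b1*y = c1 and a2*x + b2*y = c2, each given as (a, b, c).

    Hypothetical helper; assumes each equation really is a line.
    """
    a1, b1, c1 = eq1
    a2, b2, c2 = eq2
    # The lines have different gradients exactly when a1*b2 - a2*b1 != 0.
    if a1 * b2 - a2 * b1 != 0:
        return "one solution"
    # Parallel lines: they are the same line when one equation is a
    # multiple of the other (elimination would leave 0 = 0).
    if a1 * c2 - a2 * c1 == 0 and b1 * c2 - b2 * c1 == 0:
        return "infinitely many solutions"
    # Otherwise elimination leaves a false statement such as 2 = 5.
    return "no solutions"


print(classify_system((2, 1, 4), (1, -1, -1)))  # one solution
print(classify_system((1, 1, 1), (2, 2, 2)))    # infinitely many solutions
print(classify_system((1, 1, 1), (1, 1, 3)))    # no solutions
```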
2.3.3 An application of simultaneous equations in economics

Simultaneous equations arise in economics when we consider questions of supply and
demand. In general:
The level of supply, q, for a product depends on the [per-unit] price, p, of the
product. Generally, the level of supply grows as the price increases, and so a line
representing this relationship between p and q must have a positive gradient. We
generally denote the supply line by S.
The level of demand, q, for a product depends on the [per-unit] price, p, of the
product. Generally, the level of demand falls as the price increases, and so a line
representing this relationship between p and q must have a negative gradient. We
generally denote the demand line by D.
The point where the supply and demand lines intersect is called the equilibrium point.
In theory, this is the point where the market stabilises since, at this point, the [per-unit]
price is such that the levels of supply and demand are equal. This is illustrated in
Figure 2.5.
Figure 2.5: Representing the supply, S, and demand, D, by lines, the equilibrium point
is where the two lines intersect. Notice that, at this point, the [per-unit] price, p, is such
that the levels of supply and demand, q, are equal.
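To see how an equilibrium point might be computed, here is a small Python sketch. Everything in it is hypothetical: the function name equilibrium, the (intercept, slope) representation q = intercept + slope × p, and the demand and supply lines q = 100 − 2p and q = 3p − 50 are invented for illustration and are not taken from the activities or exercises.

```python
from fractions import Fraction


def equilibrium(demand, supply):
    """Return (p, q) where the demand and supply lines intersect.

    Each line is given as (intercept, slope), meaning q = intercept + slope * p.
    Hypothetical helper for illustration only.
    """
    d0, d1 = (Fraction(v) for v in demand)
    s0, s1 = (Fraction(v) for v in supply)
    if d1 == s1:
        raise ValueError("parallel lines: no unique equilibrium")
    # Setting d0 + d1*p equal to s0 + s1*p and solving for p.
    p = (d0 - s0) / (s1 - d1)
    q = d0 + d1 * p
    return p, q


# Invented example: demand q = 100 - 2p (negative gradient),
# supply q = 3p - 50 (positive gradient).
p, q = equilibrium((100, -2), (-50, 3))
print(p, q)  # 30 40
```

At the price p = 30 both lines give q = 40, so supply and demand are equal there.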
Activity 2.12 If demand is given by the equation 2q + 5p = 500 and supply is given
by 3q = 25 + 7p, what are the equilibrium price and quantity?
Learning outcomes
At the end of this unit, you should be able to:
solve linear equations in one variable;
use a linear equation in two variables to draw the corresponding straight line;
find the gradient of a straight line;
find the equation of a straight line from supplied information;
solve simultaneous equations;
solve problems in economics that use this material.
Exercises
Exercise 2.1
For the following linear equations, find two points that are solutions to the equation and
hence draw the straight line.
i. 3x + 4y = 12; iii. x − 2y = 4;
ii. 2x + y = 10; iv. 3y − 2x = 5.
In each case, use your two points to calculate the gradient of the straight line and use
the equation of the straight line to verify that your answer for the gradient is correct.
Exercise 2.2
Draw the straight lines that go through the following pairs of points.
i. (1, 2) and (2, 4); iii. (1, 2) and (3, 2);
ii. (0, −3) and (3, 0); iv. (−2, 3) and (4, 6).
In each case, find the equation of the line you have drawn.
Exercise 2.3
Find the equations of the lines with the following properties.
i. A line that passes through the point (8, −1) and has a gradient of 1/4.
ii. A line with a gradient of −6 and a y-intercept with coordinates (0, 45).

7 11
iii. A line with a gradient of 2 and an x-intercept with coordinates − , 0 .
2
Exercise 2.4
A company increased its weekly production from 20 to 25 units and found that its costs
went up by £800 per week. Assuming that the relationship between costs and
production is linear, find the marginal cost of production.
Given that the original cost was £5,000, find the linear equation that relates the costs
to production.
If the selling price was £200 per unit, how many more units does the company need to
produce in order to break-even?
Exercise 2.5
Solve the following sets of simultaneous equations.
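i. \[ \left. \begin{aligned} x + 2y &= 7 \\ x - 3y &= -3 \end{aligned} \right\} \]
ii. \[ \left. \begin{aligned} 2x + 5y &= 11 \\ 3x + 3y &= 12 \end{aligned} \right\} \]
iii. \[ \left. \begin{aligned} 4x + 2y &= 5 \\ 2x + y &= 2 \end{aligned} \right\} \]
iv. \[ \left. \begin{aligned} 4x + 2y &= 4 \\ 2x + y &= 2 \end{aligned} \right\} \]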
Exercise 2.6
The demand for a product, q, is related to the price, p, by the equation q = 200 − 2p
while suppliers respond to a price of p by supplying an amount, q, given by the equation
q = 3p − 200. Find the equilibrium price and the corresponding level of production.
Unit 3: Review III

Quadratic equations and parabolae
Overview
In this unit we see how algebra allows us to solve quadratic equations by factorising and
completing the square. We then see, more generally, that quadratics can represent
special curves known as parabolae and we see how to sketch them.
Aims
To see how to write quadratics in their factorised and completed square forms.
To see how to use these forms to solve quadratic equations.
To see how to sketch a parabola and find various points of interest.
To see how to solve problems in economics that use this material.

3.1 Quadratic equations

A quadratic equation in one variable, let’s call it x, is an equation of the form
ax² + bx + c = 0,
where a ≠ 0, b and c are constants.[1] As such, we refer to expressions like the one on the
left-hand side of this equation as quadratics and we call the constants a, b and c the
coefficients of the quadratic. In this section we shall investigate several ways in which
we can solve such equations.
[1] Notice that, if a = 0, then we have bx + c = 0 and this is a linear equation. That is, to be a quadratic
equation in x, there must be an x² term in the equation.
3.1.1 Factorising
One way of solving a quadratic equation like
ax² + bx + c = 0
is to factorise it. This involves writing ax² + bx + c as the product of two linear factors,
i.e. we want to ‘put brackets in’ so that we can write
ax² + bx + c = (Ax + B)(Cx + D),
for some constants A, B, C and D. If we can do this, we can then rewrite the quadratic
equation as
(Ax + B)(Cx + D) = 0,
and this helps us because the product on the left-hand side of the expression can only
equal zero if one of the linear factors in the brackets is equal to zero. That is, the
solutions to our quadratic equation must be the solutions to the two linear equations
Ax + B = 0 and Cx + D = 0.
Thus, as A, C ≠ 0,[2] the solutions will be given by
\[ x = -\frac{B}{A} \quad\text{and}\quad x = -\frac{D}{C}, \]
which we can easily find given the constants A, B, C and D. Consequently, we see that
if we can factorise the quadratic in this way, we can easily solve the quadratic equation.
But, how can we go about factorising a quadratic? The basic idea involves the identity
(x + α)(x + β) = x² + (α + β)x + αβ,
which tells us how a certain factorised form, i.e. the left-hand side, is related to a
certain quadratic, i.e. the right-hand side. So, reading this the other way, if we have the
quadratic
x² + (α + β)x + αβ,
we can factorise it by simply taking the numbers α and β to get the factorised form
(x + α)(x + β).
But, of course, the problem is that we do not know what numbers α and β are! So, in
this relatively simple case, where a (the x² coefficient) is one, we will have a quadratic
like
x² + bx + c,
and we need to find the numbers α and β which add together to give us b (as we need
b = α + β) and which multiply together to give us c (as we need c = αβ). Then, if we
can find the numbers α and β that do this, we will have
x² + bx + c = x² + (α + β)x + αβ = (x + α)(x + β),
as the required factorised form. The trick then, is to take the values of b and c, think
carefully about which numbers α and β could add and multiply in the right kind of way,
and hopefully settle on the ones that will make everything work.
However, generally speaking, factorising any given quadratic can be tricky (especially in
the more general case where a, the x² coefficient, is not one) and so this method for
solving quadratic equations is not always that useful. Nevertheless, because it is so nice
when it works, we will consider some examples below before we move on to some more
‘reliable’ methods.
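The search for the numbers α and β can be made systematic when b and c are integers. The sketch below is one possible approach, not the guide's own: the function name factorise_monic is hypothetical, it only looks for integer values of α and β, and it returns None when no such pair exists (which does not mean the quadratic has no solutions).

```python
def factorise_monic(b, c):
    """Look for integers (alpha, beta) with alpha + beta == b and alpha * beta == c,
    so that x^2 + b*x + c == (x + alpha)(x + beta).

    Hypothetical helper; returns None if no integer pair exists.
    """
    if c == 0:
        return (0, b)  # x^2 + b*x = x(x + b)
    # Any integer alpha that works must divide c exactly.
    for alpha in range(-abs(c), abs(c) + 1):
        if alpha != 0 and c % alpha == 0:
            beta = c // alpha
            if alpha + beta == b:
                return (alpha, beta)
    return None


print(factorise_monic(-1, -6))  # (-3, 2), i.e. (x - 3)(x + 2)
print(factorise_monic(-4, 4))   # (-2, -2), i.e. (x - 2)(x - 2)
print(factorise_monic(6, 10))   # None
```

The last call shows the limitation mentioned above: x² + 6x + 10 has no factorisation of this kind with integer α and β, so another method is needed.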
[2] This must be the case since AC = a and, as a ≠ 0, neither A nor C can be zero.
Example 3.1 Solve the quadratic equation x² − x − 6 = 0.
We start by factorising the quadratic x² − x − 6, i.e. we need two numbers that add
together to give us −1 and multiply together to give us −6. A little thought should
convince you that the required numbers are +2 and −3 which means that we have
x² − x − 6 = (x + 2)(x − 3),
as you can easily verify by multiplying out the brackets. This means that we can
rewrite the quadratic equation as
(x + 2)(x − 3) = 0,
so that the solutions will be given by
x + 2 = 0 and x − 3 = 0,
i.e. the solutions are x = −2 and x = 3. When this happens, we say that we have
two distinct solutions.
Activity 3.1 Verify that x = −2 and x = 3 are solutions to the quadratic equation
in Example 3.1 by substituting them into the left-hand side of the equation and
showing that they give zero.
Example 3.2 Solve the quadratic equation x² − 4x + 4 = 0.
We start by factorising x² − 4x + 4, i.e. we need two numbers that add together to
give us −4 and multiply together to give us +4. A little thought should convince you
that the required numbers are −2 and −2 which means that we have
x² − 4x + 4 = (x − 2)(x − 2),
as you can easily verify by multiplying out the brackets. This means that we can
rewrite the quadratic equation as
(x − 2)(x − 2) = 0,
so that the solutions will be given by
x − 2 = 0 and x − 2 = 0,
i.e. both solutions are x = 2. When this happens, we say that x = 2 is a repeated
solution.
Activity 3.2 Verify that x = 2 is a solution to the quadratic equation in
Example 3.2 by substituting it into the left-hand side of the equation and showing
that it gives zero.
Activity 3.3 Solve the quadratic equation x² + 7x + 12 = 0 by factorising.
Unfortunately, as mentioned above, we will meet quadratic equations which are difficult
to solve by factorisation. And for this reason, we now seek a method that will always
work. The method that we will use here requires us to complete the square of the
quadratic instead of factorising it. So we now consider how to perform this procedure
and then, having done this, we will be able to see how to use it to solve quadratic
equations.
3.1.2 Completing the square

We know that, by multiplying out the brackets, we have
(x + k)² = x² + 2kx + k²,
and, since we can write x² + 2kx + k² as (x + k)² in this way, we say that it is a perfect
square. That is, it can be written as something (in this case, x + k) squared and nothing
else.
Example 3.3 The quadratic x² + 6x + 9 is a perfect square because we can write
x² + 6x + 9 = (x + 3)².
But the quadratic x² + 6x + 10 is not a perfect square as we can only write it as
x² + 6x + 10 = (x² + 6x + 9) + 1 = (x + 3)² + 1,
and not as ‘something squared and nothing else’ due to the presence of the ‘+1’ on
the right-hand side.
Now imagine that we have a quadratic expression of the form x² + 2kx and we want to
complete the square. That is, we want to find something that we can add to this
expression in order to get a perfect square. The idea is that:
if we add k² to x² + 2kx we get x² + 2kx + k²
and, as before, this is now a perfect square because
x² + 2kx + k² = (x + k)²,
meaning that we have ‘completed the square’ on x² + 2kx by adding k² to it.
But, what does this tell us about x² + 2kx, our original quadratic expression? Using this
result, we can take the k² on the left-hand side over to the right-hand side in order to get
x² + 2kx = (x + k)² − k²,
where, on the right-hand side, we now have a perfect square, i.e. (x + k)², plus
something which doesn’t depend on x, i.e. −k². When we write x² + 2kx in this way, we
have what we call its completed square form.
Example 3.4 The quadratic expression x² − 4x can be made into a perfect square
by adding (−2)² = 4 to it, i.e.
x² − 4x + 4 = (x − 2)².
Consequently, we can write x² − 4x as
x² − 4x = (x − 2)² − 4,
which is its completed square form.
To complete the square on a more complicated quadratic expression, say
x² + 2kx + c,
we can write x² + 2kx in completed square form, as before, to get
x² + 2kx = (x + k)² − k²,
and then note that
x² + 2kx + c = (x² + 2kx) + c = [(x + k)² − k²] + c = (x + k)² + (c − k²),
which is the completed square form of x² + 2kx + c since, on the right-hand side, we
now have a perfect square, i.e. (x + k)², plus something which doesn’t depend on x, i.e.
c − k².
Example 3.5 Find the completed square form of x² − 4x + 3.
We note that, if we just had x² − 4x we would just add 4 to it to get
x² − 4x + 4 = (x − 2)²,
as before. Again, this means that we have
x² − 4x = (x − 2)² − 4,
and so we can write
x² − 4x + 3 = (x² − 4x) + 3 = [(x − 2)² − 4] + 3 = (x − 2)² − 1,
which is the desired completed square form.
We can find the completed square form of even more complicated quadratic expressions,
like ax² + 2kx + c, by using brackets to break the expression down into simpler parts as
we did above.
Example 3.6 Find the completed square form of −2x² + 8x + 10.
We start by putting in brackets so that we can work with something which is similar
to what we saw above, i.e. we want a quadratic expression where the x² coefficient is
one. This means that we want to write
−2x² + 8x + 10 = −2(x² − 4x) + 10.
Now, from the example above we know that
x² − 4x = (x − 2)² − 4,
and so this means that we have
−2x² + 8x + 10 = −2(x² − 4x) + 10 = −2[(x − 2)² − 4] + 10 = −2(x − 2)² + 8 + 10 = −2(x − 2)² + 18,
which is the desired completed square form.
Activity 3.4 Verify that, in the previous examples, the completed square form of
the expression is indeed equal to the original expression by multiplying out the
brackets.
Activity 3.5 Find the completed square form of −2x² + 4x + 8 and verify that your
answer is correct by multiplying out the brackets.
3.1.3 Using the completed square form to solve quadratic equations
Now that we can complete the square, we can see how we can use it to solve quadratic
equations. The advantage is that, unlike with factorising, we will always know how
to complete the square and so we will always be able to use it to solve the quadratic
equation! The method is probably best illustrated with an example.
Example 3.7 Solve the quadratic equation x² − 4x = 0 by completing the square.
We saw earlier that we can write x² − 4x as
x² − 4x = (x − 2)² − 4,
in completed square form. This means that the quadratic equation we have to solve is
(x − 2)² − 4 = 0.
This is easily rearranged to get
(x − 2)² = 4,
and then, if we take the square root of both sides, we get
x − 2 = ±2.
Hence, the solutions to our quadratic equation are given by x = 2 ± 2, i.e. x = 4 and
x = 0.
To verify that these are the solutions, we could substitute these values into the
left-hand side of the equation and check that we get zero. Or, alternatively, we can
verify our answer by solving this quadratic equation by factorising, i.e.
x² − 4x = 0 ⟹ x(x − 4) = 0 ⟹ x = 0 and x = 4,
as before.
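Example 3.8 Solve the quadratic equation x² − 4x + 3 = 0 by completing the
square.
We saw earlier that we can write x² − 4x + 3 as
x² − 4x + 3 = (x − 2)² − 1
in completed square form. This means that the quadratic equation we have to solve
is the same as
(x − 2)² − 1 = 0.
This is easily rearranged to get
(x − 2)² = 1,
and then, if we take the square root of both sides, we get
x − 2 = ±1.
Hence, the solutions to our quadratic equation are given by x = 2 ± 1, i.e. x = 3 and
x = 1.
To verify that these are the solutions, we can solve this quadratic equation by
factorising, i.e.
x² − 4x + 3 = 0 ⟹ (x − 1)(x − 3) = 0 ⟹ x = 1 and x = 3,
as before.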
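Example 3.9 Solve the quadratic equation −2x² + 8x + 10 = 0 by completing the
square.
We saw earlier that we can write −2x² + 8x + 10 as
−2x² + 8x + 10 = −2(x − 2)² + 18
in completed square form. This means that the quadratic equation we have to solve
is the same as
−2(x − 2)² + 18 = 0.
This is easily rearranged to get
(x − 2)² = 9,
and then, if we take the square root of both sides, we get
x − 2 = ±3.
Hence, the solutions to our quadratic equation are given by x = 2 ± 3, i.e. x = 5 and
x = −1.
To verify that these are the solutions, we can solve this quadratic equation by
factorising, i.e.
−2x² + 8x + 10 = 0 ⟹ x² − 4x − 5 = 0 ⟹ (x − 5)(x + 1) = 0 ⟹ x = 5 and x = −1,
as before.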
Activity 3.6 In Examples 3.1 and 3.2, we solved the quadratic equations
x² − x − 6 = 0 and x² − 4x + 4 = 0
by factorising. Verify your answers by solving them by completing the square.
Activity 3.7 In Activity 3.3, you were asked to solve the quadratic equation
x² + 7x + 12 = 0 by factorising. Verify your answer by solving it by completing the
square.
3.1.4 Warning!
So far, we have looked at the solutions to several quadratic equations and we have
found that there can be either two distinct solutions or one repeated solution. But, this
is not always the case! Consider the quadratic equation
ax² + bx + c = 0,
for some numbers a, b and c. When written in completed square form this will give us
a(x + p)² − q = 0,
for some numbers p and q. Now, this can be rearranged to get
\[ (x + p)^2 = \frac{q}{a}, \]
and we would then take the square root of both sides of this equation to find its
solutions. But:
If q/a > 0, we will get two distinct [real] solutions, i.e. x = −p ± √(q/a).
If q/a = 0, we will get one repeated [real] solution, i.e. x = −p ± 0 = −p ‘twice’.
If q/a < 0, we will get no [real] solutions as the square root of a negative number
does not exist!
Consequently, we can see that a quadratic equation can have two, one or no [real]
solutions depending on what happens when we rearrange the completed square form.[3]
We shall investigate the consequences of this observation in the following sections.
Activity 3.8 (Hard)
By finding the completed square form of ax² + bx + c show that when
ax² + bx + c = a(x + p)² − q,
the formulae
\[ p = \frac{b}{2a} \quad\text{and}\quad q = \frac{b^2}{4a} - c, \]
tell us the values of p and q.
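The formulae quoted in Activity 3.8 can be sanity-checked numerically before attempting the derivation. The following Python sketch is our own illustration: the function name completed_square is hypothetical, and it simply evaluates the two formulae with exact fractions and then compares both forms of the quadratic at a few values of x. A numerical check of this kind is not a substitute for the algebraic argument the activity asks for.

```python
from fractions import Fraction


def completed_square(a, b, c):
    """Return (p, q) such that a*x**2 + b*x + c == a*(x + p)**2 - q.

    Hypothetical helper using the formulae p = b/(2a), q = b^2/(4a) - c.
    """
    if a == 0:
        raise ValueError("not a quadratic: a must be non-zero")
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    p = b / (2 * a)
    q = b * b / (4 * a) - c
    return p, q


# The quadratic from Example 3.6: -2x^2 + 8x + 10.
p, q = completed_square(-2, 8, 10)
print(p, q)  # -2 -18

# Compare the original and completed square forms at a few sample points.
for x in (-3, 0, 1, 5):
    assert -2 * x**2 + 8 * x + 10 == -2 * (x + p) ** 2 - q
```

With p = −2 and q = −18 we recover −2(x − 2)² + 18, the completed square form found in Example 3.6.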
3.1.5 The quadratic formula

Another way of solving quadratic equations is by using the quadratic formula which is
as follows.
Quadratic formula
The quadratic equation
ax² + bx + c = 0,
with a ≠ 0, has solutions given by
\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
This formula and its use should be familiar to everyone and so we will only give one
example of its use.
Example 3.10 Solve the quadratic equation
3x² + 22x + 24 = 0,
by using the quadratic formula.
Comparing the quadratic equation
3x² + 22x + 24 = 0 with ax² + bx + c = 0,
we see that we have a = 3, b = 22 and c = 24. Putting these numbers into the
quadratic formula
\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}, \]
then gives us
\[ x = \frac{-22 \pm \sqrt{22^2 - 4(3)(24)}}{2(3)} = \frac{-22 \pm \sqrt{484 - 288}}{6} = \frac{-22 \pm \sqrt{196}}{6} = \frac{-22 \pm 14}{6}, \]
so that taking the ‘+’ from the ‘±’ we have
\[ x = \frac{-22 + 14}{6} = -\frac{8}{6} = -\frac{4}{3}, \]
and taking the ‘−’ from the ‘±’ we have
\[ x = \frac{-22 - 14}{6} = -\frac{36}{6} = -6. \]
That is, the solutions to this quadratic equation are x = −4/3 and x = −6.
[3] We shall hear more about ‘real’ numbers in Unit 4.
Activity 3.9 In Examples 3.7, 3.8 and 3.9, we solved the quadratic equations
x² − 4x = 0, x² − 4x + 3 = 0 and −2x² + 8x + 10 = 0,
by completing the square. Use the quadratic formula to verify your answers.
You should also note that this formula comes from our method of solving quadratic
equations by completing the square. Indeed, the conditions from Section 3.1.4 for two,
one or no solutions can also be written as:
If b² − 4ac > 0, we will get two distinct [real] solutions from the quadratic formula.
If b² − 4ac = 0, we will get one repeated [real] solution from the quadratic formula.
If b² − 4ac < 0, we will get no [real] solutions from the quadratic formula.[4]
We also note in passing that the quantity b² − 4ac is called the discriminant.
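The three cases for the discriminant translate directly into a short program. This is a minimal sketch with a function name of our own choosing (solve_quadratic); it works with floating-point numbers, so results for other inputs may carry small rounding errors.

```python
import math


def solve_quadratic(a, b, c):
    """Return a sorted list of the distinct real solutions of a*x**2 + b*x + c = 0.

    Hypothetical helper: the list has two, one or no entries depending on
    the sign of the discriminant b**2 - 4*a*c.
    """
    if a == 0:
        raise ValueError("not a quadratic: a must be non-zero")
    disc = b * b - 4 * a * c
    if disc < 0:
        return []                      # no real solutions
    if disc == 0:
        return [-b / (2 * a)]          # one repeated solution
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


print(solve_quadratic(3, 22, 24))  # two solutions: -6 and -4/3 (Example 3.10)
print(solve_quadratic(1, -4, 4))   # [2.0], the repeated solution of Example 3.2
print(solve_quadratic(1, 6, 10))   # [], since the discriminant is negative
```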
Activity 3.10 (Hard)
Solve the quadratic equation a(x + p)² − q = 0 and then, using the results of
Activity 3.8, derive the quadratic formula.
3.2 Parabolae
In Unit 2 we saw that if we had a linear equation in two variables, say ax + by = c, this
represented a straight line. Indeed, we saw that oblique straight lines had equations of
the form
y = mx + k,
with m ≠ 0. We now turn our attention to the curves which are represented by
equations of the form
y = ax² + bx + c,
where a ≠ 0 so that we can be sure that we are dealing with a quadratic expression in
x. Two such curves, called parabolae, are illustrated in Figure 3.1. Observe that,
unlike straight lines, parabolae have a minimum (like the point with coordinates
(2, −1) in Figure 3.1(a)) or a maximum (like the point with coordinates (1, 4) in
Figure 3.1(b)). Points like these, where the curve ‘stops going down’ or ‘stops going up’
are called turning points.
[4] Again, this is because the square root of a negative number does not exist!
Figure 3.1: Two parabolae and their ‘key features’. (a) The parabola with equation
y = x² − 4x + 3 has a minimum with coordinates (2, −1), the y-intercept is y = 3 and the
x-intercepts are x = 1 and x = 3. (b) The parabola with equation y = −x² + 2x + 3
has a maximum with coordinates (1, 4), the y-intercept is y = 3 and the x-intercepts are
x = −1 and x = 3.
3.2.1 Sketching parabolae

In Unit 2, we saw how to plot a straight line if we are given its equation. The idea in the
case of plotting is that you calculate the coordinates of some number of points that
satisfy the equation and join these points up to get the straight line. However, in this
course, we will generally be sketching curves as opposed to plotting them. As its name
may suggest, a sketch differs from a plot in that the former need only represent the ‘key
features’ of a curve so that we can understand its ‘shape’ and how it is related to our
axes (and, if necessary, other curves) whereas the latter will generally be a much more
accurate drawing of it (which we could use, say, to infer values of certain quantities).
Indeed, as we saw in Unit 2, the ‘key features’ which are needed to sketch a straight line
are simply its x and y-intercepts.
In this section we shall see how to sketch parabolae and, in particular, we will see that
the ‘key features’ of a parabola are its y-intercept, its x-intercepts (if any) and the
coordinates of its maximum or minimum. Indeed, as instructive examples of how to do
this, we will go through the calculations that would enable us to draw the sketches in
Figure 3.1.
Example 3.11 Sketch the parabola whose equation is y = x² − 4x + 3.
We start by noting that, following on from Example 3.5, we can write the equation
of this parabola as
y = (x − 2)² − 1,
in completed square form. This will enable us to find the ‘key features’ of the
parabola as follows.
The y-intercept of the parabola occurs when x = 0 and so, substituting x = 0
into the original form of its equation we get y = 3 as the y-intercept.
The x-intercepts of the parabola occur when y = 0 and so we have to solve the
quadratic equation
x² − 4x + 3 = 0,
which, as we saw in Example 3.8, is easily done if we use the completed square
form. Thus, as we saw there, the solutions to the quadratic equation above are
x = 1 and x = 3 which means that these values of x are the x-intercepts.
The turning point of the parabola can be found by using the completed square
form of its equation. In this case, as we know that (x − 2)² ≥ 0 for all [real]
values of x we can see that
(x − 2)² ≥ 0 ⟹ (x − 2)² − 1 ≥ −1 ⟹ y ≥ −1,
and so, as y must always be greater than or equal to −1, this must be the
minimum value of y which occurs when (x − 2)² = 0. Thus, the turning point is
a minimum with coordinates (2, −1).
With this information, we can plot the ‘key features’ of the parabola on the axes and
draw a nice parabolic shape through them to get the sketch in Figure 3.1(a).
Let’s now consider an example where we haven’t done most of the work before.
Example 3.12 Sketch the parabola whose equation is y = −x² + 2x + 3.
We start by finding the completed square form of the equation of the parabola. This
can be found by writing
y = −x² + 2x + 3 = −(x² − 2x) + 3,
so that, because
x² − 2x + 1 = (x − 1)²,
we get
y = −[(x − 1)² − 1] + 3 = −(x − 1)² + 1 + 3 = −(x − 1)² + 4,
in completed square form. This enables us to find the ‘key features’ of the parabola
as follows.
The y-intercept of the parabola occurs when x = 0 and so, substituting x = 0
into the original form of its equation we get y = 3 as the y-intercept.
The x-intercepts of the parabola occur when y = 0 and so we have to solve the
quadratic equation
−x² + 2x + 3 = 0.
But, using the completed square form, this gives us
−(x − 1)² + 4 = 0 ⟹ (x − 1)² = 4 ⟹ x − 1 = ±2 ⟹ x = 1 ± 2.
Thus, the solutions to the quadratic equation above are x = 3 and x = −1
which means that these values of x are the x-intercepts.
The turning point of the parabola can be found by using the completed square
form of its equation. In this case, as we know that (x − 1)² ≥ 0 for all [real]
values of x we can see that
−(x − 1)² ≤ 0 ⟹ −(x − 1)² + 4 ≤ 4 ⟹ y ≤ 4,
and so, as y must always be less than or equal to 4, this must be the maximum
value of y which occurs when (x − 1)² = 0. Thus, the turning point is a
maximum with coordinates (1, 4).
With this information, we can plot the ‘key features’ of the parabola on the axes and
draw a nice parabolic shape through them to get the sketch in Figure 3.1(b).
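The calculations in Examples 3.11 and 3.12 follow the same pattern, so they can be collected into one routine. The sketch below is an illustration under our own assumptions: the function name key_features and the dictionary it returns are hypothetical, the turning point is obtained from the completed square form, and floating-point arithmetic is used throughout.

```python
import math


def key_features(a, b, c):
    """Key features of the parabola y = a*x**2 + b*x + c (with a != 0).

    Hypothetical helper returning what is needed for a sketch.
    """
    if a == 0:
        raise ValueError("not a parabola: a must be non-zero")
    # Completed square form: y = a*(x - h)**2 + k, turning point (h, k).
    h = -b / (2 * a)
    k = c - b * b / (4 * a)
    disc = b * b - 4 * a * c
    if disc < 0:
        x_intercepts = []              # the curve never meets the x-axis
    elif disc == 0:
        x_intercepts = [h]             # the curve touches the x-axis once
    else:
        r = math.sqrt(disc) / (2 * abs(a))
        x_intercepts = [h - r, h + r]
    return {
        "y_intercept": c,              # the value of y when x = 0
        "x_intercepts": x_intercepts,
        "turning_point": (h, k),
        "kind": "minimum" if a > 0 else "maximum",
    }


print(key_features(1, -4, 3))   # minimum at (2, -1), x-intercepts 1 and 3
print(key_features(-1, 2, 3))   # maximum at (1, 4), x-intercepts -1 and 3
```

The two calls reproduce the key features shown in Figure 3.1(a) and Figure 3.1(b) respectively.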
One thing to note from what we have seen so far is the result of the following activity.
Activity 3.11 From Activity 3.8, we know that
ax² + bx + c = a(x + p)² − q,
for certain values of p and q. Using this fact, explain why the turning point of the
parabola
y = ax² + bx + c
will have coordinates (−p, −q) and why it will be
a minimum if a > 0, and
a maximum if a < 0.
In particular, observe how the sign of a determines whether the parabola has a
maximum or a minimum.
Activity 3.12 Sketch the parabola whose equation is y = −x² + 4x.
3.2.2 Where do a parabola and a straight line intersect?

In Unit 2, we saw how to find the point where two straight lines intersected. We now
consider how we would go about finding the point(s) where a line and a parabola
intersect by looking at an example.
62
Example 3.13 Find the points of intersection of the parabola y = x2 − 4x + 3 and

the straight line y = −x + 3.
To find the points of intersection, we want to find the values of x that make the
values of y from both equations the same, i.e. we seek the values of x that satisfy the
equation
$x^2 - 4x + 3 = -x + 3.$
But, in this case, these are easily found because we can rearrange this to get a
quadratic equation which is particularly easy to solve, namely
$x^2 - 3x = 0 \implies x(x - 3) = 0 \implies x = 0 \text{ or } x = 3.$
Now we know the values of x, we can substitute them back into either equation to
get the corresponding values of y. So, as y = −x + 3 is the easier equation, we use
this to get
$x = 0 \implies y = -0 + 3 = 3 \quad\text{and}\quad x = 3 \implies y = -3 + 3 = 0.$
Thus, the required points of intersection between the parabola and the straight line
have coordinates (0, 3) and (3, 0) as illustrated by the ‘•’s in Figure 3.2(a).
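The method of Example 3.13 can be sketched in code as follows. The function name and its parameters are our own illustration, and we assume the parabola is y = ax² + bx + c with a not zero and the line is y = mx + k; here the quadratic formula stands in for the factorising used in the example.

```python
import math


def intersect_parabola_line(a, b, c, m, k):
    """Return the points where y = a*x**2 + b*x + c meets y = m*x + k."""
    # Setting the two expressions for y equal and rearranging gives
    # a*x**2 + (b - m)*x + (c - k) = 0.
    A, B, C = a, b - m, c - k
    disc = B * B - 4 * A * C
    if disc < 0:
        return []  # no real solutions, so the curves do not meet
    # A set removes the duplicate when the two roots coincide.
    roots = {(-B - math.sqrt(disc)) / (2 * A), (-B + math.sqrt(disc)) / (2 * A)}
    # Substitute each x back into the straight line, the easier equation.
    return sorted((x, m * x + k) for x in roots)


# Example 3.13: y = x**2 - 4x + 3 and y = -x + 3.
print(intersect_parabola_line(1, -4, 3, -1, 3))  # [(0.0, 3.0), (3.0, 0.0)]
```

An empty list means the line misses the parabola, and a single point means the line just touches it.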
Activity 3.13 Consider the parabola $y = -x^2 + 4x$ which we sketched in
Activity 3.12. Find the point(s) of intersection (if any) of this parabola and the
straight lines (a) y = 2x + 1 and (b) y = 2x + 2. Draw sketches of these curves to
illustrate what you find.
3.2.3 Where do two parabolae intersect?

Following on from this, we can also see how we would go about finding the point(s)
where two parabolae intersect by looking at an example.
Example 3.14 Find the points of intersection of the two parabolae $y = x^2 - 4x + 3$
and $y = -x^2 + 2x + 3$.
To find the points of intersection, we want to find the values of x that make the
values of y from both equations the same, i.e. we seek the values of x that satisfy the
equation
$x^2 - 4x + 3 = -x^2 + 2x + 3.$
But, in this case, these are easily found because we can rearrange this to get a
quadratic equation which is particularly easy to solve, namely
$2x^2 - 6x = 0 \implies x^2 - 3x = 0 \implies x(x - 3) = 0 \implies x = 0 \text{ or } x = 3.$
Now we know the values of x, we can substitute them back into either equation to
get the corresponding values of y. So, using $y = x^2 - 4x + 3$, we get
$x = 0 \implies y = 0 - 0 + 3 = 3 \quad\text{and}\quad x = 3 \implies y = 9 - 12 + 3 = 0.$
Figure 3.2: Returning to the parabola $y = x^2 - 4x + 3$ first seen in Figure 3.1(a) we can
see: (a) its two points of intersection with the straight line $y = -x + 3$ and (b) its two
points of intersection with the parabola $y = -x^2 + 2x + 3$ first seen in Figure 3.1(b). In
both cases, the points of intersection are indicated by ‘•’s.
Thus, the required points of intersection between the two parabolae have coordinates
(0, 3) and (3, 0) as illustrated by the ‘•’s in Figure 3.2(b).
Activity 3.14 Consider the parabola $y = -x^2 + 4x$ which we sketched in
Activity 3.12. Find the point(s) of intersection (if any) of this parabola and the
parabolae (a) $y = x^2 + 2$ and (b) $y = x^2 + 3$. Draw sketches of these curves to
illustrate what you find.
Learning outcomes
write a simple quadratic in factorised form;
write any quadratic in completed square form;
solve quadratic equations by factorising, completing the square or using the
quadratic formula;
identify the ‘key features’ of a parabola and use these to draw a sketch;
find the points of intersection of a parabola with a straight line or another parabola.
Exercises
Exercise 3.1
Multiply out the following brackets.
i. $(x + 1)(x + 2)$;
ii. $(x + \frac{1}{3})^2$;
iii. $(3x - 2)(5x + 3)$;
iv. $(3x - 5)(3x + 5)$;
v. $(x - 1)(x - 2)(x + 3)$;
vi. $(x - 3y)(2x + 4y)$.
Exercise 3.2
Factorise the following quadratic expressions.
i. $x^2 - x - 2$;
ii. $x^2 + 3x - 18$;
iii. $2x^2 + 2x - 12$;
iv. $-x^2 + x + 2$.
Exercise 3.3
Solve the following equations. Try factorising first and then completing the square. Use
the quadratic formula only as a last resort!
i. $x^2 = 5$;
ii. $x^2 + 4x - 5 = 0$;
iii. $x^2 + 2x + 3 = 0$;
iv. $x^2 = -7x$;
v. $2x^2 + 5x = 3$;
vi. $5x^2 - 8x + 2 = 0$.
Exercise 3.4
For each of the following, complete the square and then sketch the graph.
i. $y = x^2 - 6x + 5$;
ii. $y = x^2 - 4x + 5$;
iii. $y = -x^2 - 6x + 6$;
iv. $y = 5x^2 - 4x - 1$.
In each case, you should determine the coordinates of the turning point and the x and
y-intercepts.
Exercise 3.5
i. Sketch the parabola given by the equation $y = 6 - x - x^2$ and the straight line
given by the equation y = 2x + 4 on the same set of axes.
ii. By solving the appropriate equation, find the points where the parabola and the
straight line intersect.
iii. Sketch another line, parallel to the first, which only intersects the parabola once
and calculate the y-intercept of this second line.
Exercise 3.6
A company sells its products in a market where the price, p, is linked to the quantity
sold, q, by the demand equation p = 120 − 2q.
i. Calculate the market price, and the revenue, if the company sells 35 units. What is
the revenue in terms of q?
The company incurs fixed costs of 400 and an additional cost of 12 for each unit
produced.
ii. How much will it cost to produce 35 units? What is the total cost in terms of q?
iii. What profit will the company make from producing and selling 35 units? What is
the profit in terms of q?
iv. By completing the square, calculate the number of units that will maximise the
profit. What is the corresponding market price?
4. Functions
Unit 4: Functions
Overview
In this unit we introduce the idea of a function. This will play an important role in the
rest of this course and, in particular, it bridges the gap between what we have seen so
far and what we will see when we look at calculus.
Aims
To introduce the idea of a function.
To introduce some common functions and look at their properties.
To see how functions can be combined and how they can be used in economics.
To introduce the idea of an inverse and see how it can be found.

4.1 Functions
In this unit, we want to introduce the idea of a function which, at the most basic level,
is just a rule that turns an input into an output. In particular, when we talk about
inputs and outputs we mean numbers, or more specifically, real numbers. These can be
thought of in several ways but, essentially, every number that can be written as a
decimal is a real number. Alternatively, we can think of each real number as a point on
a number line (and vice versa) as illustrated in Figure 4.1. Of course, in a way, we have
already seen real numbers represented in this way as the x-axis is just a real number
line which represents all the inputs a rule can have. And, similarly, if we think of the
y-axis as another real number line which represents all the outputs a rule can have, we
may start to appreciate that the curves we have been sketching in Units 2 and 3 are just
ways of visualising how certain rules relate their inputs to their outputs.

Figure 4.1: The central portion of the real number line, from −3 to 3, and some of the
numbers on it (−1/2, √2, e and π). We will encounter the real numbers e and π shortly.
Now that we have an idea of what our inputs and outputs can be, let’s look more
closely at the relationship between rules and functions.
4.1.1 What is a function?

A function is a rule that gives exactly one output for each input. If we represent the
input by the variable x, and call the function f , we can then use f (x) (read ‘f of x’) to
denote the corresponding output. In this way, it is sometimes convenient to think of a
function as a machine (or ‘black box’) by writing
x −→ f −→ f (x),
as this indicates how each input, x, is ‘processed’ by the function f to give the output
f (x). Indeed, observe that we can use any variable to represent the input and so, if we
had used t instead of x, the output would be f (t) and we could write
t −→ f −→ f (t),
to indicate how each input, t, is ‘processed’ by the function f to give the output f (t).
Once we have this notation, we can then capture the effect of any given function on
each input by using an appropriate formula to express the rule.
Example 4.1 Let’s say that the rule we want to capture is ‘square the number and
then add one’. This rule gives us a function, let’s call it f , which can be captured by
the formula $f(x) = x^2 + 1$ which tells us how each input, x, is ‘processed’ by f to
give the output f (x). In particular, we can see that, if x = 1, the output is
$f(1) = 1^2 + 1 = 2$ whereas if the input was x = 2, the output would be
$f(2) = 2^2 + 1 = 5$.
Notice also, that this rule does define a function because every input, x, gives rise to
exactly one output, namely whatever number $x^2 + 1$ turns out to be. And, indeed, if
we had chosen to use the variable t instead of x we would now be using the formula
$f(t) = t^2 + 1$ to capture the effect of this function.
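The ‘machine’ picture translates directly into code. As a small illustrative sketch in Python (our choice of language, not something this guide relies on), the rule of Example 4.1 can be written as:

```python
def f(x):
    """The rule 'square the number and then add one'."""
    return x ** 2 + 1


# The two outputs worked out in Example 4.1.
print(f(1), f(2))  # 2 5
```

As in the example, the name of the input variable is immaterial: renaming `x` to `t` inside the definition would give the same function.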
Activity 4.1 Following on from Example 4.1, find the values of $f(0)$, $f(-1)$ and
$f(\sqrt{2})$.
However, not all rules will give us a function. For instance, if we had the rule ‘take the
square root of the number’, we find that
Negative numbers do not have [real] square roots and so this rule gives us no outputs
when the input is a negative number, i.e. this rule cannot specify a function
because we do not get at least one output for these inputs.
Positive numbers have two square roots and so this rule gives us two outputs when
the input is a positive number, i.e. this rule cannot specify a function because we
do not get at most one output for these inputs.
So, we can see that when looking at whether a rule can define a function, we may need
to take some care when specifying what the inputs are and whether the rule itself
actually satisfies the ‘exactly one output’ requirement.
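To see the ‘exactly one output’ requirement fail in practice, here is a minimal sketch. The helper `square_roots` is a hypothetical name of our own: it lists every real number whose square is the input, so the length of the list tells us how many outputs the rule ‘take the square root of the number’ would give.

```python
import math


def square_roots(x):
    """Return every real number whose square is x."""
    if x < 0:
        return []  # no real square roots: the rule gives no output
    if x == 0:
        return [0.0]  # zero is the one input with exactly one square root
    r = math.sqrt(x)
    return [-r, r]  # two outputs: the rule gives too many


for x in (-4, 0, 9):
    print(x, square_roots(x))
```

Only a list of length one is compatible with a function, which is why the inputs and the choice of root need care.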
In what follows, we will look at some of the most common functions that occur in
mathematics and we will also see how these functions can be combined to make new
functions.
4.1.2 Some common functions

We have already encountered several functions in this course. For instance, we have seen
constant functions: which take the form f (x) = k for some constant k,
linear functions: which take the form f (x) = ax + b for some constants a ≠ 0 and b,
quadratic functions: which take the form $f(x) = ax^2 + bx + c$ for some constants
a ≠ 0, b and c.
In particular, we know what all of these functions look like because we saw how to
sketch them in Units 2 and 3. More generally, these are examples of polynomial
functions because they take the form
$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0,$
for some constants $a_n, a_{n-1}, \ldots, a_1, a_0$. Indeed, if $x^n$ is the highest power in the
polynomial, we say that it has degree n so that constant, linear and quadratic functions
are polynomials of degree zero, one and two respectively. What do polynomial functions
look like? We will be able to answer this question more thoroughly when we look at
curve sketching in Unit 7.
Now, however, we want to introduce some new functions and get some idea of what
they look like.
Exponential functions
Given a positive number a ≠ 1, called the base, an exponential function has the form
$f(x) = a^x,$
and, depending on whether 0 < a < 1 or a > 1, they give us curves like the ones
illustrated in Figure 4.2. In particular, observe that $a^x \neq 0$ for all values of x.
The most important exponential function occurs when the base is the number e which
is approximately 2.71828 (5dp). We will encounter this function, $e^x$, and see some
reasons why it is so special in Units 5 and 9.
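To get a feel for the two cases, we can tabulate a few values. This is only an illustrative sketch: the helper `exponential` and the bases 2 and 0.5 are our own choices, not taken from this guide.

```python
def exponential(a):
    """Return the exponential function with base a (a positive, a != 1)."""
    if a <= 0 or a == 1:
        raise ValueError("the base must be positive and not equal to 1")
    return lambda x: a ** x


grow = exponential(2)     # a base with a > 1
decay = exponential(0.5)  # a base with 0 < a < 1

for x in (-2, -1, 0, 1, 2):
    print(x, grow(x), decay(x))
```

In the printed table the first function increases as x increases, the second decreases, both equal 1 at x = 0, and neither is ever zero.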
Sine and cosine functions
Two other functions that we will be interested in are the sine and cosine functions. You
are probably familiar with these from their use in problems involving triangles since we
know that
$\sin(\theta) = \dfrac{\text{opposite}}{\text{hypotenuse}} \quad\text{and}\quad \cos(\theta) = \dfrac{\text{adjacent}}{\text{hypotenuse}},$
using the right-angled triangle illustrated in Figure 4.3.
Figure 4.2: The exponential function $y = a^x$ when (a) a > 1 and (b) 0 < a < 1. In both
cases the curve crosses the y-axis at y = 1.

Figure 4.3: The sine and cosine functions can be defined in terms of the sides (opposite,
adjacent and hypotenuse) of a right-angled triangle with angle θ.
In this course, however, when we talk about angles, we will measure them in radians
and not degrees. The basic idea here is that π radians, where the number π is
approximately 3.142 (3dp), is the same as 180 degrees and, using this, we can convert
angles in degrees to angles in radians using the formula
$\text{angle in radians} = \dfrac{\pi}{180} \times \text{angle in degrees}.$
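The conversion formula can be sketched as a one-line function. The name `to_radians` is our own; Python's standard library also provides `math.radians`, which performs the same conversion.

```python
import math


def to_radians(degrees):
    """Convert an angle in degrees to radians using the factor pi/180."""
    return math.pi / 180 * degrees


for degrees in (30, 45, 60, 90, 180):
    # The two columns should agree up to rounding.
    print(degrees, to_radians(degrees), math.radians(degrees))
```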
So, if we use the triangles in Figure 4.4 to determine the most important values of these
functions, we would get the results given in the following table.
θ in degrees    θ in radians    sin(θ)    cos(θ)
30              π/6             1/2       √3/2
45              π/4             1/√2      1/√2
60              π/3             √3/2      1/2
Activity 4.2 Verify that the results in the table are correct.
Figure 4.4: The triangles which allow us to find sin(θ) and cos(θ) when (a) θ = π/6 or
θ = π/3 (a right-angled triangle with sides 1, √3 and 2) and (b) θ = π/4 (a right-angled
triangle with sides 1, 1 and √2).
More generally, as illustrated in Figure 4.5, we find that these functions are periodic
with a period of 2π radians, a fact that we could express mathematically by writing
sin(x + 2π) = sin(x) and cos(x + 2π) = cos(x),
and we can also see that the cosine function is just the sine function ‘shifted to the left’
by π/2 radians, a fact that we could express mathematically by writing
$\cos(x) = \sin\left(x + \dfrac{\pi}{2}\right).$
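These identities can be spot-checked numerically. The sketch below uses a helper, `identities_hold`, which is our own illustration; because computers work with rounded values, it compares results up to a small tolerance rather than demanding exact equality.

```python
import math


def identities_hold(x, tol=1e-12):
    """Check periodicity and the sine/cosine shift at the point x."""
    periodic_sin = math.isclose(math.sin(x + 2 * math.pi), math.sin(x), abs_tol=tol)
    periodic_cos = math.isclose(math.cos(x + 2 * math.pi), math.cos(x), abs_tol=tol)
    shifted = math.isclose(math.cos(x), math.sin(x + math.pi / 2), abs_tol=tol)
    return periodic_sin and periodic_cos and shifted


print(all(identities_hold(x) for x in (-1.3, 0.0, 0.5, 2.0)))
```

A check at a handful of points is not a proof, of course, but it is a useful sanity test.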
Observe, in particular, that some other important values of these functions are given in
the following table.
θ in degrees    θ in radians    sin(θ)    cos(θ)
0               0               0         1
90              π/2             1         0
180             π               0         −1
Figure 4.5: The sine and cosine functions for −π ≤ x ≤ 4π. Notice, in particular, that
they are both periodic with period 2π and that the cosine function is just the sine function
‘shifted to the left’ by π/2 radians.
Activity 4.3 What are sin(x) and cos(x) when x is 3π/2? 2π?
4.1.3 Combinations of functions

It is also possible to combine functions in certain ways to get new functions and this
generally works in the obvious way. For instance, if we have a function, f , and a
constant, k, we can get the new function kf , which is called a constant multiple of f , by
using the rule
(kf )(x) = k · f (x),
and, similarly, if we have two functions, f and g, we can get the new function f + g,
which is called the sum of f and g, by using the rule
(f + g)(x) = f (x) + g(x).
This may sound a bit abstract, but the following example should make it clear.
Example 4.2 Suppose that the functions f and g are given by the formulae
$f(x) = x^2 - 4$ and $g(x) = e^x.$
In this case, the function 3f would be given by the formula
$(3f)(x) = 3 \cdot f(x) = 3(x^2 - 4) = 3x^2 - 12,$
i.e. it is just three times f (x), whereas the function f + g would be given by the
formula
$(f + g)(x) = f(x) + g(x) = x^2 - 4 + e^x,$
i.e. it is just the sum of f (x) and g(x).
Indeed, if we have two functions, f and g, and two constants, k and l, we can get the
new function kf + lg, called a linear combination of f and g, by using the rule
(kf + lg)(x) = k · f (x) + l · g(x),
which should be fairly obvious given the two rules above.
Example 4.3 Following on from Example 4.2, the function 2f − g would be given
by the formula
$(2f - g)(x) = 2f(x) + (-1)g(x) = 2(x^2 - 4) + (-1)e^x = 2x^2 - 8 - e^x,$
as we can think of 2f − g as 2f + (−1)g.
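The rules above build new functions out of old ones, and the same can be done in code. In this sketch the helper names (`constant_multiple`, `add`, `linear_combination`) are our own, and `ell` stands in for the constant l.

```python
import math


def constant_multiple(k, f):
    """Return the function kf, where (kf)(x) = k * f(x)."""
    return lambda x: k * f(x)


def add(f, g):
    """Return the function f + g, where (f + g)(x) = f(x) + g(x)."""
    return lambda x: f(x) + g(x)


def linear_combination(k, f, ell, g):
    """Return the function kf + lg, where (kf + lg)(x) = k*f(x) + l*g(x)."""
    return lambda x: k * f(x) + ell * g(x)


def f(x):
    return x ** 2 - 4  # f from Example 4.2


g = math.exp  # g(x) = e**x from Example 4.2

two_f_minus_g = linear_combination(2, f, -1, g)
x = 1.5
# Both numbers should agree: the combination and the formula from Example 4.3.
print(two_f_minus_g(x), 2 * x ** 2 - 8 - math.exp(x))
```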
√ 4.4 Following on from Example 4.3, find the formulae for the functions
Activity
−f , 2g, f − g, −9f + 2g.
Activity 4.5 Explain how the linear combination rule can be obtained from the
constant multiple and sum rules.
If f and g are functions, write down the rule which would give us the new function
f − g, called the difference of f and g.
Products and quotients
Two other ways of combining functions are products and quotients. The former, as its
name may suggest, is what we get when we have two functions, f and g, and we
multiply them together to get the new function f · g, called the product of f and g, by
using the rule
(f · g)(x) = f (x) · g(x).
Similarly, if we divide f by g we get the new function f /g, called the quotient of f and
g, by using the rule
$\left(\dfrac{f}{g}\right)(x) = \dfrac{f(x)}{g(x)}.$
It is important to observe, however, that this last rule can only be used if we have
values of x where g(x) ≠ 0. In particular, if g(x) = 0 at some value of x, the new
function f /g is undefined at that value of x because division by zero is never allowed.
Example 4.4 Following on from Example 4.2, the function f · g would be given by
the formula
$(f \cdot g)(x) = f(x) \cdot g(x) = (x^2 - 4)\,e^x,$
i.e. it is just f times g, whereas the function f /g would be given by the formula
$\left(\dfrac{f}{g}\right)(x) = \dfrac{f(x)}{g(x)} = \dfrac{x^2 - 4}{e^x},$
i.e. it is just f divided by g. Notice, in particular, that this quotient is defined for all
values of x because $e^x$ is never equal to zero.
Activity 4.6 Following on from Example 4.4, verify that the function g · f is the
same as the function f · g.
Find the formula for the function g/f . For which inputs is this function not defined?
Compositions
The last way of combining functions that we will consider is their composition. If we
have two functions, f and g, then we can get the new function f ◦ g, which is the
composition we get when we apply f after applying g, by using the rule
(f ◦ g)(x) = f (g(x)),
provided that it makes sense to apply the rule for f to each output, g(x), of g. Indeed,
to see what is happening here, it is useful to think of these functions as machines again so
that we can represent this composition as
x −→ g −→ g(x) −→ f −→ f (g(x)),
and, in this way, we see that it can only make sense if g(x) is giving us an input for f
that allows us to get its output f (g(x)).
Of course, in a similar manner, we can also get the new function g ◦ f , which is the
composition we get when we apply g after applying f , by using the rule
(g ◦ f )(x) = g(f (x)),
provided that it makes sense to apply the rule for g to each output, f (x), of f . In this
case, we could represent the composition as
x −→ f −→ f (x) −→ g −→ g(f (x)),
and, in this way, we see that it can only make sense if f (x) is giving us an input for g
that allows us to get its output g(f (x)). In particular, observe that the functions f ◦ g
and g ◦ f are usually different as we can see in the next example.
Example 4.5 Following on from Example 4.2, the function f ◦ g would be given by
the formula
$(f \circ g)(x) = f(g(x)) = f(e^x) = (e^x)^2 - 4 = e^{2x} - 4,$
whereas the function g ◦ f would be given by the formula
$(g \circ f)(x) = g(f(x)) = g(x^2 - 4) = e^{x^2 - 4}.$
Notice, in particular, that these functions are not the same!
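Composition is simply ‘feed the output of one machine into the next’, which the following sketch mirrors. The helper `compose` is our own name; the functions f and g are those of Example 4.2.

```python
import math


def compose(outer, inner):
    """Return the composition: apply inner first, then outer."""
    return lambda x: outer(inner(x))


def f(x):
    return x ** 2 - 4


g = math.exp  # g(x) = e**x

f_after_g = compose(f, g)  # x -> g -> g(x) -> f -> f(g(x))
g_after_f = compose(g, f)  # x -> f -> f(x) -> g -> g(f(x))

x = 1.0
print(f_after_g(x), math.exp(2 * x) - 4)   # both should agree
print(g_after_f(x), math.exp(x ** 2 - 4))  # both should agree
```

Comparing the two printed rows shows that the order of composition matters here.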
Activity 4.7 Suppose that the functions f and g are given by the formulae
$f(x) = x - 1$ and $g(x) = \sqrt{x}.$
Find the formulae for the compositions f ◦ g and g ◦ f . For which inputs is the latter
function not defined?
A last word on combinations of functions
So far, we have seen how to combine certain functions in different ways to get new
functions. However, when we come to look at calculus, we will also need to do this ‘in
reverse’, i.e. we will need to be able to look at a function and see how it has been
constructed by combining other, simpler functions. This is usually quite straightforward
as illustrated in the following example.
Example 4.6 The function given by
$e^x \sin(x),$
is the product, f · g, of the functions f and g where
$f(x) = e^x$ and $g(x) = \sin(x),$
whereas the function
$(x^2 + 1)^2,$
is the composition, f ◦ g, of the functions f and g where
$f(x) = x^2$ and $g(x) = x^2 + 1,$
since it is just f (g(x)).
Activity 4.8 Find two functions f and g that can be combined to get the functions
(i) $x^2 e^x$, (ii) $\dfrac{x^2}{e^x}$, (iii) $e^{2x}$, (iv) $e^{2x} + 3e^x + 1$.
In each case, also indicate the kind of combination that you have found.
4.1.4 Functions in economics

Functions are widely used in economics and one particularly important example occurs
when we consider supply and demand like we did in Section 2.3.3. In particular, if we
have a supply equation which can be written in the form $q = q_S(p)$, then we call $q_S$ the
supply function whereas if we have a demand equation which can be written in the form
$q = q_D(p)$, then we call $q_D$ the demand function. With these functions, we can then see
that the equilibrium price, i.e. the price that makes the quantity supplied equal to the
quantity demanded, can be found by solving the equation
$q_S(p) = q_D(p),$
and then we can use either of these functions to find the corresponding equilibrium
quantity.
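As a sketch of this procedure, suppose (purely for illustration, as these functions do not come from this guide) that supply and demand are the linear functions q_S(p) = 2p − 10 and q_D(p) = 50 − p. Setting them equal gives a linear equation for the equilibrium price.

```python
def q_supply(p):
    """Hypothetical supply function q_S(p) = 2p - 10."""
    return 2 * p - 10


def q_demand(p):
    """Hypothetical demand function q_D(p) = 50 - p."""
    return 50 - p


def equilibrium_price(a, b, c, d):
    """Solve a + b*p = c - d*p for p, assuming b + d is not zero."""
    return (c - a) / (b + d)


# Supply is -10 + 2p and demand is 50 - 1p, so a = -10, b = 2, c = 50, d = 1.
p_star = equilibrium_price(-10, 2, 50, 1)
print(p_star, q_supply(p_star), q_demand(p_star))  # 20.0 30.0 30.0
```

Both functions give the same quantity at the equilibrium price, as they must.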
Activity 4.9 In Activity 2.12, supply was given by the equation 3q = 25 + 7p and
demand was given by the equation 2q + 5p = 500. Find the supply and demand
functions.
Use these functions to find the equilibrium price and quantity.
We will also encounter other functions of economic significance. For instance, if a
company manufactures an amount, q, of some product then the money it makes from
selling this amount is given by its revenue function, R(q), whereas the money spent on
producing this amount is given by its cost function, C(q). The difference between these
two functions then gives us the firm’s profit function,
π(q) = R(q) − C(q),
which, for a given value of q, may be positive or negative meaning that the firm is
making a profit or a loss respectively.¹
Activity 4.10 A company sells each unit of its product for £4. What is its revenue
function?
If its profit function is given by $\pi(q) = -q^2 + 6q - 4$, what is its cost function?
4.2 Inverse functions

We have seen that a function, f , is a rule that gives exactly one output for each input
and we can think of this by writing
x −→ f −→ f (x).
Now, we want to consider the circumstances under which we can ‘reverse’ this process.
That is, under what circumstances can we find a function, which we will call $f^{-1}$, whose
job can be thought of by writing
$x \longleftarrow f^{-1} \longleftarrow f(x).$

¹ Notice, in particular, that we use the Greek letter ‘π’ to denote the profit function as we are already
using ‘p’ to denote prices.

In particular, if we can find such a function, called the inverse of f , we see that it takes
the original outputs, f (x), as inputs and gives us the corresponding original inputs, x,
as outputs.² Indeed, we will find that some functions have an inverse whereas others do
not, unless we take some care with the inputs and outputs that we are considering.
Another thing to notice is that, if an inverse function exists, then the composition of a
function and its inverse gives us a function which takes an input and gives us this very
same input as its output. To see this, consider that the composition $f^{-1} \circ f$ can be
represented as
$x \longrightarrow f \longrightarrow f(x) \longrightarrow f^{-1} \longrightarrow x,$
and so we should always find that, if the inverse exists,
$(f^{-1} \circ f)(x) = f^{-1}(f(x)) = x,$
whereas the composition $f \circ f^{-1}$ can be represented as
$y \longrightarrow f^{-1} \longrightarrow f^{-1}(y) \longrightarrow f \longrightarrow y,$
and so we should always find that, if the inverse exists,
$(f \circ f^{-1})(y) = f(f^{-1}(y)) = y.$
In particular, notice that this is one of the few cases where the composition gives us the
same function regardless of the order in which we perform the composition. As we shall
see, the fact that the composition of a function and its inverse must behave in this way
will provide us with a useful way of verifying that we have found the correct formula for
an inverse function!
4.2.1 Finding inverse functions

Suppose that we have a function, f , and given an input x, we take y = f (x) to be the
output. Written in this form, the inputs are related to the outputs by the equation
y = f (x) which tells us y in terms of x. To find the inverse function, we need to find x
in terms of y and, if this gives us exactly one value of x for each value of y that we are
considering, then we have found the inverse function, $f^{-1}$. In particular, when written
in this new form, we now have the equation $x = f^{-1}(y)$ and so we can identify the
formula for $f^{-1}$. Let’s look at an example.
² Note that we are using ‘$f^{-1}$’ to denote a new function which does a specific job and it is not to be
thought of as ‘f to the power of −1’! In particular, if we wanted to think of ‘f to the power of −1’ we
would surely mean ‘1/f (x)’ which we would call the ‘reciprocal of f ’ or ‘1/f ’. As such, the inverse of a
function, say $f^{-1}$, and its reciprocal, say 1/f, are two completely different things!

Example 4.7 Suppose that the function f is given by the formula
$f(x) = 2x + 3.$
If we set y = f (x), this gives us
$y = 2x + 3,$
as the equation which relates the inputs, x, of f to its outputs, y. Rearranging this
to find x in terms of y, we find that
$y = 2x + 3 \implies 2x = y - 3 \implies x = \dfrac{y - 3}{2},$
and this equation now relates the outputs, y, of f to its inputs, x. Moreover, as each
value of y will give exactly one value of x, the inverse function exists and so,
thinking of this equation as $x = f^{-1}(y)$, we can then deduce that
$f^{-1}(y) = \dfrac{y - 3}{2},$
is the formula for the inverse of f.
In particular, notice that we can verify that this is correct by noting that
$(f^{-1} \circ f)(x) = f^{-1}(f(x)) = f^{-1}(2x + 3) = \dfrac{(2x + 3) - 3}{2} = \dfrac{2x}{2} = x,$
and, indeed, that
$(f \circ f^{-1})(y) = f(f^{-1}(y)) = f\left(\dfrac{y - 3}{2}\right) = 2\left(\dfrac{y - 3}{2}\right) + 3 = (y - 3) + 3 = y,$
as we should expect given our discussion above.
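The verification in Example 4.7 can also be spot-checked numerically. This is only a sketch: `f_inverse` is our own name for the inverse, and the sample values are arbitrary.

```python
def f(x):
    return 2 * x + 3


def f_inverse(y):
    """The inverse found in Example 4.7: undo '+3', then undo 'times 2'."""
    return (y - 3) / 2


for value in (-2, 0, 1.5, 10):
    # Applying f and then its inverse (or the other way round) returns the input.
    print(value, f_inverse(f(value)), f(f_inverse(value)))
```

Each row should show the same number three times, which is what the two composition identities predict.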
Indeed, thinking back to our discussion of supply and demand functions in Section 4.1.4,
we may be able to find their inverses. That is, if we have a supply equation which can
be written in the form $p = p_S(q)$, then we call $p_S$ the inverse supply function as it is just
$q_S^{-1}$ whereas if we have a demand equation which can be written in the form $p = p_D(q)$,
then we call $p_D$ the inverse demand function as it is just $q_D^{-1}$. With these functions, we
can then see that the equilibrium quantity, i.e. the quantity that makes the suppliers’
price equal to the consumers’ price, can be found by solving the equation
$p_S(q) = p_D(q),$
and then we can use either of these functions to find the corresponding equilibrium
price.
Activity 4.11 Following on from Activity 4.9, where supply was given by the
equation 3q = 25 + 7p and demand was given by the equation 2q + 5p = 500, find
the inverse supply and demand functions.
Use these functions to find the equilibrium price and quantity.
What can go wrong?
However, situations where inverses do not exist are not hard to find as the next example
shows.
Example 4.8 Suppose that the function, f , is given by the formula
$f(x) = x^2.$
If we set y = f (x), this gives us
$y = x^2$
as the equation which relates the inputs, x, of f to its outputs, y. Rearranging this
to find x in terms of y, we find that
$y = x^2 \implies x = \pm\sqrt{y},$
and this equation now relates the outputs, y, of f to its inputs, x. However, we can
not use this to define an inverse function because we have two problems:
If the output, y, of f is negative this equation gives us no value for the
corresponding input, x.
If the output, y, of f is positive this equation gives us two values for the
corresponding input, x.
And, of course, we need to get exactly one value of x for each value of y in order to
define an inverse function!
Although, having said this, if we take some care with the inputs and outputs that we
are considering, then we can often overcome such problems and find an inverse function
as the next example shows.
Example 4.9 Suppose that the function, f , is given by the formula
$f(x) = x^2$
and we only want to consider values of x that are positive, i.e. we have x > 0. If we
set y = f (x), this gives us
$y = x^2$
as the equation which relates the inputs, x, of f to its outputs, y. In particular, as
$x^2 > 0$ for all values of x > 0, this can only give us outputs, y, that are positive, i.e.
we have y > 0. Rearranging this equation to find x in terms of y, we again find that
$y = x^2 \implies x = \pm\sqrt{y},$
and this equation again relates the outputs, y, of f to its inputs, x. But, now we can
find an inverse function since y > 0 means that we can always find a value for $\sqrt{y}$
and x > 0 means that we are only interested in the positive value, $+\sqrt{y}$, we get from
the square root, i.e. we can reject the problematic $-\sqrt{y}$ as we know that the values
of x that we started with are positive. That is, since f only takes positive numbers
as inputs and only gives positive numbers as outputs, we must have
$x = \sqrt{y},$
for all allowed values of x and y. Consequently, as each allowed value of y will give
exactly one allowed value of x, the inverse function exists and so, thinking of this
equation as $x = f^{-1}(y)$, we can then deduce that
$f^{-1}(y) = \sqrt{y},$
is the formula for the inverse of f.
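The role played by the restriction to positive inputs can be made explicit in code. In this sketch, which is our own illustration, both functions reject values outside the allowed ranges instead of silently returning something misleading.

```python
import math


def f(x):
    """f(x) = x**2, restricted to positive inputs as in Example 4.9."""
    if x <= 0:
        raise ValueError("this f only accepts positive inputs")
    return x ** 2


def f_inverse(y):
    """The inverse of the restricted f: the positive square root of y."""
    if y <= 0:
        raise ValueError("the outputs of f are always positive")
    return math.sqrt(y)


print(f_inverse(f(3)), f(f_inverse(16)))  # 3.0 16.0
```

Without the restriction, an input such as −3 would also be sent to 9, and no single-valued inverse could send 9 back to both 3 and −3.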
Activity 4.12 Following on from Example 4.9, verify that the inverse is correct by
showing that
$(f^{-1} \circ f)(x) = x$ and $(f \circ f^{-1})(y) = y,$
as we should expect.
Activity 4.13 Following on from Example 4.9, suppose that the function f is again
given by the formula f (x) = x2 , but now we only want to consider values of x that
are negative. Does f have an inverse? If it does, what is it?
If you do find an inverse, verify that
$(f^{-1} \circ f)(x) = x$ and $(f \circ f^{-1})(y) = y,$
as we should expect. (Take care here: Remember that x < 0!)
Furthermore, in some cases, this method for finding an inverse function just does not
work as we have no useful algebraic way of ‘rearranging’ the relevant equation. Instead,
we often have to define an entirely new, but related, function to do the job. This is what
happens, for instance, when we have an exponential function of the form
$f(x) = a^x,$
whose inverse is given by an appropriate logarithm. Indeed, as well as giving us the
inverse of exponential functions, logarithmic functions are useful in their own right. As
such, we now introduce logarithms as the last of our common functions, and take a look
at their special properties.
4.2.2 Logarithms
Logarithms are defined as follows.
Logarithms
If a ≠ 1 is a positive number (called the base) and x is a positive number (called
the argument) then the logarithm of x to base a, denoted by $\log_a(x)$, is the number
b such that $a^b = x$. That is,
$a^b = x$ means exactly the same thing as $b = \log_a(x)$.
In particular, it is always the case that $a^{\log_a(x)} = x$.
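For example, we can use this definition to see that as
$2^2 = 4, \quad 2^1 = 2, \quad 2^0 = 1 \quad\text{and}\quad 2^{-1} = \dfrac{1}{2},$
we must have
$\log_2(4) = 2, \quad \log_2(2) = 1, \quad \log_2(1) = 0 \quad\text{and}\quad \log_2\left(\dfrac{1}{2}\right) = -1,$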
respectively. Notice, in particular, that even though the base and the argument of a
logarithm must be positive, the logarithm itself can be negative.
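The definition can be explored with a short sketch. The wrapper `log_base` is our own name; it relies on Python's `math.log(x, a)`, which returns the logarithm of x to base a, and adds the restrictions on the base and the argument stated in the definition.

```python
import math


def log_base(a, x):
    """Return the logarithm of x to base a (a positive, a != 1, x positive)."""
    if a <= 0 or a == 1:
        raise ValueError("the base must be positive and not equal to 1")
    if x <= 0:
        raise ValueError("the argument must be positive")
    return math.log(x, a)


for x in (4, 2, 1, 0.5):
    b = log_base(2, x)
    # Raising the base to the logarithm should recover x, up to rounding.
    print(x, b, 2 ** b)
```

The printed logarithms should match the values 2, 1, 0 and −1 found above, up to rounding.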
Activity 4.14 Suppose that a ≠ 1 is any positive number. Explain why the
following results are true.
i. $\log_a(1) = 0$, ii. $\log_a(a) = 1$, iii. $\log_a\left(\dfrac{1}{a}\right) = -1$, iv. $\log_a(a^b) = b$.
Activity 4.15 Suppose that, for some positive number, a ≠ 1, the function, f, is
given by the formula
$f(x) = a^x.$
Explain why the inverse of f is given by $f^{-1}(y) = \log_a(y)$ as long as y is a positive
number.
Why does the inverse of f not exist if a = 1?
The laws of logarithms
As logarithms to base a are closely related to powers of a, the power laws that we saw
in Unit 1 allow us to deduce the laws of logarithms. These are as follows.
The laws of logarithms
Logarithms obey some simple laws:
The multiplication law:
$\log_a(x \cdot y) = \log_a(x) + \log_a(y),$
which follows from the fact that $a^u \cdot a^v = a^{u+v}$.
The division law:
$\log_a\left(\dfrac{x}{y}\right) = \log_a(x) - \log_a(y),$
which follows from the fact that $a^u / a^v = a^{u-v}$.
The power law:
$\log_a(x^y) = y \log_a(x),$
which follows from the fact that $(a^u)^v = a^{u \cdot v}$.
Notice that, when using these laws, all the logarithms have the same base.
Example 4.10 From the examples above we know that
$\log_2(4) + \log_2(2) = 2 + 1 = 3.$
We can verify this by using the laws of logarithms by noting that
$\log_2(4) + \log_2(2) = \log_2(4 \cdot 2)$   using the multiplication law
$= \log_2(2^3)$   as $4 \cdot 2 = 2^2 \cdot 2 = 2^3$
$= 3 \log_2(2)$   using the power law
$= 3$   as $\log_2(2) = 1$
as before.
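The three laws can also be spot-checked numerically. The function `laws_hold` below is our own illustration: it compares the two sides of each law up to a small tolerance, since floating-point logarithms are rounded.

```python
import math


def laws_hold(a, x, y, tol=1e-9):
    """Check the multiplication, division and power laws for base a."""
    def log_a(t):
        return math.log(t, a)

    multiplication = math.isclose(log_a(x * y), log_a(x) + log_a(y), rel_tol=tol, abs_tol=tol)
    division = math.isclose(log_a(x / y), log_a(x) - log_a(y), rel_tol=tol, abs_tol=tol)
    power = math.isclose(log_a(x ** y), y * log_a(x), rel_tol=tol, abs_tol=tol)
    return multiplication and division and power


print(laws_hold(2, 4, 2))        # the numbers from Example 4.10
print(laws_hold(10, 3.7, 0.2))   # an arbitrary second check
```

As with any numerical check, this illustrates the laws at particular values and does not replace the derivation from the power laws.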
Activity 4.16 Explain why $\log_a(x^2) = 2 \log_a(x)$ and $\log_a(x^3) = 3 \log_a(x)$ using (i)
the power law and (ii) the multiplication law. Can you see how this generalises?

Use the power laws to derive the laws of logarithms.
Changing base
Generally, when we use logarithms, we use logarithms to the base 10 or base e. As these
bases are so special, we have special names for them:
Logarithms to the base 10 are denoted by ‘log’ and are called ‘common logarithms’.
Logarithms to the base e are denoted by ‘ln’ and are called ‘natural logarithms’.
The main reason for emphasising these logarithms is that many calculators have
buttons which enable us to easily work them out. But, in this course, the basic
calculator which you are allowed to use in the examination does not have these buttons
and so the values of any logarithms (which can not easily be figured out) will be given
to an appropriate number of decimal places.
Example 4.11 If we needed the value of log(100), we would say that
$\log(100) = 2$ because $100 = 10^2$,
as this can be easily figured out.

However, if we needed the value of log(101), we might be told its value to an
appropriate number of decimal places, say
log(101) = 2.00432 (to 5dp),
as this value is not easy to figure out. Or, we might be told some other information
that would allow us to find this value using our basic calculators.
Sometimes, however, it is convenient to work with logarithms to some other base a, i.e.
‘$\log_a$’, and in such cases, when it comes to working them out, we need to know how to
convert the ‘$\log_a$’ into ‘log’s or ‘ln’s so that we can use any given
values to evaluate them. For such purposes the rule is as follows.
The change of base rule for logarithms
Given two bases a and b, we can convert a logarithm to base a, say $\log_a(x)$, into a
logarithm to base b, say $\log_b(x)$, by using
$$\log_a(x) = \frac{\log_b(x)}{\log_b(a)}.$$
In particular, if we were given the relevant ‘log’s or ‘ln’s we would have to use
$$\log_a(x) = \frac{\log(x)}{\log(a)} \quad\text{or}\quad \log_a(x) = \frac{\ln(x)}{\ln(a)},$$
respectively.
To see how this works, let’s say that we wanted to work out $\log_{100}(10{,}000)$ using
common logarithms. Given the numbers involved we don’t need a calculator to see that
$$\log_{100}(10{,}000) = \frac{\log(10{,}000)}{\log(100)} = \frac{\log(10^4)}{\log(10^2)} = \frac{4}{2} = 2,$$
which is what we would expect as $\log_{100}(10{,}000) = \log_{100}(100^2) = 2$. Alternatively, we
could have used natural logarithms, in which case we would have had to use a calculator
to see that
$$\log_{100}(10{,}000) = \frac{\ln(10{,}000)}{\ln(100)} = \frac{9.21034\ldots}{4.60517\ldots} = 2,$$
as before.
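The same calculation can be sketched in code. Here `log_change_of_base` is a hypothetical helper name of our own, and we assume that only natural logarithms are available (via `math.log`), which mirrors the second calculation above.

```python
import math

def log_change_of_base(a, x):
    """Evaluate log_a(x) using only natural logarithms: ln(x) / ln(a)."""
    return math.log(x) / math.log(a)

# log_100(10,000), which the text shows is equal to 2.
value = log_change_of_base(100, 10_000)
print(round(value, 5))
```

Because the computer works with rounded decimals, the result is only guaranteed to be very close to 2, which is why we round it before printing.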
Activity 4.18 Following on from Example 4.11, use the change of base rule for
logarithms to find log(101) to 5dp given that, to 5dp, ln(101) = 4.61512 and
ln(10) = 2.30259.

Derive the change of base rule for logarithms.
Learning outcomes
explain what a function is;
solve problems that involve the given common functions and combinations of them;
explain what an inverse function is;
find inverse functions if they exist or explain why they do not;
solve problems that involve logarithms.
Exercises
Exercise 4.1
The function C(x) = 10x + 315 gives the cost of producing x units of a product. Find
the cost when the following numbers of units are produced.
(i) 24, (ii) y, (iii) 3a + 2, (iv) x + y.
Exercise 4.2
Suppose that the functions f , g and h are given by
$f(x) = 3x + 6$, $g(x) = 2^x$, and $h(x) = \sin(x)$.
What are the following functions?
(i) $(f + h)(x)$; (ii) $(f \cdot h)(x)$; (iii) $(f \circ g)(x)$;
(iv) $(g \circ h)(x)$; (v) $f^{-1}(x)$; (vi) $g^{-1}(x)$.
Exercise 4.3
To convert a temperature from degrees Fahrenheit to degrees Centigrade we subtract 32
and then multiply by 5/9. If f is the temperature in degrees Fahrenheit and c is the
temperature in degrees Centigrade, find the function c(f ).
What is the inverse of this function?
Exercise 4.4
The total revenue that a firm receives from selling different levels of output q is given by
the function $R(q) = 40q - 4q^2$ for 0 ≤ q ≤ 10. A manager would like to know the inverse
function so that they can determine how many products need to be sold in order to
obtain a certain revenue. Explain why it is not possible to find this inverse function.
What happens if 0 ≤ q ≤ 5?
Exercise 4.5
Use the laws of logarithms to evaluate the following.
(i) $\log_3(81^2)$; (ii) $\log_5(25 \cdot 125)$;
(iii) $\log(1{,}000{,}000)$; (iv) $\log(100^3) - 2 \log(100)$.
Exercise 4.6
Solve the following equations.
(i) $x^2 = 4$, (ii) $2^x = 4$, (iii) $2^{x^2} = 4$.
5. Calculus I — Differentiation
Unit 5: Calculus I
Differentiation
Overview
In this unit, we start our study of calculus by introducing the notion of differentiation.
In particular, we ask how we can find the gradient of a curve at a point and see that
differentiation allows us to answer this question in a very easy way. We also see how to
differentiate some simple functions using standard derivatives and rules of
differentiation.
Aims
To see how derivatives are related to the gradient of a curve.
To introduce the techniques for finding simple derivatives.

5.1 The gradient of a curve at a point

We have seen that a function is a way of mathematically describing how one quantity
depends on another. That is, given x, which we call the independent variable since we
are free to specify its value, we can find the value of f (x). If we let y = f (x), then each
value of x, through f , gives us the value of y, which we call the dependent variable since
the value we get for y depends on the value of x we used. Differentiation, as we shall
soon see, is a way of seeing how changes in x are related to changes in y.
Example 5.1 In economics, if we are given a function which links the profit, π, of
a firm to its production level, q, we may want to find out how the profit changes if
we change the production level.
Indeed, if the profit function is linear, i.e. its graph is a straight line, we have
something like
$\pi(q) = mq + c$,
and we can easily see how this works since the gradient of a linear function tells us
how changes in profit, $\Delta\pi$, are related to changes in the production level, $\Delta q$. That
is, we have
$$m = \frac{\Delta\pi}{\Delta q},$$
and so an increase in production level of one unit (i.e. $\Delta q = 1$) leads to a change in
profit given by $\Delta\pi = m$.
However, we can only tell such a simple story for linear functions, i.e. straight lines,
because the gradient of a straight line is constant. That is, as we saw before, whichever
two points on the line we use to calculate the gradient, we always get the same answer.
But, unfortunately, this doesn’t work for more complicated curves.
Example 5.2 If we are given the quadratic function $f(x) = x^2$, whose graph is the
parabola $y = x^2$, we could try to estimate its gradient at the point (2, 4) by
considering the changes in x and y between this point and the points, say, (3, 9) and
(4, 16) which also lie on the parabola.
Here, the points (2, 4) and (3, 9) give us a gradient of
$$\frac{\Delta y}{\Delta x} = \frac{9 - 4}{3 - 2} = \frac{5}{1} = 5,$$
and the points (2, 4) and (4, 16) give us a gradient of
$$\frac{\Delta y}{\Delta x} = \frac{16 - 4}{4 - 2} = \frac{12}{2} = 6.$$
But, unsurprisingly perhaps, these give us different values and so, clearly, we can not
just use a pair of points on a parabola to find its gradient at the point (2, 4).
So, in the case of non-linear functions, i.e. curves that are not straight lines, since we
cannot just look at the changes between a point and any other point on the curve to
find the gradient of the curve at that point, we must ask what we can do to find it.
5.1.1 Tangents to a parabola

We start our discussion of how to find the gradient of a curve which isn’t a straight line
by considering how we could go about doing this for parabolae. Consider, for example,
the parabola $y = x^2$ illustrated in Figure 5.1(a). Let’s say that we want to find the
gradient of this curve at the point labelled A in the diagram. As you can see, this is on
the curve and has coordinates (2, 4).¹
To do this, we want to find the tangent to the curve at this point. This is the straight
line that goes through the point in a ‘special way’ as illustrated by the line labelled T in
Figure 5.1(a). Indeed, if we can find this line, then we say that the gradient of the curve
at A is given by the gradient of this line. That is, the gradient of the curve at A is
defined to be the gradient of the straight line which is the tangent to the curve at A.
Figure 5.1: (a) The straight line labelled T is the tangent to the parabola $y = x^2$ at
the point, A, on the curve with coordinates (2, 4). (b) The tangent, T, goes through the
point, A, with coordinates (2, 4) in a special way, namely, unlike other lines through A
(such as the lines $L_1$ and $L_2$) it only has one point of intersection with the parabola.
(Note that $L_1$ is ‘steeper’ than T and so it cuts the parabola at both A and B whereas
$L_2$ is ‘shallower’ than T and so it cuts the parabola at both A and C.)
¹ Obviously, to find the gradient of a curve at any given point, that point must lie on the curve!
But, how can we find the gradient of this line? We first notice that there are many lines,
with different gradients, that will go through this point. For instance, consider two
other straight lines that go through the point A, say the lines $L_1$ and $L_2$ in
Figure 5.1(b), and observe that
$L_1$ is steeper than T; whereas
$L_2$ is shallower than T.
That is, we can see that the gradient of T must be somewhere between the gradient of
L2 and the gradient of L1 . This means that we can try and find the gradient of T by
considering the gradients of other lines whose gradients provide us with better estimates
of its value and one way of doing this is to look at chords.
5.1.2 Chords of a parabola

Given two points on a curve, we call the line segment joining them a chord. So, in
Figure 5.2(a), the line segment C is the chord joining the points A(2, 4) and B(3, 9). We
can use chords to estimate the gradient of the tangent to $y = x^2$ at the point (2, 4), i.e.
the straight line T in Figure 5.2(a), and once we have this, we have an estimate of the
gradient of the curve at that point. Indeed, looking at Figure 5.2(b), we have drawn
three chords and these give us the following estimates for the gradient of T.
Figure 5.2: (a) C, the chord joining the points A(2, 4) and B(3, 9) on the parabola $y = x^2$
and, T, the tangent to $y = x^2$ at the point (2, 4). (b) The chords joining the points (3, 9),
$\left(\frac{8}{3}, \frac{64}{9}\right)$ and $\left(\frac{7}{3}, \frac{49}{9}\right)$ to the point (2, 4). Observe that, as the x-coordinate of the chord gets
closer to x = 2, the gradient of the chord gets closer to the gradient of T, the tangent to
$y = x^2$ at the point (2, 4).
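Now, the chord joining the points (2, 4) and
(3, 9) has a gradient given by $m = \frac{9 - 4}{3 - 2} = \frac{5}{1} = 5$.
$\left(\frac{8}{3}, \frac{64}{9}\right)$ has a gradient given by $m = \frac{\frac{64}{9} - 4}{\frac{8}{3} - 2} = \frac{\frac{28}{9}}{\frac{2}{3}} = \frac{14}{3} = 4\frac{2}{3}$.
$\left(\frac{7}{3}, \frac{49}{9}\right)$ has a gradient given by $m = \frac{\frac{49}{9} - 4}{\frac{7}{3} - 2} = \frac{\frac{13}{9}}{\frac{1}{3}} = \frac{13}{3} = 4\frac{1}{3}$.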
In particular, notice that as the x-coordinate of the other point on the curve gets closer
to x = 2, the gradients of the chords get smaller (i.e. the chords get less steep) and get
closer to the gradient of T . That is, we are getting better estimates of the gradient of T
and this is an idea that we can generalise to find the gradient of T itself!
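The three chord gradients above can be reproduced with exact fractions. This is a minimal sketch; the function name `chord_gradient` is our own and it is written only for the parabola $y = x^2$.

```python
from fractions import Fraction

def chord_gradient(x1, x2):
    """Gradient of the chord of y = x^2 joining the points with x-coordinates x1 and x2."""
    return (x2 ** 2 - x1 ** 2) / (x2 - x1)

# The other end of the chord moves towards x = 2, as in the text.
for x2 in (Fraction(3), Fraction(8, 3), Fraction(7, 3)):
    print(x2, chord_gradient(Fraction(2), x2))
```

Using `Fraction` rather than decimals keeps the answers in the same exact form as the hand calculation, namely 5, 14/3 and 13/3.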
To see how this generalisation works, let h > 0 be a real number and consider the chord,
C, joining the points (2, 4) and $(2 + h, [2 + h]^2)$ on the parabola $y = x^2$ as in
Figure 5.3(a). As before, we now look at the gradient of the chord joining these two
points and we find that
$$m = \frac{[2 + h]^2 - 4}{(2 + h) - 2} = \frac{[4 + 4h + h^2] - 4}{h} = \frac{4h + h^2}{h} = 4 + h.$$
Now, if we let h get smaller, i.e. the x-coordinate of B gets closer to x = 2, we should
get a better estimate of the gradient of T.² Indeed, we can see that as h gets closer and
closer to zero, m gets closer and closer to four and so, this must be the sought-after
gradient of T.
Figure 5.3: (a) C, the chord joining the points A(2, 4) and $B(2 + h, [2 + h]^2)$ on the parabola
$y = x^2$ for some real number h > 0 and, T, the tangent to $y = x^2$ at the point (2, 4). (b)
As we take three successively smaller values of h given by $h_1 > h_2 > h_3$, we see that the
gradient of the chord gets closer to the gradient of T.
But, of course, we can generalise this method further by asking for the gradient of the
tangent at any point $(x, x^2)$ on the curve $y = x^2$. In this case, we want the chord joining
the point $(x, x^2)$ with the point $(x + h, [x + h]^2)$ for some real number h > 0. The
gradient of the chord joining these two points is then given by
$$m = \frac{[x + h]^2 - x^2}{(x + h) - x} = \frac{[x^2 + 2hx + h^2] - x^2}{h} = \frac{2hx + h^2}{h} = 2x + h.$$
Now, if we let h get smaller, i.e. the x-coordinate of the point $(x + h, [x + h]^2)$ gets
closer to the x-coordinate of the point $(x, x^2)$, we should get a better estimate of the
gradient of the tangent line at the point $(x, x^2)$. Indeed, we can see that as h gets closer
and closer to zero, m gets closer and closer to 2x and so, this must be the sought-after
gradient of the tangent to $y = x^2$ at the point $(x, x^2)$. Indeed, notice that if we have
x = 2 as we did above, we find that the gradient of the tangent is 2 × 2 = 4 as before!
² This idea is illustrated in Figure 5.3(b) where we take three successively smaller values of h given
by $h_1 > h_2 > h_3$ and see how the gradient of the chord gets closer to the gradient of T.
5.1.3 Tangents to other curves

We have seen, in the special case where $f(x) = x^2$, that we can find the gradient of the
tangent to the curve $y = f(x)$ at the point $(x, f(x))$ by considering the gradient of the
chord between the points $(x, f(x))$ and $(x + h, f(x + h))$ for some real number h > 0.
Indeed, the gradient of this chord is given by
$$m = \frac{f(x + h) - f(x)}{(x + h) - x} = \frac{f(x + h) - f(x)}{h},$$
and the gradient of the tangent to the curve y = f (x) at the point (x, f (x)) is what this
gives us when we let h go to zero.
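To get a feel for what ‘letting h go to zero’ means, we can compute the chord gradient for a few shrinking values of h. The sketch below assumes nothing beyond the formula above; the names `difference_quotient` and `square`, and the particular values of h, are our own choices.

```python
def difference_quotient(f, x, h):
    """Gradient of the chord joining (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

def square(x):
    return x ** 2

# Shrink h towards zero and watch the chord gradient at x = 2 approach 4.
for h in (1.0, 0.1, 0.01, 0.001):
    print(h, difference_quotient(square, 2.0, h))
```

The printed gradients follow the pattern 4 + h found earlier (up to small rounding effects in the decimals), so they settle towards 4 as h shrinks.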
Activity 5.1 What is the gradient of the tangent to the curve $y = k$ (where k is a
fixed real number) at the point with x-coordinate, x?
Activity 5.2 What is the gradient of the tangent to the curve $y = mx + c$ (where
$m \neq 0$ and c are fixed real numbers) at the point with x-coordinate, x?
Activity 5.3 What is the gradient of the tangent to the curve $y = ax^2 + bx + c$
(where $a \neq 0$, b and c are fixed real numbers) at the point with x-coordinate, x?
5.2 What is differentiation?

As mentioned above, we want to define the gradient of the curve $y = f(x)$ at the point
$(x, f(x))$ to be the gradient of the tangent to the curve at this point. And, we have seen
how to find the latter by looking at the chords between the point $(x, f(x))$ and
$(x + h, f(x + h))$ for some real number h > 0 and then seeing what happens to the
quantity
$$\frac{f(x + h) - f(x)}{h},$$
as h goes to zero. This procedure is known as differentiation and we denote the new
function of x we find from this process by
$$\frac{df}{dx}, \quad\text{or more compactly,}\quad f'(x),$$
and this notation tells us to differentiate $f(x)$ with respect to the variable x. The
result of this procedure is called the derivative of $f(x)$ with respect to x.
Example 5.3 As we saw above, the gradient of the tangent to $y = f(x)$ with
$f(x) = x^2$ at the point $(x, x^2)$ is given by 2x. Thus, the derivative of $f(x)$ with
respect to x can be written as
$$\frac{df}{dx} = 2x, \quad\text{or more compactly,}\quad f'(x) = 2x.$$
If we want to calculate the derivative at a certain point, say when x = 2, we can now
evaluate
$$\left.\frac{df}{dx}\right|_{x=2} = 2 \cdot 2 = 4, \quad\text{or more compactly,}\quad f'(2) = 2 \cdot 2 = 4.$$
By definition, this must be the gradient of the tangent line to the parabola $y = x^2$ at
the point where x = 2, i.e. the point with coordinates (2, 4), and this is indeed what
we found earlier.
Most functions can be differentiated, but we don’t want to use the definition of
differentiation every time we need to find a derivative. So, we let other people do the
hard work and take note of two different kinds of thing they can tell us, namely:
standard derivatives so that we can differentiate our basic functions, and
rules of differentiation so that we can differentiate combinations of our basic
functions.
So, we now start our study of differentiation proper by introducing the most basic
standard derivatives and the two easiest rules.
5.2.1 Standard derivatives

We now introduce the standard derivatives which we will be using in this course. We
start with functions which are either constants or constant powers of x as these follow
on quite naturally from what we have seen so far. We then introduce the standard
derivatives for the other basic functions that we need.
Standard derivatives: Constant functions
If k is a constant, then the derivative of the function $f(x) = k$ is
$f'(x) = 0$.
That is, if k is a constant, $f(x) = k$ is a function whose derivative (or gradient) is equal
to zero at every point.
Example 5.4 Clearly, this means that:
If $f(x) = 5$, then $f'(x) = 0$.
If $f(x) = 0$, then $f'(x) = 0$.
If $f(x) = -5$, then $f'(x) = 0$.

That is, if we have a constant function (i.e. a function that gives us the same fixed
number as its output for any value of its input, x) we will get zero when we
differentiate it!
Activity 5.4 Thinking geometrically, why is this standard derivative obvious?
We now introduce a more complicated, and more useful, standard derivative.
Standard derivatives: Constant powers of x
If $k \neq 0$ is a constant, then the derivative of the function $f(x) = x^k$ is
$f'(x) = kx^{k-1}$.
Observe, in particular, that if k = 0, we have $f(x) = x^0 = 1$, which is a constant
function and so its derivative is $f'(x) = 0$ using the previous standard derivative. Also,
as
$f(x) = x$ can be written as $f(x) = x^1$, we have $f'(x) = 1x^0 = 1$,
i.e. we have
$f(x) = x \implies f'(x) = 1$,
which is a useful thing to remember.
If $f(x) = x^5$, then $f'(x) = 5x^4$.
If $f(x) = x^0$, then $f'(x) = 0$.
If $f(x) = x^{-5}$, then $f'(x) = -5x^{-6}$.

Indeed, using powers we can see that this allows us to differentiate some quite
complicated looking functions of x:
If $f(x) = \sqrt{x} = x^{1/2}$, then $f'(x) = \frac{1}{2}x^{-1/2}$.
If $f(x) = \frac{1}{\sqrt{x}} = x^{-1/2}$, then $f'(x) = -\frac{1}{2}x^{-3/2}$.
If $f(x) = \sqrt{x^3} = x^{3/2}$, then $f'(x) = \frac{3}{2}x^{1/2}$.
And so, when differentiating, always be on the look-out for functions of x which are
constant powers ‘in disguise’.
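The rule for constant powers is mechanical enough to write down as a short function. The sketch below is our own illustration: `power_rule` is a hypothetical name, and it represents the derivative $kx^{k-1}$ simply as the pair (coefficient, new power) using exact fractions.

```python
from fractions import Fraction

def power_rule(k):
    """For f(x) = x^k with constant k != 0, return (coefficient, power) of f'(x) = k x^(k-1)."""
    k = Fraction(k)
    return k, k - 1

print(power_rule(5))                # x^5
print(power_rule(Fraction(1, 2)))   # sqrt(x) written as x^(1/2)
print(power_rule(Fraction(-1, 2)))  # 1/sqrt(x) written as x^(-1/2)
print(power_rule(Fraction(3, 2)))   # sqrt(x^3) written as x^(3/2)
```

Notice that the ‘disguised’ functions have to be rewritten as powers of x by hand before the rule can be applied; the code does not do that step for us.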
Standard derivatives: exponential and logarithm functions
The derivative of the exponential function $f(x) = e^x$ is
$f'(x) = e^x$.
Observe, in particular, that this is one of the special properties of the function $e^x$ where
e is the exponential constant: it is the function whose value and gradient are the same
at every point. (We will see another special property of $e^x$ in Section 9.1.3.)
The derivative of the logarithm function $f(x) = \ln x$ is
$$f'(x) = \frac{1}{x}.$$
Of course, as the functions $e^x$ and $\ln x$ are related by the fact that one is the inverse of
the other, we should expect that there is a relationship between these results for their
derivatives. This is indeed the case as you will see in Activity 6.1.
We also have exponential and logarithm functions with bases other than e, but these
are easily derived from these standard derivatives as you will see in Activity 6.2.
Standard derivatives: sine and cosine functions
The derivative of the sine function $f(x) = \sin(x)$ is
$f'(x) = \cos(x)$,
whereas the derivative of the cosine function $f(x) = \cos(x)$ is
$f'(x) = -\sin(x)$.
Of course, as the sine and cosine functions are related by the fact that one is just a
‘shift’ of the other, we should expect that there will be a relationship between their
derivatives. Maybe you can look at the graphs of these functions (see Figure 4.5 in
Section 4.1.2) and convince yourself that their derivatives (i.e. their gradients at each
point) are related in this way.
Standard derivatives: summary
In summary, we have the following standard derivatives.
Standard derivatives
If k is a constant, then $f(x) = k$ gives $f'(x) = 0$.
If $k \neq 0$ is a constant, then $f(x) = x^k$ gives $f'(x) = kx^{k-1}$.
$f(x) = e^x$ gives $f'(x) = e^x$.
$f(x) = \ln x$ gives $f'(x) = \frac{1}{x}$.
$f(x) = \sin(x)$ gives $f'(x) = \cos(x)$.
$f(x) = \cos(x)$ gives $f'(x) = -\sin(x)$.
We now look at how we can differentiate some simple combinations of these functions.
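Before moving on, the standard derivatives for $e^x$, $\ln x$, $\sin(x)$ and $\cos(x)$ can be sanity-checked against the chord gradient with a very small h. This is only a rough numerical sketch: the name `chord_estimate`, the step h = 0.000001, the test point x = 1.3 and the tolerance are all assumptions of ours, and agreement at one point is not a proof.

```python
import math

def chord_estimate(f, x, h=1e-6):
    """Estimate f'(x) by the gradient of the chord from x to x + h."""
    return (f(x + h) - f(x)) / h

# Pairs of (function, claimed derivative) taken from the summary of standard derivatives.
standard = {
    "e^x": (math.exp, math.exp),
    "ln x": (math.log, lambda x: 1 / x),
    "sin x": (math.sin, math.cos),
    "cos x": (math.cos, lambda x: -math.sin(x)),
}

x0 = 1.3  # an arbitrary positive test point (ln x needs x > 0)
for name, (f, df) in standard.items():
    agrees = math.isclose(chord_estimate(f, x0), df(x0), rel_tol=1e-5)
    print(name, agrees)
```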
5.2.2 Two rules of differentiation

In Section 4.1.3, we saw five ways of combining given functions to make new functions.
The first two of these were, given a constant k and two functions $f(x)$ and $g(x)$, that we
could find:
a constant multiple of f, which was the function kf where $(kf)(x) = k \cdot f(x)$.
the sum of f and g, which was the function f + g where $(f + g)(x) = f(x) + g(x)$.
The question is, if we can differentiate the functions f and g, can we also differentiate
the functions kf and f + g? Obviously, the answer is ‘yes’, and we do it by using rules
of differentiation. Among other things, these rules will allow us to differentiate any
polynomial function of x.
The constant multiple rule
The constant multiple rule tells us how to differentiate a constant multiple of a function
f (x) and it works as follows.
Constant multiple rule
If k is a constant and f is a function, then
$$\frac{d}{dx}[kf(x)] = k\frac{df}{dx},$$
or, using our shorthand, $(kf)'(x) = kf'(x)$.
If $f(x) = 5x^3$, then $f'(x) = 5(3x^2) = 15x^2$.
If $f(x) = -3x^{-1/2}$, then $f'(x) = -3\left(-\frac{1}{2}x^{-3/2}\right) = \frac{3}{2}x^{-3/2}$.
If $f(x) = -6\sqrt{x^3} = -6x^{3/2}$, then $f'(x) = -6\left(\frac{3}{2}x^{1/2}\right) = -9x^{1/2}$.
So, in these cases, we just differentiate as before and then multiply the answer by
the appropriate constant multiple.
The sum rule
The sum rule tells us how to differentiate the sum of two functions f (x) and g(x) and it
works as follows.
Sum rule
If f and g are functions, then
$$\frac{d}{dx}[f(x) + g(x)] = \frac{df}{dx} + \frac{dg}{dx},$$
or, using our shorthand, $(f + g)'(x) = f'(x) + g'(x)$.
If $f(x) = x + x^3$, then $f'(x) = 1 + 3x^2$.
If $f(x) = x^2 + x^{1/2}$, then $f'(x) = 2x + \frac{1}{2}x^{-1/2}$.
If $f(x) = \sqrt{x} + \frac{1}{\sqrt{x}} = x^{1/2} + x^{-1/2}$, then $f'(x) = \frac{1}{2}x^{-1/2} - \frac{1}{2}x^{-3/2}$.
So, in these cases, we just differentiate as before and then add the answers together.
5.2.3 Some general points on what we have seen so far

We now take a moment to see what we can and can not do based on what we have seen
in this section.
Combining our two rules of differentiation
It should be clear that, taken together, our two rules of differentiation enable us to
differentiate functions of the form $kf(x) + lg(x)$ as follows.
Linear combination rule
If k and l are constants and f and g are functions, then
$$\frac{d}{dx}[kf(x) + lg(x)] = k\frac{df}{dx} + l\frac{dg}{dx},$$
or, using our shorthand, $(kf + lg)'(x) = kf'(x) + lg'(x)$.
If $f(x) = 5x^2 + 7x^3$, then $f'(x) = 5(2x) + 7(3x^2) = 10x + 21x^2$.
If $f(x) = x^2 - x^{1/2}$, then $f'(x) = 2x - \frac{1}{2}x^{-1/2}$.
If $f(x) = \sqrt{x} - \frac{3}{\sqrt{x}} = x^{1/2} - 3x^{-1/2}$, then
$f'(x) = \frac{1}{2}x^{-1/2} - 3\left(-\frac{1}{2}x^{-3/2}\right) = \frac{1}{2}x^{-1/2} + \frac{3}{2}x^{-3/2}$.
So, in these cases, we just differentiate as before and combine the answers in the
obvious way.
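Since a polynomial is just a linear combination of powers of x, the linear combination rule lets us differentiate any polynomial term by term. As a sketch under our own conventions (the name `differentiate_polynomial` and the choice to store a polynomial as a list of coefficients are assumptions made for illustration):

```python
def differentiate_polynomial(coefficients):
    """Differentiate c0 + c1*x + c2*x^2 + ... given as the list [c0, c1, c2, ...].

    Each term c_k * x^k becomes k * c_k * x^(k-1); the constant term drops out.
    """
    return [k * c for k, c in enumerate(coefficients)][1:]

# 5x^2 + 7x^3 is stored as [0, 0, 5, 7]; its derivative 10x + 21x^2 is [0, 10, 21].
print(differentiate_polynomial([0, 0, 5, 7]))
```

Here the position in the list plays the role of the power k, so the constant multiple rule appears as the multiplication `k * c` and the sum rule appears as treating each list entry separately.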
Activity 5.5 Show that the constant multiple rule and the sum rule do indeed give
the linear combination rule.
Hence, use the linear combination rule to find the derivative of the function
f (x) − g(x) in terms of the derivatives of the functions f (x) and g(x).
What we can and can not differentiate
There are some functions, related to the ones that we have seen above, which we can
differentiate using what we have seen so far.
Example 5.9 The following functions can be differentiated by simplifying them
first.
As $f(x) = (2x)^2 = 4x^2$, we have $f'(x) = 4(2x) = 8x$.
As $f(x) = (4x)^{1/2} = 2x^{1/2}$, we have $f'(x) = 2\left(\frac{1}{2}x^{-1/2}\right) = x^{-1/2}$.
As $f(x) = (2x)^{-3} = \frac{1}{8}x^{-3}$, we have $f'(x) = \frac{1}{8}(-3x^{-4}) = -\frac{3}{8}x^{-4}$.

In particular, be sure that you understand why we get these derivatives and not
anything else!
However, there are many other functions that we can not differentiate yet! And, in
particular, we now consider some common errors so that we can be sure that we don’t
make them in the future!
Example 5.10 Please note that, for two functions $f(x)$ and $g(x)$,
$\frac{d}{dx}[f(x) \cdot g(x)]$ is NOT $\frac{df}{dx} \cdot \frac{dg}{dx}$.
$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right]$ is NOT $\frac{df}{dx} \Big/ \frac{dg}{dx}$.
And, for things like $f(x) = e^{2x}$, we can NOT say what $f'(x)$ is, as even though we
can differentiate $e^x$, we don’t yet know how to deal with the ‘2’ in $e^{2x}$.
The correct way of differentiating all of the things listed in this example will be dealt
with in Unit 6.
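A quick numerical experiment shows why the first of these errors really is an error. In this sketch, the functions $f(x) = x^2$ and $g(x) = x^3$, the point x = 2 and the helper `chord_estimate` are our own choices; the product is $x^5$, which we already know how to differentiate.

```python
def chord_estimate(f, x, h=1e-6):
    """Estimate f'(x) by the gradient of the chord from x to x + h."""
    return (f(x + h) - f(x)) / h

def f(x):
    return x ** 2   # f'(x) = 2x

def g(x):
    return x ** 3   # g'(x) = 3x^2

def product(x):
    return f(x) * g(x)   # x^5, whose derivative is 5x^4

x0 = 2.0
true_gradient = chord_estimate(product, x0)   # should be close to 5 * 2^4 = 80
naive_guess = (2 * x0) * (3 * x0 ** 2)        # f'(2) * g'(2) = 4 * 12 = 48
print(round(true_gradient, 3), naive_guess)
```

The two numbers are clearly different, so multiplying the derivatives together can not be the right way to differentiate a product.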
Activity 5.6 If k is a constant, for each of the following functions, find its
derivative or explain why it can not be found using the results in this unit.
(i) $f(x) = e^{x+k}$, (ii) $g(x) = e^{kx}$, (iii) $h(x) = e^{x^k}$.
Activity 5.7 If k is a constant, for each of the following functions, find its
derivative or explain why it can not be found using the results in this unit.
(i) $f(x) = \ln(x + k)$, (ii) $g(x) = \ln(kx)$, (iii) $h(x) = \ln(x^k)$.
Differentiating with respect to other variables
So far, we have only been differentiating functions of x, like f (x), with respect to x. But
sometimes, we will want to differentiate functions of other variables with respect to
their variable. The good news is that everything we have seen so far carries over in a
straightforward way.
Example 5.11 Given the function $f(y) = y^k$ where $k \neq 0$ is a constant, we can
write our standard derivative as
$$\frac{df}{dy} = ky^{k-1}, \quad\text{or more compactly,}\quad f'(y) = ky^{k-1},$$
so that we have things like
$f(y) = 7y^3 \implies f'(y) = 7(3y^2) = 21y^2$.
That is, everything stays the same with the exception that the ‘x’s are now replaced
with ‘y’s.
Example 5.12 Similarly, if $f(q) = q^2 - 3q + 7$, then $f'(q) = 2q - 3(1) + 0 = 2q - 3$.
Learning outcomes
explain the relationship between the gradient of a curve and the derivative of a
function;
find simple derivatives by using the definition of the derivative;
find simple derivatives by using standard derivatives and the rules of differentiation.
Exercises
Exercise 5.1
Consider the parabola $y = x^2$ and the point (3, 9) that is on this curve.
i. Find the gradient of the chords joining this point to the points on the curve with
$x = 4$, $x = 3\frac{1}{2}$ and $x = 3\frac{1}{4}$.
ii. Find the gradient of the chord joining this point to the point on the curve with
$x = 3 + h$ where h > 0 is a real number. What value does this give you as h goes
to zero?
iii. By differentiating the function $f(x)$ given by the curve $y = f(x)$ above, find the
gradient of the curve at the point (3, 9).
Note: Your final answers to ii. and iii. should be the same!
Exercise 5.2
Find the derivatives of the following functions.
i. $f(x) = -17$;
ii. $g(x) = 27x$;
iii. $h(x) = 2x^3$;
iv. $j(x) = 20x^{253}$;
v. $k(x) = 3 + \ln(x)$;
vi. $l(x) = 2\sin(x)$;
vii. $n(x) = 3 + x + \cos(x)$;
viii. $p(x) = x + 2e^x$.
Exercise 5.3
i. $f(x) = \sin(x) + \cos(x)$;
ii. $g(x) = \ln(x) + 4e^x$;
iii. $h(x) = e^x - \cos(x)$;
iv. $j(x) = 3\sin(x) - 3\ln(x)$;
v. $k(x) = \frac{4}{x^3}$;
vi. $l(x) = 5\sqrt{x}$;
vii. $n(x) = 3x^2 - 5x + 7$;
viii. $p(x) = 3x^{10} + 8x^5$;
ix. $r(x) = 3\sqrt{x^3} - 2x^{-1/2}$;
x. $s(x) = \frac{2}{x^2} + \frac{3}{2x^5}$.
Exercise 5.4
i. $f(y) = 6y - 5$;
ii. $g(q) = q^2 - 3q + 2$;
iii. $h(z) = z^2 - \sqrt{z}$;
iv. $j(p) = -6$.
6. Calculus II — More differentiation
Unit 6: Calculus II
More differentiation
Overview
We start by looking at how to differentiate products, quotients and compositions of

functions by introducing three new rules of differentiation. We then consider how
derivatives can be used to find approximations to functions and the relevance of this to
economics.
Aims

To introduce the techniques for finding more complicated derivatives.
To see how derivatives allow us to find approximations.

6.1 Three more rules of differentiation

We now consider the other three ways of combining given functions which we saw in
Section 4.1.3. These were, given two functions f (x) and g(x), we could find the:
Product of f and g, which was the function $f \cdot g$ where $(f \cdot g)(x) = f(x)g(x)$.
Quotient of f and g, which was the function $f/g$ where $(f/g)(x) = f(x)/g(x)$.¹
Composition of f and g, which was the function $f \circ g$ where $(f \circ g)(x) = f(g(x))$.

Once again, the question is, if we can differentiate f and g, can we also differentiate the
functions f · g, f /g and f ◦ g? And, once again, the answer is ‘yes’ and we do it by using
three new rules of differentiation.
6.1.1 The product rule
The product rule tells us how to differentiate the product of two functions f (x) and g(x)
and it works as follows.
Product rule
$$\frac{d}{dx}[f(x)g(x)] = \frac{df}{dx}g(x) + f(x)\frac{dg}{dx},$$
or, using our shorthand, $(f \cdot g)'(x) = f'(x)g(x) + f(x)g'(x)$.
¹ Provided, of course, that $g(x) \neq 0$.
Example 6.1 Differentiate the function $h(x) = (x + 1)^2$.
We can write this function as $h(x) = (x + 1)(x + 1)$ and so we have the product of
the two functions
$f(x) = x + 1$ and $g(x) = x + 1$,
and these give us
$f'(x) = 1$ and $g'(x) = 1$.
As such, the product rule tells us that
$h'(x) = (1)(x + 1) + (x + 1)(1) = 2(x + 1)$.
Notice that we can check this answer as we can write $h(x) = (x + 1)^2$ as
$h(x) = x^2 + 2x + 1$,
by multiplying out the brackets and, differentiating, this gives us
$h'(x) = 2x + 2(1) = 2(x + 1)$,
if we factorise. Clearly, this is the same as the answer we got from the product rule.
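The product rule can also be written as a small function that builds the derivative of a product from its four ingredients. This is a sketch under our own naming (`product_rule`, `x_plus_one`, `one`); it is applied to the product in Example 6.1 and compared with the expanded answer $2x + 2$.

```python
def product_rule(f, df, g, dg):
    """Return the function x -> f'(x) g(x) + f(x) g'(x)."""
    return lambda x: df(x) * g(x) + f(x) * dg(x)

def x_plus_one(x):
    return x + 1

def one(x):
    return 1

# h(x) = (x + 1)(x + 1), so f = g = x + 1 and f' = g' = 1.
dh = product_rule(x_plus_one, one, x_plus_one, one)

for x0 in (0, 1, 2.5):
    print(x0, dh(x0), 2 * x0 + 2)
```

At each test point the two printed values agree, as the worked example says they should.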
Example 6.2 Differentiate the function $h(x) = x e^x$.
This is the product of the two functions
$f(x) = x$ and $g(x) = e^x$,
and these give us
$f'(x) = 1$ and $g'(x) = e^x$.
As such, the product rule tells us that
$h'(x) = (1)(e^x) + (x)(e^x) = (1 + x)e^x$.
Here, we can not check the answer as we can not rewrite the function $h(x) = x e^x$.
Example 6.3 Differentiate the function $h(x) = x \ln(x)$.
This is the product of the two functions
$f(x) = x$ and $g(x) = \ln(x)$,
and these give us
$f'(x) = 1$ and $g'(x) = \frac{1}{x}$.
As such, the product rule tells us that
$$h'(x) = (1)(\ln(x)) + (x)\left(\frac{1}{x}\right) = \ln(x) + 1.$$
Here, we can not check the answer as we can not rewrite the function $h(x) = x \ln(x)$.
Example 6.4 Differentiate the function $h(x) = e^x \ln(x)$.
This is the product of the two functions
$f(x) = e^x$ and $g(x) = \ln(x)$,
and these give us
$f'(x) = e^x$ and $g'(x) = \frac{1}{x}$.
As such, the product rule tells us that
$$h'(x) = (e^x)(\ln(x)) + (e^x)\left(\frac{1}{x}\right) = e^x\left(\ln(x) + \frac{1}{x}\right).$$
Here, we can not check the answer as we can not rewrite the function
$h(x) = e^x \ln(x)$.
6.1.2 The quotient rule

The quotient rule tells us how to differentiate the quotient of two functions f (x) and
g(x) and it works as follows.
Quotient rule
$$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{\frac{df}{dx}g(x) - f(x)\frac{dg}{dx}}{[g(x)]^2},$$
or, using our shorthand,
$$\left(\frac{f}{g}\right)'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}.$$
Of course, this all assumes that the quotient of the two functions is defined, i.e. it
only works for values of x where $g(x) \neq 0$.
Example 6.5 Differentiate the function $h(x) = \frac{x + 1}{x}$ for $x \neq 0$.
This is the quotient of the two functions
$f(x) = x + 1$ and $g(x) = x$,
and these give us
$f'(x) = 1$ and $g'(x) = 1$.
As such, the quotient rule tells us that
$$h'(x) = \frac{(1)(x) - (x + 1)(1)}{x^2} = -\frac{1}{x^2},$$
for $x \neq 0$. Notice that we can check this answer as we can write $h(x)$ as
$$h(x) = \frac{x + 1}{x} = \frac{x}{x} + \frac{1}{x} = 1 + \frac{1}{x} = 1 + x^{-1},$$
and, differentiating, this gives us
$$h'(x) = 0 + (-x^{-2}) = -\frac{1}{x^2}.$$
Clearly, this is the same as the answer we got from the quotient rule.
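As with the product rule, the quotient rule can be sketched as a function that assembles the derivative from f, f', g and g'. The name `quotient_rule` and the test point x = 2 are our own choices, and the code is only meaningful where $g(x) \neq 0$.

```python
def quotient_rule(f, df, g, dg):
    """Return x -> (f'(x) g(x) - f(x) g'(x)) / g(x)^2, valid only where g(x) != 0."""
    return lambda x: (df(x) * g(x) - f(x) * dg(x)) / g(x) ** 2

# Example 6.5: h(x) = (x + 1) / x, so f(x) = x + 1, g(x) = x and f' = g' = 1.
dh = quotient_rule(lambda x: x + 1, lambda x: 1.0, lambda x: x, lambda x: 1.0)

x0 = 2.0
print(dh(x0), -1 / x0 ** 2)
```

Both printed values are the gradient at x = 2, once from the quotient rule and once from the simplified answer $-1/x^2$.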
Example 6.6 Differentiate the function $h(x) = \frac{e^x}{x}$ for $x \neq 0$.
This is the quotient of the two functions
$f(x) = e^x$ and $g(x) = x$,
and these give us
$f'(x) = e^x$ and $g'(x) = 1$.
As such, the quotient rule tells us that
$$h'(x) = \frac{(e^x)(x) - (e^x)(1)}{x^2} = \frac{x - 1}{x^2}e^x,$$
for $x \neq 0$. Here, we can not check the answer as we can not rewrite the function
$h(x) = e^x/x$.
Example 6.7 Differentiate the function $h(x) = \frac{x^3}{\ln(x)}$ for $x \neq 1$.²
This is the quotient of the two functions
$f(x) = x^3$ and $g(x) = \ln(x)$,
and these give us
$f'(x) = 3x^2$ and $g'(x) = \frac{1}{x}$.
As such, the quotient rule tells us that
$$h'(x) = \frac{(3x^2)(\ln(x)) - (x^3)\left(\frac{1}{x}\right)}{[\ln(x)]^2} = \frac{x^2(3\ln(x) - 1)}{[\ln(x)]^2},$$
for $x \neq 1$. Here, we can not check the answer as we can not rewrite the function
$h(x) = x^3/\ln(x)$.
² Because, if x = 1, we have $\ln(x) = 0$!
Example 6.8 Differentiate the function $h(x) = \frac{\ln(x)}{e^x}$.³
This is the quotient of the two functions
$f(x) = \ln(x)$ and $g(x) = e^x$,
and these give us
$f'(x) = \frac{1}{x}$ and $g'(x) = e^x$.
As such, the quotient rule tells us that
$$h'(x) = \frac{\left(\frac{1}{x}\right)(e^x) - (\ln(x))(e^x)}{[e^x]^2} = \frac{(1 - x\ln(x))e^x}{x e^{2x}} = \frac{1 - x\ln(x)}{x e^x}.$$
Here, we can not check the answer as we can not rewrite the function
$h(x) = \ln(x)/e^x$.
³ Observe that as $e^x > 0$ for all real numbers, x, we don’t have to worry about dividing by zero here.
6.1.3 The chain rule

The chain rule tells us how to differentiate the composition of two functions $f(x)$ and
$g(x)$ and it works as follows.
Chain rule
$$\frac{d}{dx}[f(g(x))] = \frac{df}{dg} \cdot \frac{dg}{dx},$$
or, using our shorthand, $(f \circ g)'(x) = f'(g)g'(x)$.
Example 6.9 Differentiate the function $h(x) = (x + 1)^2$.
The function $h(x) = (x + 1)^2$ is the composition of the functions
$f(g) = g^2$ and $g(x) = x + 1$.
As such we have
$f'(g) = 2g$ and $g'(x) = 1$,
and so the chain rule tells us that
$h'(x) = (2g)(1) = 2g = 2(x + 1)$.
Notice that this is the same as the answer we found in Example 6.1.
Example 6.10 Differentiate the function $h(x) = (2x + 1)^3$.
The function $h(x) = (2x + 1)^3$ is the composition of the functions
$f(g) = g^3$ and $g(x) = 2x + 1$.
As such we have
$f'(g) = 3g^2$ and $g'(x) = 2$,
and so the chain rule tells us that
$h'(x) = (3g^2)(2) = 6g^2 = 6(2x + 1)^2$.
Notice that we can check this answer as we can write $h(x) = (2x + 1)^3$ as
$h(x) = 8x^3 + 12x^2 + 6x + 1$,
by multiplying out the brackets and, differentiating, this gives us
$h'(x) = 24x^2 + 24x + 6 = 6(4x^2 + 4x + 1) = 6(2x + 1)^2$,
if we factorise. And, clearly, this is the same as the answer we got from the chain
rule.
Example 6.11 Differentiate the function $h(x) = \sqrt{2x+1}$.
The function $h(x) = \sqrt{2x+1}$ is the composition of the functions
\[ f(g) = \sqrt{g} = g^{1/2} \quad\text{and}\quad g(x) = 2x + 1. \]
As such we have
\[ f'(g) = \frac{1}{2} g^{-1/2} \quad\text{and}\quad g'(x) = 2, \]
and so the chain rule tells us that
\[ h'(x) = \frac{1}{2} g^{-1/2} (2) = g^{-1/2} = \frac{1}{\sqrt{2x+1}}. \]
Here, we cannot check the answer as we cannot rewrite the function
$h(x) = \sqrt{2x+1}$.
Example 6.12 Differentiate the function $h(x) = \sqrt{x^3+2}$.
The function $h(x) = \sqrt{x^3+2}$ is the composition of the functions
\[ f(g) = \sqrt{g} = g^{1/2} \quad\text{and}\quad g(x) = x^3 + 2. \]
As such we have
\[ f'(g) = \frac{1}{2} g^{-1/2} \quad\text{and}\quad g'(x) = 3x^2, \]
and so the chain rule tells us that
\[ h'(x) = \frac{1}{2} g^{-1/2} (3x^2) = \frac{3x^2}{2\sqrt{g}} = \frac{3x^2}{2\sqrt{x^3+2}}. \]
Here, we cannot check the answer as we cannot rewrite the function
$h(x) = \sqrt{x^3+2}$.
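As with the quotient rule, the chain-rule answers in Examples 6.11 and 6.12 can at least be checked numerically. The sketch below is an illustration only: `chain_rule`, `numerical_derivative` and the test point `x0 = 1.5` are assumed names and values, not part of the guide.

```python
import math


def numerical_derivative(func, x, step=1e-6):
    """Central-difference estimate of the gradient of func at x."""
    return (func(x + step) - func(x - step)) / (2 * step)


def chain_rule(f_prime, g, g_prime, x):
    """Evaluate f'(g) * g'(x) with g replaced by g(x)."""
    return f_prime(g(x)) * g_prime(x)


def sqrt_prime(g):
    """Derivative of f(g) = g^(1/2) with respect to g."""
    return 0.5 * g ** -0.5


x0 = 1.5  # assumed test point

# Example 6.11: g(x) = 2x + 1
chain_611 = chain_rule(sqrt_prime, lambda x: 2 * x + 1, lambda x: 2.0, x0)
formula_611 = 1 / math.sqrt(2 * x0 + 1)
estimate_611 = numerical_derivative(lambda x: math.sqrt(2 * x + 1), x0)

# Example 6.12: g(x) = x^3 + 2
chain_612 = chain_rule(sqrt_prime, lambda x: x ** 3 + 2, lambda x: 3 * x ** 2, x0)
formula_612 = 3 * x0 ** 2 / (2 * math.sqrt(x0 ** 3 + 2))
estimate_612 = numerical_derivative(lambda x: math.sqrt(x ** 3 + 2), x0)

print(chain_611, formula_611, estimate_611)
print(chain_612, formula_612, estimate_612)
```

The three values on each line should agree closely: the first applies the chain rule step by step, the second uses the simplified answer, and the third is the numerical estimate.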
6.1.4 Further applications of the chain rule

We saw in Activities 5.6 and 5.7 that we could differentiate some quite complicated
functions involving logarithms and exponentials by being clever with the power laws
and the laws of logarithms. However, we also saw that some of the functions that we
wanted to differentiate, such as the functions
\[ \ln(x+k), \quad e^{kx} \quad\text{and}\quad e^{x^k}, \]
where $k$ is a constant, couldn't be differentiated using such techniques. But, as is
hopefully clear, we can now differentiate these functions by using the chain rule. Let's
consider each of these in turn:
The function $h(x) = \ln(x+k)$ is the composition of the functions
\[ f(g) = \ln(g) \quad\text{and}\quad g(x) = x + k. \]
As such we have
\[ f'(g) = \frac{1}{g} \quad\text{and}\quad g'(x) = 1, \]
and so we get
\[ h'(x) = \frac{1}{g}(1) = \frac{1}{g} = \frac{1}{x+k}, \]
from the chain rule.
The function $h(x) = e^{kx}$ is the composition of the functions
\[ f(g) = e^g \quad\text{and}\quad g(x) = kx. \]
As such we have
\[ f'(g) = e^g \quad\text{and}\quad g'(x) = k, \]
and so we get
\[ h'(x) = (e^g)(k) = k\, e^g = k\, e^{kx}, \]
from the chain rule.
The function $h(x) = e^{x^k}$ is the composition of the functions
\[ f(g) = e^g \quad\text{and}\quad g(x) = x^k. \]
As such we have
\[ f'(g) = e^g \quad\text{and}\quad g'(x) = k x^{k-1}, \]
and so we get
\[ h'(x) = (e^g)(k x^{k-1}) = k x^{k-1} e^g = k x^{k-1} e^{x^k}, \]
from the chain rule.
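The three results just derived can be spot-checked numerically for a particular constant. This is only a sketch: the value `k = 3`, the test point `x0 = 0.7` and the helper `numerical_derivative` are assumptions chosen for illustration.

```python
import math


def numerical_derivative(func, x, step=1e-6):
    """Central-difference estimate of the gradient of func at x."""
    return (func(x + step) - func(x - step)) / (2 * step)


k = 3      # assumed constant
x0 = 0.7   # assumed test point (we need x0 + k > 0 for the logarithm)

# (description, function, derivative obtained from the chain rule)
cases = [
    ("ln(x + k)", lambda x: math.log(x + k), lambda x: 1 / (x + k)),
    ("e^(kx)", lambda x: math.exp(k * x), lambda x: k * math.exp(k * x)),
    ("e^(x^k)", lambda x: math.exp(x ** k),
     lambda x: k * x ** (k - 1) * math.exp(x ** k)),
]

for name, h, h_prime in cases:
    print(name, numerical_derivative(h, x0), h_prime(x0))
```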
Activity 6.1 (Hard)

Suppose that $y = e^x$ so that $x = \ln(y)$. Use the standard derivative for $e^x$ and the
chain rule to show that
\[ \frac{dx}{dy} = \frac{1}{y}. \]
Hence deduce the standard derivative for $\ln(x)$.
Activity 6.2 (Hard)

Suppose that $a \neq 1$ is a positive number. Show that
\[ f(x) = a^x \implies f'(x) = a^x \ln(a), \]
and
\[ f(x) = \log_a(x) \implies f'(x) = \frac{1}{x \ln(a)}. \]
6.1.5 Using these rules of differentiation together

Sometimes, it is necessary to apply several of the rules of differentiation in order to find
a derivative. We now consider two examples that show how this can be done.
Example 6.13 Find the derivative of the function $l(x) = (x^3+1)\ln(x^2+4)$.
This is the product of the functions
\[ f(x) = x^3 + 1 \quad\text{and}\quad g(x) = \ln(x^2+4), \]
and clearly, $f'(x) = 3x^2$. But, to differentiate $g(x)$, we need to use the chain rule
because it is a composition. In this case, we have
\[ g(h) = \ln(h) \quad\text{and}\quad h(x) = x^2 + 4, \]
which gives us
\[ g'(h) = \frac{1}{h} \quad\text{and}\quad h'(x) = 2x, \]
so that
\[ g'(x) = \frac{1}{h}(2x) = \frac{2x}{h} = \frac{2x}{x^2+4}, \]
by the chain rule. Now, putting all of this into the product rule gives us
\[ l'(x) = 3x^2 \ln(x^2+4) + (x^3+1)\,\frac{2x}{x^2+4} = 3x^2 \ln(x^2+4) + \frac{2x(x^3+1)}{x^2+4}, \]
as the sought-after derivative.
Example 6.14 Find the derivative of the function $l(x) = e^{x^2}\ln(x^3+1)$.
This is the product of the functions
\[ f(x) = e^{x^2} \quad\text{and}\quad g(x) = \ln(x^3+1). \]
To differentiate $f(x)$ we need to use the chain rule because it is a composition. In
this case, we have
\[ f(h) = e^h \quad\text{and}\quad h(x) = x^2, \]
which gives us
\[ f'(h) = e^h \quad\text{and}\quad h'(x) = 2x, \]
so that
\[ f'(x) = (e^h)(2x) = 2x\, e^h = 2x\, e^{x^2}, \]
by the chain rule. Then, to differentiate $g(x)$ we need to use the chain rule again
because it is also a composition. In this case, we have
\[ g(h) = \ln(h) \quad\text{and}\quad h(x) = x^3 + 1, \]
which gives us
\[ g'(h) = \frac{1}{h} \quad\text{and}\quad h'(x) = 3x^2, \]
so that
\[ g'(x) = \frac{1}{h}(3x^2) = \frac{3x^2}{h} = \frac{3x^2}{x^3+1}, \]
by the chain rule. Now, putting all of this into the product rule gives us
\[ l'(x) = 2x\, e^{x^2} \ln(x^3+1) + e^{x^2}\,\frac{3x^2}{x^3+1} = \left[ 2x \ln(x^3+1) + \frac{3x^2}{x^3+1} \right] e^{x^2}, \]
as the sought-after derivative.
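When several rules are combined it is easy to drop a factor, so a numerical comparison is a useful sanity check on the answers to Examples 6.13 and 6.14. The following is a sketch under stated assumptions: the function names, the helper `numerical_derivative` and the test point `x0 = 1` are chosen for illustration and do not come from the guide.

```python
import math


def numerical_derivative(func, x, step=1e-6):
    """Central-difference estimate of the gradient of func at x."""
    return (func(x + step) - func(x - step)) / (2 * step)


def l_613(x):
    """The function from Example 6.13."""
    return (x ** 3 + 1) * math.log(x ** 2 + 4)


def l_613_prime(x):
    """The derivative found in Example 6.13 (product rule plus chain rule)."""
    return 3 * x ** 2 * math.log(x ** 2 + 4) + 2 * x * (x ** 3 + 1) / (x ** 2 + 4)


def l_614(x):
    """The function from Example 6.14 (defined for x > -1)."""
    return math.exp(x ** 2) * math.log(x ** 3 + 1)


def l_614_prime(x):
    """The derivative found in Example 6.14 (product rule plus two chain rules)."""
    return (2 * x * math.log(x ** 3 + 1) + 3 * x ** 2 / (x ** 3 + 1)) * math.exp(x ** 2)


x0 = 1.0  # assumed test point
print(numerical_derivative(l_613, x0), l_613_prime(x0))
print(numerical_derivative(l_614, x0), l_614_prime(x0))
```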
6.2 Approximating functions

Given a function, $f(x)$, we can find its derivative, $f'(x)$, and we know from Unit 5 that
the latter function tells us the gradient of the curve $y = f(x)$ at the point on the curve
with coordinates $(x, f(x))$. As we shall now see, knowing the gradient of a function at a
point gives us a way of finding some useful approximations.
To see why, let's say we have a cost function, $C(q)$, that tells us the cost of producing a
quantity, $q$, of some good. We might be interested in finding the increase in costs, $\Delta C$,
caused by changing the quantity produced from, say, $q_0$ to $q_0 + \Delta q$, i.e. an increase in
production of $\Delta q$. In this case, the exact expression for the change in costs given this
change in production will be
\[ \Delta C = C(q_0 + \Delta q) - C(q_0). \]
This can be thought of as the marginal (or additional) cost of producing an extra
quantity, $\Delta q$, of our good. But, if $\Delta q$ is small⁴ we can find an approximation for $\Delta C$
which uses the derivative of the cost function, $C'(q)$, namely
\[ \Delta C \simeq C'(q_0)\,\Delta q. \]
Let’s look at an example to see how the answers from these two approaches compare.
Example 6.15 It costs a firm $C(q) = 100q + 2q^2$ pounds to produce $q$ units of a
good. What is the increase in cost that would result from an increase in production
from 50 to 51 units?
To find the exact increase in costs, $\Delta C$, we need to find
\[ \Delta C = C(51) - C(50) = [100(51) + 2(51)^2] - [100(50) + 2(50)^2] = 302. \]
Or, to find the approximate increase in costs, we note that
\[ C'(q) = 100 + 4q, \]
and so, as $\Delta q = 51 - 50 = 1$, we have
\[ \Delta C \simeq C'(50)\,\Delta q = [100 + 4(50)](1) = 300. \]
Either way, we can see that the increase in costs resulting from an increase in
production from 50 to 51 units would be about £300.
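The comparison in Example 6.15 is easy to reproduce, and repeating it for smaller changes in production shows how the gap between the exact and approximate values shrinks. The sketch below uses the cost function from the example; the function names and the extra values of `dq` (0.5 and 0.1) are illustrative assumptions.

```python
def cost(q):
    """Cost function from Example 6.15, in pounds."""
    return 100 * q + 2 * q ** 2


def marginal_cost(q):
    """Derivative of the cost function, C'(q) = 100 + 4q."""
    return 100 + 4 * q


def exact_change(q0, dq):
    """Exact change in cost when production moves from q0 to q0 + dq."""
    return cost(q0 + dq) - cost(q0)


def approximate_change(q0, dq):
    """Approximate change in cost, C'(q0) * dq."""
    return marginal_cost(q0) * dq


for dq in (1, 0.5, 0.1):  # dq = 1 is the case considered in Example 6.15
    exact = exact_change(50, dq)
    approx = approximate_change(50, dq)
    print(dq, exact, approx, exact - approx)
```

For this particular quadratic cost function the gap works out algebraically to be $2(\Delta q)^2$, which is why it falls so quickly as $\Delta q$ gets smaller.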
The reason why we can use the derivative here is that, geometrically, $C'(q_0)$ is the
gradient of the tangent line, $T$, to the curve $y = C(q)$ at the point $(q_0, C(q_0))$ and so,
looking at this tangent line we can see that
\[ C'(q_0) = \left.\frac{dC}{dq}\right|_{q=q_0} \simeq \frac{\Delta C}{\Delta q} \implies \Delta C \simeq C'(q_0)\,\Delta q, \]
as shown in Figure 6.1. As such, the discrepancy between our exact and approximate
values for $\Delta C$ is the difference between the $y$-coordinates of the curve $y = C(q)$ and the
tangent line $T$ when $q = q_0 + \Delta q$. Obviously, the smaller $\Delta q$ is, the smaller this
discrepancy will be!
In fact, economists often work with marginal quantities and so, given a function $f(x)$,
we define the marginal function of $f$ to be $f'(x)$. This will allow us to find the
approximate value of $\Delta f$, the change in $f$ associated with a change in $x$ from $x_0$ to
$x_0 + \Delta x$, by using the formula
\[ \Delta f \simeq f'(x_0)\,\Delta x. \]
⁴ In a sense that we do not exactly specify in this course!

Figure 6.1: The curve $y = C(q)$ and the tangent, $T$, to this curve at the point $(q_0, C(q_0))$.
Looking at an increase in $q$ from $q_0$ to $q_0 + \Delta q$, we can see that the corresponding change
in the function $C(q)$, i.e. $\Delta C$, is given exactly by $C(q_0 + \Delta q) - C(q_0)$ and approximately
by $C'(q_0)\,\Delta q$ where $C'(q_0)$ is the gradient of the tangent line.

For example, we can define the following important marginal functions from economics.
If $C(q)$ is a cost function, $MC(q) = C'(q)$ is the marginal cost function.
If $R(q)$ is a revenue function, $MR(q) = R'(q)$ is the marginal revenue function.
If $\pi(q)$ is a profit function, $M\pi(q) = \pi'(q)$ is the marginal profit function.

Indeed, since we are using the derivative to just approximate certain changes in f , let’s
look at an example of what we can do with marginal functions defined in this way.
Example 6.16 The profit function for a firm is $\pi(q) = 100 + 20q - 2q^2$ pounds
when it is selling a quantity $q$. If the quantity sold is increased from 10 to 10.2, what
will be the change in profit?
The marginal profit is
\[ M\pi(q) = \pi'(q) = 20 - 4q, \]
and so the change in profit will, approximately, be given by
\[ \Delta\pi \simeq \pi'(10)\,\Delta q = [20 - 4(10)](0.2) = -4. \]
Hence, the profit will decrease by approximately £4 if the quantity sold is increased
from 10 to 10.2 units.
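To see how good the approximation in Example 6.16 is, we can compare it with the exact change in profit. This is a minimal sketch using the profit function from the example; the function and variable names are assumptions made for illustration.

```python
def profit(q):
    """Profit function from Example 6.16, in pounds."""
    return 100 + 20 * q - 2 * q ** 2


def marginal_profit(q):
    """Derivative of the profit function, 20 - 4q."""
    return 20 - 4 * q


q0, dq = 10, 0.2

approximate = marginal_profit(q0) * dq     # the estimate used in the example
exact = profit(q0 + dq) - profit(q0)       # the true change in profit

print("approximate change:", approximate)
print("exact change:", round(exact, 4))
```

Working the exact change by hand gives $\pi(10.2) - \pi(10) = 95.92 - 100 = -4.08$, so the marginal estimate of $-4$ is out by about 8 pence.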
Learning outcomes
use the product, quotient and chain rules to find derivatives;
use derivatives to find approximations.
Exercises
Exercise 6.1
For the following functions, identify the functions $f(x)$ and $g(x)$ such that the function
is $f(x)g(x)$ and hence find the derivative of the function using the product rule.
i. $x^2(x+2)$;
ii. $(2x+7)(x^5+2)$;
iii. $3x^4\sqrt{x}$;
iv. $(3x^2+2)\ln(x)$.

Also, in parts i., ii. and iii. check that your answer is correct by rewriting the function
and differentiating it without using the product rule. (Note that this check cannot be
performed in part iv.)
Exercise 6.2
For the following functions, identify the functions $f(x)$ and $g(x)$ such that the function
is $f(x)/g(x)$ and hence find the derivative of the function using the quotient rule.
i. $\dfrac{x+2}{x^2}$;
ii. $\dfrac{32x^5+3}{2x^5}$;
iii. $\dfrac{4x^2+1}{x^3-2x}$;
iv. $\dfrac{2x^2+7}{e^x}$.

Also, in parts i. and ii. check that your answer is correct by rewriting the function and
differentiating it without using the quotient rule. (Note that this check cannot easily be
performed in parts iii. and iv.)
Exercise 6.3
For the following functions, identify the functions $f(g)$ and $g(x)$ such that the function
is $f(g(x))$ and hence find the derivative of the function using the chain rule.
i. $(x+2)^2$;
ii. $(x^3+3x)^2$;
iii. $(x^4+3)^{-1}$;
iv. $\sqrt[3]{2x-1}$.

Also, in parts i. and ii. check that your answer is correct by rewriting the function and
differentiating it without using the chain rule. (Note that this check cannot be
performed in parts iii. and iv.)
Exercise 6.4
Differentiate the following functions using the appropriate rules.
i. $x^5(x^8+x^2)$;
ii. $(x^3+3)^3$;
iii. $\ln\left((x-3)^2\right)$;
iv. $\sqrt{\sqrt{x}+x}$;
v. $\dfrac{x^4}{1+2x^6}$;
vi. $6x^2(x^7+6)^{-2}$.
Exercise 6.5
Differentiate the following functions with respect to the independent variable using
whichever rule is appropriate.
i. $\dfrac{3}{y+1}$;
ii. $q^3 e^q$;
iii. $e^{-2p^2+p}$;
iv. $\dfrac{2z^5}{32z^5+3}$;
v. $\ln(y^3+3y^2+4)$;
vi. $\ln(e^z)$.
Exercise 6.6
The level of demand for a product, $q$, is linked to its price, $p$, by the equation
$p^2 q = 6{,}000$. By writing $q$ as a function of $p$ and differentiating, estimate how sales will
change if the price is increased from £10 to £10.50.
What is the exact value of the change in sales if the price is increased from £10 to
£10.50?
7. Calculus III — Optimisation
Unit 7: Calculus III

Optimisation
Overview
Having seen how to differentiate functions, we now turn our attention to some
applications of differentiation. In particular, we are interested in what derivatives can
tell us about the behaviour of a function. This will lead on to a study of how we can
optimise a function of one variable, i.e. how we can use differentiation to find the
maximum and/or minimum values of such a function, and how this information is
invaluable when we want to sketch their graphs.
Aims
To see what derivatives tell us about functions.

To apply this to optimisation and curve sketching problems.
7.1 What derivatives tell us about functions

In Unit 3 we saw how to use the completed square form of a quadratic to find the
turning point of a parabola and determine whether it was a maximum or a minimum.
But, if we have a curve which arises from a function that is more complicated than a
quadratic, this method is useless since we can’t complete the square. As such, we now
turn our attention to developing another method for optimising a function of one
variable, i.e. finding any maxima or minima that it may have. The advantage of this
method is that it will work for any function of one variable and it will rely heavily on
differentiation. However, before we detail the method itself, it is useful to discuss some
of the ideas behind it.
7.1.1 When is a function increasing or decreasing?

The first thing we want to note is that the sign of the first derivative at a point tells us
whether the function is increasing or decreasing at that point if we think of what is
happening as x itself is increasing. In particular, we note that:
If $f'(x) > 0$, then the function is increasing at that value of $x$ as illustrated in
Figure 7.1(a).
If $f'(x) < 0$, then the function is decreasing at that value of $x$ as illustrated in
Figure 7.1(b).
In particular, the key idea is that $f'(x)$ tells us the gradient of the tangent line to the
curve at this value of $x$, which is labelled $T$ in Figure 7.1, and if this is positive (or
negative) the function itself must be increasing (or decreasing).

Figure 7.1: As $x$ increases, we see that at the indicated value of $x$, the function $f(x)$ is
increasing in (a) and decreasing in (b). These correspond to points on the curve where
the tangent line has a positive or negative gradient in (a) and (b) respectively. That is,
the derivative, i.e. $f'(x)$, of the function at these values of $x$ will be positive or negative
respectively.
Quite apart from the application of this idea to optimising functions of one variable,
this idea can be useful in economic contexts as the following example shows.
Example 7.1 Consider a firm whose profit function is given by
$\pi(q) = 100 + 20q - 2q^2$ pounds when it sells $q$ units. For what values of $q$ is the
firm's profit decreasing with increasing $q$?
The firm's profit will be decreasing when $\pi'(q) < 0$, i.e. when we have
\[ \pi'(q) = 20 - 4q < 0 \implies 20 < 4q \implies 5 < q, \]
i.e. if $q > 5$, then the firm's profit is decreasing as $q$ increases. As such, it would be
unwise for the firm to produce more than five units since this puts them in a
position where their profits are decreasing!
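The conclusion of Example 7.1 can be illustrated by tabulating the profit and the marginal profit for a few quantities. This is only a sketch: the function names and the range of quantities shown are assumptions made for illustration.

```python
def profit(q):
    """Profit function from Example 7.1, in pounds."""
    return 100 + 20 * q - 2 * q ** 2


def marginal_profit(q):
    """Derivative of the profit function, 20 - 4q."""
    return 20 - 4 * q


def is_decreasing(q):
    """True when the profit is decreasing at q, i.e. when the derivative is negative."""
    return marginal_profit(q) < 0


for q in range(0, 9):
    print(q, profit(q), marginal_profit(q), is_decreasing(q))
```

In the printed table the profit rises up to $q = 5$ and falls afterwards, matching the sign of the marginal profit.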
7.1.2 Stationary points

We have seen that positive and negative values of $f'(x)$ correspond to points where $f(x)$
is increasing or decreasing. But, what happens when the derivative is zero? In such
cases we say that the function is stationary because it is neither decreasing nor
increasing as illustrated in Figure 7.2. We say that
If $f'(x) = 0$, then the function is stationary at that value of $x$. We call such values
of $x$ stationary points.
In particular, the key idea is that $f'(x)$ tells us the gradient of the tangent line to the
curve at this value of $x$, labelled $T$ in Figure 7.2, and when $f'(x) = 0$ we find that this
tangent line must be horizontal.

Figure 7.2: Two stationary points, i.e. points where the derivative is zero. Notice that in
(a) the stationary point is a maximum and in (b) it is a minimum.

Indeed, we can see that at the point in
Figure 7.2(a), as $x$ increases through the point, the function increases until it is
stationary and then it decreases again, i.e. we have
\[ f'(x) > 0 \text{ until } f'(x) = 0 \text{ and then } f'(x) < 0, \]
and in such circumstances we say that the stationary point of the function is a
maximum.
Figure 7.2(b), as $x$ increases through the point, the function decreases until it is
stationary and then it increases again, i.e. we have
\[ f'(x) < 0 \text{ until } f'(x) = 0 \text{ and then } f'(x) > 0, \]
and in such circumstances we say that the stationary point of the function is a
minimum.
That is, the maxima and minima of a function, $f(x)$, will be amongst its stationary
points, i.e. points where $f'(x) = 0$, and we can identify whether we have found a
maximum or minimum by seeing how the sign of $f'(x)$ changes as we move through the
stationary point.
A warning: Points of inflection
When looking for stationary points, i.e. finding the values of $x$ where $f'(x) = 0$, there
will be cases where what we find will be neither a maximum nor a minimum. In such
cases, we will have a stationary point which is a point of inflection. Indeed, if we look at
the stationary point in:
Figure 7.3(a), as $x$ increases through the point, the function increases until it is
stationary and then it increases again, i.e. we have
\[ f'(x) > 0 \text{ until } f'(x) = 0 \text{ and then } f'(x) > 0. \]
Figure 7.3(b), as $x$ increases through the point, the function decreases until it is
stationary and then it decreases again, i.e. we have
\[ f'(x) < 0 \text{ until } f'(x) = 0 \text{ and then } f'(x) < 0. \]
In both of these cases the stationary point is a point of inflection.

Figure 7.3: Two more stationary points, i.e. points where the derivative is zero. Notice
that in both (a) and (b) the stationary point is a point of inflection.

We will sometimes refer to stationary points which are maxima or minima as turning
points since the function ‘turns’ (or ‘changes direction’) at these points. However,
stationary points that are points of inflection are not turning points.
7.1.3 Second derivatives

So far, given a function of one variable, $f(x)$, we have used differentiation to find a new
function of $x$ called its derivative and we denote this by
\[ \frac{df}{dx}, \quad\text{or more compactly,}\quad f'(x). \]
However, this new function is also a function of $x$ and so we can differentiate again to
find its derivative. That is, having found
\[ \frac{df}{dx} \quad\text{we now work out}\quad \frac{d}{dx}\left(\frac{df}{dx}\right), \]
i.e. we differentiate the derivative again, and we denote the result of doing this by
\[ \frac{d^2 f}{dx^2}, \quad\text{or more compactly,}\quad f''(x). \]
Unsurprisingly, perhaps, we call this the second derivative of the original function, $f(x)$.
Of course, we could differentiate again to get the third derivative of $f(x)$ and so on, but
the third and higher derivatives of $f(x)$ are not necessary for this course.
Example 7.2 Given $f(x) = x^3 + x^2 + x$, its derivative is given by
\[ \frac{df}{dx} = 3x^2 + 2x + 1, \quad\text{or more compactly,}\quad f'(x) = 3x^2 + 2x + 1. \]
Now, if we want to find the second derivative of $f(x)$, we want to differentiate $f'(x)$,
i.e. we want
\[ \frac{d^2 f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right) = \frac{d}{dx}\left(3x^2 + 2x + 1\right) = 6x + 2, \]
or, more compactly, $f''(x) = 6x + 2$. Indeed, if we want to calculate the second
derivative at a certain point, say when $x = 2$, we can now evaluate
\[ \left.\frac{d^2 f}{dx^2}\right|_{x=2} = 6 \cdot 2 + 2 = 14, \quad\text{or more compactly,}\quad f''(2) = 6 \cdot 2 + 2 = 14. \]
Of course, we could now differentiate this again to get the third derivative of $f(x)$,
but we won't!
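For polynomials, differentiating twice is mechanical enough to write down in a few lines. The sketch below reproduces Example 7.2 by storing a polynomial as a list of coefficients; this representation and the names `differentiate` and `evaluate` are assumptions made for illustration.

```python
def differentiate(coefficients):
    """Differentiate a polynomial given as [a0, a1, a2, ...], meaning a0 + a1*x + a2*x^2 + ..."""
    return [power * a for power, a in enumerate(coefficients)][1:]


def evaluate(coefficients, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(a * x ** power for power, a in enumerate(coefficients))


f = [0, 1, 1, 1]                 # f(x) = x + x^2 + x^3, as in Example 7.2
f_first = differentiate(f)       # coefficients of 1 + 2x + 3x^2
f_second = differentiate(f_first)  # coefficients of 2 + 6x

print(f_first, f_second, evaluate(f_second, 2))
```

Here `f_second` holds the coefficients of $6x + 2$, and evaluating it at $x = 2$ gives 14, as found above.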
7.1.4 What second derivatives tell us about a function

Second derivatives give us another way of assessing whether a stationary point is a
maximum or a minimum. In particular, we note that given a function, $f(x)$, the sign of
its second derivative, $f''(x)$, tells us whether the derivative, $f'(x)$, is increasing or
decreasing at a given point. So, if $x = a$ is a stationary point of $f(x)$, i.e. a point where
$f'(x) = 0$, then we find that:
If $f''(a) < 0$, then $f'(x)$ is decreasing at $x = a$, i.e. we must have
\[ f'(x) > 0 \text{ until } f'(x) = 0 \text{ and then } f'(x) < 0, \]
as $x$ increases through the point where $x = a$. But, this means that:
If $f''(a) < 0$, then the stationary point is a maximum.
If $f''(a) > 0$, then $f'(x)$ is increasing at $x = a$, i.e. we must have
\[ f'(x) < 0 \text{ until } f'(x) = 0 \text{ and then } f'(x) > 0, \]
as $x$ increases through the point where $x = a$. But, this means that:
If $f''(a) > 0$, then the stationary point is a minimum.
However, if our stationary point is a point of inflection, we find that $f'(x)$ decreases (or
increases) until it is zero and then it increases (or decreases) as $x$ increases through the
point where $x = a$, i.e. we find that $f'(x)$ is neither increasing nor decreasing when
$x = a$, and this means that we will find $f''(a) = 0$.
Warning! But, having said this, do not think that $f''(a) = 0$ implies that a stationary
point is a point of inflection! The fact is that, in cases where $f''(a) = 0$, second
derivatives fail to tell us anything useful about the nature of a stationary point at
$x = a$. In particular, $f''(a) = 0$ is compatible with a stationary point being a maximum
or a minimum as well as a point of inflection! To see that this is the case, try the
following activity.
Activity 7.1 Consider the functions
\[ f(x) = x^4, \quad g(x) = x^3, \quad h(x) = -x^3 \quad\text{and}\quad i(x) = -x^4. \]
Show that all four of these functions have a stationary point at $x = 0$ (i.e. that their
first derivatives are zero when $x = 0$) and that their second derivatives are also zero
when $x = 0$.
By considering how the derivatives of these four functions change as you go through
the stationary point with $x = 0$, determine their nature.
Deduce that, at a stationary point, a second derivative of zero tells you nothing
about the nature of that stationary point.
7.1.5 A note on the ‘large x ’ behaviour of functions

Sometimes we will want to see what a function, say $f(x)$, is doing for ‘large $x$’ and, by
this, we mean what the function is doing when $|x|$ is large. That is, what happens when:
$x$ is large [in magnitude] and positive, e.g. when $x$ takes values like 1,000,000 and
even larger positive numbers and we think of this as telling us what happens to
$f(x)$ as $x$ goes to infinity, denoted by $x \to \infty$, which corresponds to places which
are far along the $x$-axis in the right-hand direction, or
$x$ is large in magnitude and negative, e.g. when $x$ is $-1{,}000{,}000$ and even larger [in
magnitude] negative numbers and we like to think of this as telling us what
happens as $x$ goes to minus infinity, denoted by $x \to -\infty$, which corresponds to
places which are far along the $x$-axis in the left-hand direction.
In particular, when dealing with polynomials, such as quadratics like
\[ ax^2 + bx + c, \]
for constants $a$, $b$ and $c$ where $a \neq 0$ and cubics like
\[ ax^3 + bx^2 + cx + d, \]
for constants $a$, $b$, $c$ and $d$ where $a \neq 0$ we can easily determine how these functions
behave for ‘large $x$’. The key to this is to isolate the highest power of $x$ in your
polynomial (so that's $ax^2$ in the quadratic above and $ax^3$ in the cubic) and then
consider that, if $x^n$ is this highest power, then:
If $n$ is even, $x^n$ will become arbitrarily large and positive as $x$ goes to
both $+\infty$ and $-\infty$.
If $n$ is odd, $x^n$ will become arbitrarily large and positive as $x$ goes to
$+\infty$ and arbitrarily large [in magnitude] and negative as $x$ goes to $-\infty$.
Of course, multiplying $x^n$ by a constant $a > 0$ will not change this behaviour, but if we
multiply $x^n$ by a constant $a < 0$, then the sign of the large $|x|$ behaviour above will
change.
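The rule for the ‘large $x$’ behaviour of a polynomial depends only on the sign of the leading coefficient and on whether the highest power is even or odd, so it can be written as a small function. This is a sketch, not part of the guide: the name `end_behaviour` and the strings it returns are assumptions chosen for illustration.

```python
def end_behaviour(leading_coefficient, degree):
    """Return (behaviour as x -> -infinity, behaviour as x -> +infinity)
    for a polynomial whose highest-power term is leading_coefficient * x**degree."""
    if leading_coefficient == 0 or degree < 1:
        raise ValueError("need a non-zero leading coefficient and degree at least 1")
    # As x -> +infinity, x**degree is large and positive, so the sign of the
    # leading coefficient decides the behaviour.
    right = "+infinity" if leading_coefficient > 0 else "-infinity"
    if degree % 2 == 0:
        # Even power: x**degree is also large and positive as x -> -infinity.
        left = right
    else:
        # Odd power: x**degree is large and negative as x -> -infinity.
        left = "-infinity" if right == "+infinity" else "+infinity"
    return left, right


print(end_behaviour(2, 4))    # a quartic with positive leading coefficient
print(end_behaviour(1, 3))    # a cubic with positive leading coefficient
print(end_behaviour(-1, 3))   # a cubic with negative leading coefficient
```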
7.2 Optimisation and curve sketching

We now summarise the method which we will use to optimise a function of one variable
such as f (x). Most of this will follow from what we saw in the previous section, but
there will be some points that won’t become clear until we consider some examples of
how it all works.
Step 1: Find all the stationary points of the function, i.e. all the values of $x$ that
satisfy the equation
\[ f'(x) = 0, \]
and, if necessary, their corresponding values of $y$ using $y = f(x)$.
Step 2: Determine the nature of the stationary points that you have found by using
one of the following two methods.
Method A: The first-derivative test: If, as $x$ increases through the stationary
point, we find that $f'(x)$ changes from:
    positive to positive, then it is a point of inflection,
    positive to negative, then it is a local maximum,
    negative to positive, then it is a local minimum,
    negative to negative, then it is a point of inflection.
This test will always work.
Method B: The second-derivative test: If, at the stationary point, we find that
$f''(x)$ is:
    negative, then it is a local maximum,
    positive, then it is a local minimum.
This test fails if we find that $f''(x) = 0$ at the stationary point and,
in such cases, the stationary point could be a local maximum, a
local minimum or a point of inflection!
Step 3: If necessary, we may need to identify any global maxima or global minima, i.e.
the largest or smallest values that the function can take over its domain. This
identification will involve some or all of the following:
Identifying the largest local maxima and the smallest local minima.
If the domain of the function is:
    • restricted, we must evaluate the value of the function at
      any endpoint(s).
    • unrestricted, we must consider its behaviour as $x$
      becomes arbitrarily large in magnitude (i.e. as ‘$x \to +\infty$’
      or ‘$x \to -\infty$’).
In what follows we will see how the first two steps of this method work, how this helps
us when we want to sketch curves, examine what is involved in Step 3 and see how this
method can be used in economics.
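The first-derivative test in Step 2 can be sketched in code by sampling the sign of the derivative just to the left and just to the right of a stationary point. The function name, the `offset` of 0.001 and the two derivatives used to try it out are assumptions made for illustration; in particular, the sketch assumes that the offset is small enough for no other stationary point to lie within it.

```python
def classify_stationary_point(f_prime, a, offset=1e-3):
    """First-derivative test at a stationary point x = a.

    Assumes f_prime(a) = 0 and that no other stationary point lies
    within `offset` of a.
    """
    left = f_prime(a - offset)    # sign of the derivative just before a
    right = f_prime(a + offset)   # sign of the derivative just after a
    if left > 0 and right < 0:
        return "local maximum"
    if left < 0 and right > 0:
        return "local minimum"
    if left * right > 0:
        return "point of inflection"
    return "inconclusive"


def cubic_prime(x):
    """Derivative of x^3 - 3x^2, which is zero at x = 0 and x = 2."""
    return 3 * x ** 2 - 6 * x


def shifted_cube_prime(x):
    """Derivative of (x - 1)^3, which is zero at x = 1."""
    return 3 * (x - 1) ** 2


print(classify_stationary_point(cubic_prime, 0))
print(classify_stationary_point(cubic_prime, 2))
print(classify_stationary_point(shifted_cube_prime, 1))
```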
7.2.1 Steps 1 and 2: Finding and classifying stationary points

Let’s start by considering a straightforward example of the first two steps of this
method in action.
Example 7.3 Find the stationary points of the function
\[ f(x) = x^3 - 3x^2, \]
and determine their nature.
For Step 1, we find the stationary points of the function by solving $f'(x) = 0$ and so,
as
\[ f'(x) = 3x^2 - 6x, \]
we solve the equation
\[ 3x^2 - 6x = 0 \implies 3x(x - 2) = 0 \implies x = 0 \text{ or } x = 2, \]
to see that stationary points occur when $x = 0$ and $x = 2$. Indeed, at these values of
$x$, we see that the function itself takes the values
\[ f(0) = (0)^3 - 3(0)^2 = 0, \]
when $x = 0$, and
\[ f(2) = (2)^3 - 3(2)^2 = 8 - 12 = -4, \]
when $x = 2$.
For Step 2, we can determine the nature of these points by using the
second-derivative test. To do this, we see that
\[ f''(x) = 6x - 6 = 6(x - 1), \]
so that,
when $x = 0$ we have $f''(0) = -6 < 0$ and so this stationary point is a local
maximum, and
when $x = 2$ we have $f''(2) = 6 > 0$ and so this stationary point is a local
minimum.
Thus, the function $f(x)$ has a local maximum when $x = 0$ and $f(x) = 0$, and a local
minimum when $x = 2$ and $f(x) = -4$.
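Steps 1 and 2 for Example 7.3 can be retraced in a few lines of code. In this sketch the equation $f'(x) = 0$ is solved with the quadratic formula rather than by factorising, which is a swapped-in technique chosen because it is easy to program; all of the names are illustrative assumptions.

```python
import math


def f(x):
    """The function from Example 7.3."""
    return x ** 3 - 3 * x ** 2


def f_second(x):
    """Its second derivative, 6x - 6."""
    return 6 * x - 6


# Step 1: f'(x) = 3x^2 - 6x + 0, so solve a*x^2 + b*x + c = 0.
a, b, c = 3.0, -6.0, 0.0
discriminant = b * b - 4 * a * c
stationary_points = sorted([
    (-b - math.sqrt(discriminant)) / (2 * a),
    (-b + math.sqrt(discriminant)) / (2 * a),
])

# Step 2: the second-derivative test (inconclusive if the second derivative is zero).
results = []
for x in stationary_points:
    if f_second(x) < 0:
        nature = "local maximum"
    elif f_second(x) > 0:
        nature = "local minimum"
    else:
        nature = "test fails"
    results.append((x, f(x), nature))

print(results)
```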
Of course, we can also tackle similar problems for more complicated functions as you
can see if you try the next activity.
Activity 7.2 Find and classify the stationary points of the function $g(x) = x^2 e^x$.
And, lastly, if you want to have a look at an example where the second-derivative test
fails at one of the stationary points, try the next activity.
Activity 7.3 Find and classify the stationary points of the function $h(x) = x^3 e^{-x}$.
7.2.2 Curve sketching

Now that we can find the stationary points of functions that are more complicated than
quadratics, we are in a position where we can sketch the curves represented by y = f (x)
for such functions, f (x). This is a useful skill in its own right, but it will also be useful
for us to have a graphical representation of the three functions we considered above
when we come to talk about the third step of our method.
Example 7.4 Sketch the curve with equation $y = f(x)$ where, as above,
$f(x) = x^3 - 3x^2$.
We find the ‘key features’ of this curve, namely:
The $y$-intercept of the curve occurs when $x = 0$ and so, substituting this into its
equation, we get $y = 0$ as the $y$-intercept.
The $x$-intercepts of the curve occur when $y = 0$ and so we have to solve the
equation
\[ x^3 - 3x^2 = 0 \implies x^2(x - 3) = 0 \implies x = 0 \text{ or } x = 3. \]
Thus, $x = 0$ and $x = 3$ are the $x$-intercepts.
The stationary points of the curve were found above, i.e. we found that it had a
local maximum at the point $(0, 0)$ and a local minimum at the point $(2, -4)$.
With this information, we can begin to sketch the curve by roughly indicating these
‘key features’ on some axes as in Figure 7.4(a) and then, joining them up with a nice
smooth curve, we get the sketch itself as in Figure 7.4(b). In particular, notice that
in our sketch we have:
For $x > 2$, the function is increasing and, as we know that it must cut the $x$-axis
at $x = 3$, we expect it to increase in the manner shown as $x$ goes to $+\infty$.
For $x < 0$, the function is increasing as $x$ increases towards 0 and so, as $x$ goes
to $-\infty$, it goes to more and more negative values of $y$.
As we shall see, it is often useful to spend a moment thinking about what the curve
does away from its ‘key features’ so that we can accurately represent it in our sketch.

Figure 7.4: Sketching the curve $y = x^3 - 3x^2$ in Example 7.4. (a) Using what we
have discovered about the ‘key features’ of the curve, we can begin to see what it must
look like. (b) By joining up these ‘key features’ with a nice smooth curve, we get the
sketch itself.
Example 7.5 Sketch the curve $y = f(x)$ where $f(x) = 2x^4 - 4x^3 + 2x^2$.
We find the key features of this curve according to the list given above, namely:
$x$-intercepts: These occur when $y = 0$ and so we solve the equation given by
$f(x) = 0$, i.e.
\[ 2x^4 - 4x^3 + 2x^2 = 0, \]
which, on taking out the common factor of $2x^2$ and factorising the remaining
quadratic, gives us
\[ 2x^2(x^2 - 2x + 1) = 0 \implies 2x^2(x - 1)^2 = 0. \]
Thus, the $x$-intercepts occur when $x = 0$ and $x = 1$.
$y$-intercept: This occurs when $x = 0$ and so using $y = f(0)$ we see that the
$y$-intercept occurs when $y = 0$. Note, in particular, that this means that the
curve goes through the origin (as we should have expected since one of the
$x$-intercepts occurs when $x = 0$).
Finding the stationary points: These occur when $f'(x) = 0$ and so, noting that
\[ f'(x) = 8x^3 - 12x^2 + 4x, \]
we solve the equation
\[ 8x^3 - 12x^2 + 4x = 0, \]
which, on taking out a common factor of $4x$ and factorising the remaining
quadratic, gives us
\[ 4x(2x^2 - 3x + 1) = 0 \implies 4x(2x - 1)(x - 1) = 0, \]
and so the stationary points occur when $x = 0$, $x = 1/2$ and $x = 1$. Then, we
use $y = f(x)$ to find the values of $y$ at these points so that we can locate them
on the sketch. Doing this, we find that
    • $x = 0$ gives $y = f(0) = 0$,
    • $x = 1/2$ gives $y = f(1/2) = 1/8$, and
    • $x = 1$ gives $y = f(1) = 0$.
So, the stationary points have coordinates given by $(0, 0)$, $(1/2, 1/8)$ and $(1, 0)$.
Classifying the stationary points: Let's use the second-derivative test here.
We can see that
\[ f''(x) = 24x^2 - 24x + 4, \]
and so, looking at the stationary points, we have
    • $f''(0) = 4 > 0$ and so $(0, 0)$ is a local minimum;
    • $f''(1/2) = -2 < 0$ and so $(1/2, 1/8)$ is a local maximum; and
    • $f''(1) = 4 > 0$ and so $(1, 0)$ is a local minimum.
Limiting behaviour: The term with the highest power of $x$ in $f(x)$ is $2x^4$ and so
$f(x) \to \infty$ as $x \to \infty$ and as $x \to -\infty$.
With this information, we begin to sketch this curve by roughly indicating these key
features on some axes as in Figure 7.5(a) and then, joining them up with a nice
smooth curve, we get the sketch itself as in Figure 7.5(b).

Figure 7.5: Sketching the curve $y = 2x^4 - 4x^3 + 2x^2$ in Example 7.5. (a) Using what we
have discovered about the key features of the curve, we can begin to see what it must
look like. (b) By joining up these key features with a nice smooth curve, we get the sketch
itself.
7.2.3 Step 3: Looking for global maxima and global minima

In Step 3, we are concerned with identifying the largest and smallest values a function
can attain, i.e. its global maximum and its global minimum respectively, if they exist! Of
course, if we have sketched the graph of the function, it should be easy for us to identify
any such largest or smallest values of a function. We will use the two curves that we
sketched in Examples 7.4 and 7.5 to illustrate these points in the two key cases.
(a) If the domain of the function is unrestricted
In these cases, we are free to consider any value of x and we want to find the largest and
smallest values a function can attain, i.e. its global maximum and its global minimum
respectively, if they exist! In particular, we will need to consider the value of the
function at any stationary points and the behaviour of the function for large |x|.
For the curve sketched in Example 7.4, as sketched in Figure 7.4(b), we see that
although there is a local minimum at $(2, -4)$ and a local maximum at $(0, 0)$, there
is no global minimum as $f(x)$ can take arbitrarily large [in magnitude] negative values
as $x$ goes to $-\infty$ and there is no global maximum as $f(x)$ can take arbitrarily large
positive values as $x$ goes to $+\infty$.
For the curve sketched in Example 7.5, as sketched in Figure 7.5(b), we see that
although there is a local maximum at $(1/2, 1/8)$, there is no global maximum as $f(x)$
can take arbitrarily large positive values as $x$ goes to $-\infty$ or $+\infty$. We also have
local minima at $(0, 0)$ and $(1, 0)$ and as these both give us the smallest value that
the function can take (i.e. $y = 0$), we see that both of these points give us a global
minimum.
(b) If the domain of the function is restricted
In these cases, the values of $x$ that we are free to consider are restricted to some interval
such as $a \leq x \leq b$ and we want to find the largest and smallest values a function can
attain, i.e. its global maximum and its global minimum respectively, if they exist! In
particular, in these cases, we need to consider the value of the function at any
stationary points and its value at the endpoints of the interval.
For the curve sketched in Example 7.4 with x in the interval 1 ≤ x ≤ 3, as sketched
in Figure 7.6(a), we see there is a local minimum at (2, −4) and this is the global
minimum as y = −4 is the smallest value that the function can take. Also, we see
that there is a global maximum at the endpoint where x = 3 as y = f (3) = 0 is the
largest value that the function can take. (Of course, this global maximum is an
endpoint of the interval but not a stationary point of the function!)
For the curve sketched in Example 7.5 with x in the interval −1/4 ≤ x ≤ 5/4, as
sketched in Figure 7.6(b), we see that the local minima at (0, 0) and (1, 0) still both
give us the smallest value that the function can take (i.e. y = 0), and so both of
these points still give us a global minimum. Also, the local maximum at (1/2, 1/8)
is now a global maximum as y = 1/8 is now the largest value that the function can
take.
Figure 7.6: (a) The function from Example 7.4 when we only consider values of x in the
interval 1 ≤ x ≤ 3 and (b) the function from Example 7.5 when we only consider values
of x in the interval −1/4 ≤ x ≤ 5/4. In particular, the dotted parts of these curves are
irrelevant here because they correspond to values of x which are not in the given intervals.
7.2.4 An economic application: Profit maximisation

Optimisation problems are very common in economics and we know one way in which
they can arise in that subject, namely when a firm wants to find the level of production
that maximises its profit. In particular, when a firm sells an amount, q, it makes a profit
given by
π(q) = R(q) − C(q),
where R(q) is the revenue generated by selling this amount and C(q) is the cost of
producing this amount. Obviously, when doing this, the firm will want to sell an
amount q that will maximise its profit. Indeed, whereas the costs involved are
determined by factors intrinsic to the firm, the revenue generated is given by
R(q) = pq,
where p, the price per unit, is determined by the market the firm is selling in.
As an example, consider the case where the firm is a monopoly, i.e. it is the only supplier
of this product to the market. Indeed, as it is the only supplier and the amount it is
supplying is q, the price that the consumers will be willing to pay for this is given by
$p = p_D(q)$ where $p_D(q)$ is, as in Section 4.2.1, the inverse demand function of the market.
As such, in this case, the revenue generated by the sale of an amount q is given by
\[ R(q) = q\,p_D(q), \]
and this will yield a profit of
\[ \pi(q) = q\,p_D(q) - C(q). \]
Thus, in the case of a monopoly, given the firm’s cost function and the inverse demand
function for the market, we should be able to determine the amount, q, that the firm
should be selling by finding the value of q that maximises the firm’s profit. Let’s look at
an example.
Example 7.6 Suppose that a firm is a monopoly with a cost function given by
\[ C(q) = q^3 - 10q^2 + 25q + 10, \]
and the inverse demand function for this good is
\[ p_D(q) = 10 - q. \]
Find the value of q that will maximise the firm’s profit.
Here there is an implicit restriction on the values of q that we can consider because
we must have
q ≥ 0 as q denotes the amount of the good being sold, and
q ≤ 10 as, otherwise, the price that the consumers pay would be negative.
So, we need to maximise the firm’s profit, i.e.
\[ \pi(q) = q\,p_D(q) - C(q) = q(10 - q) - (q^3 - 10q^2 + 25q + 10) = -q^3 + 9q^2 - 15q - 10, \]
given that q is in the interval given by 0 ≤ q ≤ 10.
To do this, we note that $\pi'(q)$ is given by
\[ \pi'(q) = -3q^2 + 18q - 15, \]
and so, as the stationary points occur when $\pi'(q) = 0$, we solve the equation
\[ -3q^2 + 18q - 15 = 0 \implies q^2 - 6q + 5 = 0 \implies (q - 1)(q - 5) = 0, \]
to see that the stationary points occur when q = 1 and q = 5. We can then see that
\[ \pi''(q) = -6q + 18, \]
which, using the second-derivative test, tells us that when:
q = 1, we have $\pi''(1) = 12 > 0$, and so this is a local minimum.
q = 5, we have $\pi''(5) = -12 < 0$, and so this is a local maximum.

This means that the point we seek, i.e. the maximum of the profit function, must
occur at q = 5 or at one of the two endpoints of our interval. But, using the profit
function, we see that
π(0) = −10, π(5) = 15 and π(10) = −260,
which means that the maximum occurs at q = 5 because it yields the largest profit.
Thus, q = 5 will maximise the firm’s profit.
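The reasoning in Example 7.6 can also be checked numerically. The following is only a minimal sketch, assuming the profit function and the interval 0 ≤ q ≤ 10 from the example; the function names are chosen here for illustration and are not part of the guide. It compares the profit at the endpoints of the interval with the profit at the stationary points found from the quadratic formula.

```python
import math

def profit(q):
    # Profit function from Example 7.6: pi(q) = -q^3 + 9q^2 - 15q - 10.
    return -q**3 + 9 * q**2 - 15 * q - 10

def stationary_points():
    # Solve pi'(q) = -3q^2 + 18q - 15 = 0 using the quadratic formula.
    a, b, c = -3.0, 18.0, -15.0
    disc = b * b - 4 * a * c
    roots = [(-b + math.sqrt(disc)) / (2 * a), (-b - math.sqrt(disc)) / (2 * a)]
    return sorted(roots)

def maximise_on_interval(lo, hi):
    # Candidates for a global maximum on a restricted domain:
    # the two endpoints and any stationary points inside the interval.
    candidates = [lo, hi] + [q for q in stationary_points() if lo <= q <= hi]
    return max(candidates, key=profit)

best_q = maximise_on_interval(0.0, 10.0)
print(best_q, profit(best_q))
```

Listing the candidates explicitly mirrors the argument in the example: on a restricted domain, only the endpoints and the stationary points need to be compared.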
Activity 7.4 Sketch the profit function from Example 7.6 and verify that q = 5
does indeed maximise the profit. (Do not try to find the q-intercepts here.)
Learning outcomes
explain what a derivative tells us about a function;

find and classify stationary points;
sketch a curve;
solve optimisation problems.
Exercises
Exercise 7.1
Use differentiation to find the stationary point of the following quadratic functions and
determine whether it is a local maximum or a local minimum using (a) the first
derivative test and (b) the second derivative test.
i. $f(x) = -3x^2 + 6x - 20$; ii. $g(x) = 3x^2 + 6x + 20$.
Verify your answers by completing the square.

Exercise 7.2
Find the stationary points of the following functions and determine whether they are a
local maximum, a local minimum or neither of these. In each case, determine whether
any of the points you have found are global.
i. $f(x) = \frac{x^3}{3} - 2x^2 + 3x - 15$; ii. $g(x) = 2x^3 + 3x^2 + 12x - 6$.
Exercise 7.3
A firm has a monopoly on its market and so it can decide the price at which it sells its
product. If it sells the product for price p, then demand is given by the equation
q = 300 − 2p where q is the amount sold. The cost of producing q is given by the
function
\[ C(q) = 30 + 30q - \frac{q^2}{10}, \]
and the revenue is given by the function R(q) = pq.
i. Find the revenue function, R(q), in terms of q and hence find the profit function,
π(q), in terms of q.
ii. Calculate the value of q that will give the firm its maximum profit making sure
that you check that this value of q does indeed give you the maximum profit. What
is the maximum profit that the firm makes and what price, p, will provide this?
iii. If the firm can produce at most 120 units, what price will maximise the profit?
8. Calculus IV — Integration
Unit 8: Calculus IV
Integration
Overview
Our last topic in calculus is integration which can be thought of as the ‘opposite’ of
differentiation. In this unit we will see how to find indefinite integrals and explore the
relationship between definite integrals and the area under a curve.
Aims
To see how indefinite integrals are related to derivatives via antiderivatives.

To introduce the techniques for finding simple indefinite integrals.
To examine the relationship between definite integrals and areas.
8.1 Indefinite integrals

In Unit 5, we introduced differentiation and saw that a function, f (x), could be
differentiated with respect to x to yield its derivative, which we denoted by
\[ \frac{df}{dx} \quad\text{or}\quad f'(x). \]
And, in particular, we saw how to find such derivatives by using the rules of
differentiation and some standard derivatives. Now, given a function, f (x), we want to
make sense of what it means to integrate it and we start by looking at the indefinite
integral of this function with respect to x, which is denoted by
\[ \int f(x)\,dx. \]
In such cases, as we are integrating the function f (x) with respect to x, we call it the
integrand. And, similarly to what we saw before, we will see how to find such integrals
by using the rules of integration and some standard integrals. In particular, the standard
integrals will be closely related to our standard derivatives since the key idea behind our
method for finding integrals will be the idea that integration is the process that
‘undoes’ (or ‘reverses’) the process of differentiation, i.e. the process of indefinite
integration can be thought of as antidifferentiation and the resulting indefinite integral
can be thought of as an antiderivative.
Consider the functions F(x) and f(x) where we know that f(x) is the derivative¹ of
F(x), i.e.
\[ \frac{dF}{dx} = f(x). \]
Now, using the idea that integration ‘undoes’ differentiation, i.e. if we integrate f(x)
with respect to x we are looking for a function, F(x), whose derivative is f(x), we can
see that
\[ \int f(x)\,dx \quad\text{must be, more or less, given by}\quad F(x). \]
In such cases, we say that F (x) is an antiderivative of f (x) as opposed to, say, the
indefinite integral.
However, you may wonder why we say that the function, F (x), that we found above is
‘an’, as opposed to ‘the’, antiderivative of f (x). The reason for this is that if, instead of
the function F (x) we had the function F (x) + c where c is a constant, then its
derivative would still be f (x), i.e.

\[ \frac{d}{dx}\bigl[F(x) + c\bigr] = f(x), \]
and so, using the reasoning above, we would find that
\[ \int f(x)\,dx \quad\text{can also, more or less, be given by}\quad F(x) + c, \]
where c is a constant. That is, F (x) + c is also an antiderivative of f (x) for this
constant c.
Example 8.1 Show that $4x^2$ and $4x^2 + 1$ are both antiderivatives of $8x$.
$4x^2$ is an antiderivative of $8x$ as we can differentiate $4x^2$ to get $8x$. But, similarly, we
can see that $4x^2 + 1$ is also an antiderivative of $8x$ as we can differentiate $4x^2 + 1$ to
get $8x$.
As such, because this works for any constant c we add to F (x), we say that the
indefinite integral gives us a whole family of antiderivatives which only differ by a
constant, i.e. the choice of c. In this way, we say that indefinite integration, i.e. the
process of finding
\[ \int f(x)\,dx, \]
is antidifferentiation, i.e. it seeks all the functions F (x) + c that can be differentiated to
yield f (x) and, as such, every one of these functions will be an antiderivative of f (x).
Example 8.2 What is $\int 8x\,dx$?
We saw in Example 8.1 that $4x^2$ is an antiderivative of $8x$. This means that
\[ \int 8x\,dx = 4x^2 + c, \]
where c is an arbitrary (i.e. any) constant. Notice that this works because
differentiating $4x^2 + c$ we get $8x$.
¹ We say that it is the derivative because differentiation always yields exactly one answer.
Generally speaking then, we have the following.
Key concepts in integration
If F(x) is a function whose derivative is the function f(x), then we have
\[ \int f(x)\,dx = F(x) + c, \]
where c is an arbitrary constant. In particular, we call the
function, f(x), the integrand as it is what we are integrating,
function, F(x), an antiderivative as its derivative is f(x),
constant, c, a constant of integration which is completely arbitrary,² and
integral, $\int f(x)\,dx$, an indefinite integral since, in the result, c is arbitrary.
Now that we have the idea, let’s see how we’re going to actually find the indefinite
integrals of the functions that commonly occur in this course.
8.1.1 Finding simple indefinite integrals
We have seen how to find indefinite integrals using antiderivatives, but now we want to
explore a more convenient way of finding them. The key idea is that, similar to what we
saw in Unit 5 when we introduced derivatives, we can introduce standard integrals
which tell us how to integrate our basic functions and once we know how to integrate
these, the rules of integration will allow us to integrate combinations of these functions.
Standard integrals
In Example 8.2, we used the idea that indefinite integration is antidifferentiation to
show that the function f(x) = 8x has an indefinite integral given by
\[ \int 8x\,dx = 4x^2 + c, \]
where c is an arbitrary constant. We now state some results that will allow us to find
the indefinite integrals of our other basic functions.
² As we can add any constant to F(x) to account for the fact that F(x) + c, for any constant c, is also
an antiderivative.
Constant powers of x
If k ≠ −1 is a constant, we have
\[ \int x^k\,dx = \frac{x^{k+1}}{k+1} + c, \]
where c is an arbitrary constant and this works because
\[ \frac{d}{dx}\left[\frac{x^{k+1}}{k+1} + c\right] = \frac{(k+1)x^k}{k+1} + 0 = x^k. \]
In particular, if k = 0, we have
\[ \int 1\,dx = \int x^0\,dx = x + c, \]
and this works because the derivative of x + c is 1.

However, if we have k = −1, we have
\[ \int x^{-1}\,dx = \int \frac{1}{x}\,dx = \ln|x| + c, \]
where we need the modulus sign in ln |x| as x may be negative but the logarithm
function is only defined for x > 0. This works because, if x > 0, we have |x| = x and so
\[ \frac{d\ln|x|}{dx} = \frac{d\ln(x)}{dx} = \frac{1}{x}, \]
whereas if x < 0, we have |x| = −x and so
\[ \frac{d\ln|x|}{dx} = \frac{d\ln(-x)}{dx} = \frac{-1}{-x} = \frac{1}{x}, \]
if we use the chain rule.
Exponential and logarithm functions
If we are using e, we have
\[ \int e^x\,dx = e^x + c, \]
where c is an arbitrary constant and this works because
\[ \frac{d}{dx}\bigl[e^x + c\bigr] = e^x. \]
However, there is no nice standard integral for ln(x) and so we won’t really discuss the
indefinite integral
\[ \int \ln(x)\,dx, \]
in this course. But, if you’re interested in what it is, see Exercise 8.2.
Sine and cosine functions
For the sine and cosine functions we find that
\[ \int \sin(x)\,dx = -\cos(x) + c \quad\text{and}\quad \int \cos(x)\,dx = \sin(x) + c, \]
where c is an arbitrary constant. The former works because
\[ \frac{d}{dx}\bigl[-\cos(x) + c\bigr] = -\bigl(-\sin(x)\bigr) + 0 = \sin(x), \]
whereas the latter works because the derivative of sin(x) is cos(x).
Standard integrals: summary
In summary, if c is an arbitrary constant, we have the following standard integrals.
Standard integrals
If k ≠ −1 is a constant, then $\int x^k\,dx = \frac{x^{k+1}}{k+1} + c$.
In particular, if k = 0, we have $\int 1\,dx = \int x^0\,dx = x + c$.
$\int x^{-1}\,dx = \ln|x| + c$.
$\int e^x\,dx = e^x + c$.
$\int \sin(x)\,dx = -\cos(x) + c$.
$\int \cos(x)\,dx = \sin(x) + c$.
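Since each standard integral is justified by differentiating its right-hand side, the table can be spot-checked numerically. The sketch below is an illustration only: it approximates derivatives with a central difference (a technique not covered in this guide), uses k = 3 as one sample power, and the sample points and tolerance are arbitrary choices made here.

```python
import math

def derivative(F, x, h=1e-6):
    # Central-difference approximation to F'(x).
    return (F(x + h) - F(x - h)) / (2 * h)

SAMPLE_POINTS = [-2.0, 0.5, 3.0]  # includes a negative x to exercise ln|x|

# Each pair is (antiderivative F with c = 0, integrand f).
standard_integrals = [
    (lambda x: x**4 / 4, lambda x: x**3),          # power rule with k = 3
    (lambda x: x, lambda x: 1.0),                  # power rule with k = 0
    (lambda x: math.log(abs(x)), lambda x: 1 / x), # the k = -1 case
    (math.exp, math.exp),
    (lambda x: -math.cos(x), math.sin),
    (math.sin, math.cos),
]

def check_all(tol=1e-5):
    # True if F'(x) agrees with f(x) at every sample point, up to tol.
    return all(
        abs(derivative(F, x) - f(x)) < tol
        for F, f in standard_integrals
        for x in SAMPLE_POINTS
    )

print(check_all())
```

Adding any constant c to an antiderivative in the list leaves the check unchanged, which is the numerical counterpart of the constant of integration being arbitrary.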
We now look at how we can integrate some simple combinations of these functions.
8.1.2 The basic rules of integration

In Section 4.1.3, we saw that there are five ways of combining given functions to make
new ones and, in Section 5.2.2, we saw how the rules of differentiation could be used to
differentiate these new functions. Here we will see how we can use rules of integration to
integrate some of the simplest new functions that we can make, namely constant
multiples and sums.³
³ In particular, the rules of integration that involve the other new functions that we can create (namely
products, quotients and compositions) are beyond the scope of this course!
The constant multiple rule
The constant multiple rule tells us how to integrate a constant multiple of a function
f (x) and it works as follows.
Constant multiple rule

If k is a constant and f is a function, then $\int k f(x)\,dx = k \int f(x)\,dx$.
For example:
\[ \int -3x^{-1/2}\,dx = -3\int x^{-1/2}\,dx = -3\left(\frac{x^{1/2}}{\frac{1}{2}}\right) + c = -6x^{1/2} + c. \]
\[ \int 2e^x\,dx = 2\int e^x\,dx = 2e^x + c. \]
\[ \int \frac{7}{x}\,dx = 7\int x^{-1}\,dx = 7\ln|x| + c. \]
\[ \int -4\sin(x)\,dx = -4\int \sin(x)\,dx = -4\bigl(-\cos(x)\bigr) + c = 4\cos(x) + c. \]
So, in these cases, we just integrate as before and then multiply the answer by the
appropriate constant multiple.
In particular, observe that when using this rule, we integrate to find one of the
antiderivatives and then just add on an arbitrary constant, c, to take care of the
constant of integration.
Activity 8.1 Use antiderivatives to show that the constant multiple rule works.
The sum rule
The sum rule tells us how to integrate the sum of two functions f (x) and g(x) and it
works as follows.
Sum rule
If f and g are functions, then $\int [f(x) + g(x)]\,dx = \int f(x)\,dx + \int g(x)\,dx$.
For example:
\[ \int \bigl[x^2 + x^{1/2}\bigr]\,dx = \int x^2\,dx + \int x^{1/2}\,dx = \frac{x^3}{3} + \frac{x^{3/2}}{\frac{3}{2}} + c = \frac{1}{3}x^3 + \frac{2}{3}x^{3/2} + c. \]
\[ \int [\sin(x) + \cos(x)]\,dx = \int \sin(x)\,dx + \int \cos(x)\,dx = -\cos(x) + \sin(x) + c. \]
\[ \int \left[\frac{1}{x} + e^x\right]dx = \int x^{-1}\,dx + \int e^x\,dx = \ln|x| + e^x + c. \]
So, in these cases, we just integrate as before and then add the answers together.
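In particular, observe that when using this rule, we integrate to find the two
antiderivatives and then just add on a single arbitrary constant, c, to take care of the
constant of integration.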
Activity 8.2 Use antiderivatives to show that the sum rule works.
The linear combination rule
It should be clear that, taken together, our two rules of integration enable us to
integrate functions of the form kf (x) + lg(x) as follows.
Linear combination rule
If k and l are constants and f (x) and g(x) are functions then
\[ \int [k f(x) + l g(x)]\,dx = k\int f(x)\,dx + l\int g(x)\,dx. \]
For example:
\[ \int [2x + 5]\,dx = 2\int x\,dx + 5\int 1\,dx = 2\left(\frac{x^2}{2}\right) + 5x + c = x^2 + 5x + c. \]
\[ \int [\sin(x) - \cos(x)]\,dx = \int \sin(x)\,dx - \int \cos(x)\,dx = -\cos(x) - \sin(x) + c. \]
\[ \int \left[\frac{3}{x} - 4e^x\right]dx = 3\int x^{-1}\,dx - 4\int e^x\,dx = 3\ln|x| - 4e^x + c. \]
So, in these cases, we just integrate as before and then combine the answers in the
obvious way.
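In particular, observe that when using this rule, we integrate to find the two
antiderivatives and then just add on a single arbitrary constant, c, to take care of the
constant of integration.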
Activity 8.3 Show that the constant multiple rule and the sum rule do indeed give
the linear combination rule.
Hence, use the linear combination rule to find the integral of the function
f (x) − g(x) in terms of the integrals of the functions f (x) and g(x).
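For polynomials, the power rule and the linear combination rule together amount to a simple term-by-term recipe. The following is a rough sketch of that recipe, under the assumption that a polynomial is stored as a list of coefficients (a representation chosen here for illustration, not one used in the guide) and that the constant of integration is taken to be zero.

```python
from fractions import Fraction

def integrate_polynomial(coeffs):
    """Return the coefficients of one antiderivative of a polynomial.

    coeffs[k] is the coefficient of x**k. Each term a*x**k is integrated
    with the power rule to give a*x**(k+1)/(k+1), and the terms are then
    combined using the linear combination rule.
    """
    antiderivative = [Fraction(0)]  # the arbitrary constant c, taken as 0 here
    for k, a in enumerate(coeffs):
        antiderivative.append(Fraction(a) / (k + 1))
    return antiderivative

# The integrand 2x + 5 from the text has coefficients [5, 2].
result = integrate_polynomial([5, 2])
print(result)
```

Exact fractions are used rather than floating-point numbers so that coefficients such as 5/2 are not rounded.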
Activity 8.4 Use the rules above to find the following integrals.
(a) $\int -3\cos(x)\,dx$, (b) $\int [e^x + \cos(x)]\,dx$, (c) $\int \left[3\sin(x) - \frac{3}{x}\right]dx$.
There are, of course, more rules of integration which correspond to the product and
chain rules for differentiation, but these are beyond the scope of this course.
8.2 Definite integrals and areas

So far, we have been looking at indefinite integrals and we have been finding them by
using the idea of an antiderivative to deduce standard integrals and rules of integration.
We now turn to the geometric interpretation of an integral and this involves introducing
the idea of a definite integral and seeing what it represents.
Definite integrals and what they represent
In Section 5.1 we saw that the derivative of a function, f (x), gave us the gradient of the
curve y = f (x). We now consider what the integral of a function, f (x), tells us about
the curve y = f (x) and see how this comes about through the idea of a definite integral.
What is a definite integral?
Recall that an indefinite integral is so-called since, given a function, f (x), and one of its
antiderivatives, F(x), i.e. two functions related by the fact that
\[ \frac{dF}{dx} = f(x), \]
we have
\[ \int f(x)\,dx = F(x) + c, \]
where c is an arbitrary constant. And, indeed, it is this arbitrary constant that makes
this integral indefinite as we do not know what c is. In a similar vein, instead of writing
\[ \int f(x)\,dx \quad\text{we could also write}\quad \int_a^b f(x)\,dx, \]
where the constants a and b are called the limits of integration.
In order to work out integrals that look like this we need to know what to do with these
limits and the procedure is:
Firstly: Deal with the integral. Integrating f(x), we take one of its
antiderivatives, F(x), and then write
\[ \int_a^b f(x)\,dx = \Bigl[F(x)\Bigr]_a^b. \]
In particular, as we shall see below, observe that we no longer need a constant of
integration.
Secondly: Deal with the limits. By definition, we let
\[ \Bigl[F(x)\Bigr]_a^b = F(b) - F(a), \]
i.e. we subtract the value of the antiderivative at x = a from its value at x = b.

Notice this means that, if F (x) is an antiderivative of f (x), we have
\[ \int_a^b f(x)\,dx = F(b) - F(a), \]
i.e. the value of the integral depends only on the value of the antiderivative at the
points x = a and x = b. Thus, this is now a definite integral as it no longer involves an
arbitrary constant, c.
Activity 8.5 If F (x) is an antiderivative of f (x), show that

\[ \int_a^b f(x)\,dx = \Bigl[F(x) + c\Bigr]_a^b = F(b) - F(a), \]
if c is a constant. Hence explain why we can omit the constant of integration when
evaluating definite integrals.
Another consequence of this discussion is that it allows us to see how to use our basic
rules of integration to evaluate definite integrals. For instance, if k and l are constants
and f(x) and g(x) are functions, then we can see that the linear combination rule gives
us
\[ \int_a^b [k f(x) + l g(x)]\,dx = k\int_a^b f(x)\,dx + l\int_a^b g(x)\,dx, \]
if we are using definite integrals.
Activity 8.6 Following what we saw in Section 8.1.2, write down the constant
multiple rule and the sum rule for definite integrals.
Activity 8.7 Using what we have seen so far, derive the linear combination rule for
definite integrals.
Now that we have the basic idea, let’s see how we can work out a definite integral.
Example 8.6 Evaluate $\int_1^3 (x + 4)\,dx$.
If we follow the two-step procedure above, i.e. integrating to find an antiderivative
and then dealing with the limits, we get
\[ \int_1^3 (x + 4)\,dx = \left[\frac{x^2}{2} + 4x\right]_1^3 = \left(\frac{3^2}{2} + 4(3)\right) - \left(\frac{1^2}{2} + 4(1)\right) = \left(\frac{9}{2} + 12\right) - \left(\frac{1}{2} + 4\right) = 12, \]
which is the value of this definite integral.
Alternatively, we could use the linear combination rule to get
\[ \int_1^3 (x + 4)\,dx = \int_1^3 x\,dx + \int_1^3 4\,dx = \left[\frac{x^2}{2}\right]_1^3 + \Bigl[4x\Bigr]_1^3 = \left(\frac{3^2}{2} - \frac{1^2}{2}\right) + \bigl(4(3) - 4(1)\bigr) = \left(\frac{9}{2} - \frac{1}{2}\right) + (12 - 4) = 12, \]
which is the same answer as before.
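The two-step procedure used in Example 8.6 translates directly into code. This is a minimal sketch, assuming the antiderivative has already been found by hand; the function names and the shift of 7 are illustrative choices made here.

```python
from fractions import Fraction

def antiderivative(x):
    # One antiderivative of f(x) = x + 4, namely F(x) = x^2/2 + 4x.
    x = Fraction(x)
    return x**2 / 2 + 4 * x

def definite_integral(F, a, b):
    # Deal with the limits: [F(x)] from a to b is F(b) - F(a).
    return F(b) - F(a)

value = definite_integral(antiderivative, 1, 3)

# Using a different antiderivative, F(x) + 7, gives the same value
# because the added constant cancels in F(b) - F(a).
shifted = definite_integral(lambda x: antiderivative(x) + 7, 1, 3)

print(value, shifted)
```

The second calculation illustrates why no constant of integration is needed when evaluating a definite integral.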
What definite integrals with non-negative integrands represent
Definite integrals are useful because they tell us about the area under a curve.
Specifically, if we have the definite integral
\[ \int_a^b f(x)\,dx, \]
where f(x) ≥ 0 for all x such that a ≤ x ≤ b,⁴ we say that we have a non-negative
integrand and find that the value of the integral is the area of the region between the
curve y = f (x), the x-axis and the vertical lines x = a and x = b as illustrated in
Figure 8.1.
Figure 8.1: The hatched region is between the curve y = f(x), the x-axis and the vertical
lines x = a and x = b. In cases like this we have a non-negative integrand, i.e. f(x) ≥ 0
for a ≤ x ≤ b, and so the definite integral $\int_a^b f(x)\,dx$ gives us the area of this hatched
region.
Example 8.7 Find the area of the region between the line y = 4 − 2x, the x-axis
and the vertical lines x = 0 and x = 2 which is illustrated in Figure 8.2(a).
There are two ways to find this area:
As this is just a right-angled triangle, the area is just ‘half times base times
height’, i.e.
\[ \text{area of triangle} = \frac{1}{2} \times 2 \times 4 = 4. \]
Thus, the area of the region is four.
⁴ At the moment we will just accept this caveat. The reason why we need f(x) to be non-negative for
values of x between the limits of integration will become clear very soon.
As we have y = f (x) with f (x) = 4 − 2x, we can see from Figure 8.2(a) that
f (x) ≥ 0 between x = 0 and x = 2. So, as noted above, the area should be given
by
\[ \int_0^2 (4 - 2x)\,dx = \Bigl[4x - x^2\Bigr]_0^2 = (4 \times 2 - 2^2) - (4 \times 0 - 0^2) = (8 - 4) - 0 = 4, \]
which is, again, four.

Consequently, this confirms that the definite integral does give us the area of the
region between the line y = 4 − 2x, the x-axis and the vertical lines x = 0 and x = 2,
at least when f (x) ≥ 0 between the vertical lines.
Figure 8.2: Non-negative integrands. (a) For Example 8.7, the region between the line
y = 4 − 2x, the x-axis and the vertical lines x = 0 and x = 2. (b) For Example 8.8, the
region between the parabola $y = 4 - x^2$, the x-axis and the vertical lines x = −1 and
x = 1.
However, generally, we won’t have a simple geometric way of finding the area under a
curve and so we will have to use integration.
Example 8.8 Find the area of the region between the parabola $y = 4 - x^2$, the
x-axis and the vertical lines x = −1 and x = 1 which is illustrated in Figure 8.2(b).
As we have y = f(x) with $f(x) = 4 - x^2$, we can see from Figure 8.2(b) that
f(x) ≥ 0 between x = −1 and x = 1. So, as noted above, the area should be given by
\[ \int_{-1}^{1} (4 - x^2)\,dx = \left[4x - \frac{x^3}{3}\right]_{-1}^{1} = \left(4(1) - \frac{(1)^3}{3}\right) - \left(4(-1) - \frac{(-1)^3}{3}\right) = \frac{11}{3} - \left(-\frac{11}{3}\right) = \frac{22}{3}, \]
i.e. the area is $7\frac{1}{3}$.
Activity 8.8 Observe that the region in Example 8.8, as illustrated in
Figure 8.2(b), is symmetric about the y-axis. Use this observation to explain why
the area of this region is two times the area represented by the definite integral,
\[ \int_0^1 (4 - x^2)\,dx, \]
and verify that this does indeed give the correct area.
What definite integrals with non-positive integrands represent
We now start to consider what happens to the relationship between definite integrals
and areas when we cannot guarantee that the integrand is non-negative. That is, what
happens if we do not have f (x) ≥ 0 for all x such that a ≤ x ≤ b? To simplify matters,
we will start by asking: What happens when this condition always fails? That is, what
happens when the integrand is non-positive as f (x) ≤ 0 for all x such that a ≤ x ≤ b?
Consider the area of the region bounded by the curve y = f (x), the x-axis and the
vertical lines x = a and x = b when we have a non-positive integrand, i.e. when f (x) ≤ 0
for a ≤ x ≤ b, as illustrated in Figure 8.3. Now, if we note that
If f (x) ≤ 0 for all a ≤ x ≤ b, then −f (x) ≥ 0 for all a ≤ x ≤ b,
we see that the function −f (x) does give us a non-negative integrand and so, following
what we saw above, the area, A, of the region in question is given by
\[ A = \int_a^b -f(x)\,dx = -\int_a^b f(x)\,dx \implies \int_a^b f(x)\,dx = -A. \]
That is, for non-positive integrands, the definite integral gives us minus the area. Thus,
in the case of non-positive integrands, the area is given by the magnitude of the definite
integral. Let’s have a look at an example.
Figure 8.3: The hatched region is between the curve y = f(x), the x-axis and the vertical
lines x = a and x = b. In cases like this we have a non-positive integrand, i.e. f(x) ≤ 0 for
a ≤ x ≤ b, and so the definite integral $\int_a^b f(x)\,dx$ gives us minus the area of this hatched
region.
Example 8.9 Find the area of the region between the line y = 4 − 2x, the x-axis
and the vertical lines x = 2 and x = 4 which is illustrated in Figure 8.4(a).
There are two ways to find this area:
As this is just a right-angled triangle, the area is just ‘half times base times
height’, i.e.
\[ \text{area of triangle} = \frac{1}{2} \times 2 \times 4 = 4. \]
Thus, the area of the region is four.
As we have y = f(x) with f(x) = 4 − 2x, we can see from Figure 8.4(a) that
f(x) ≤ 0 between x = 2 and x = 4. So, looking at the definite integral we get,
\[ \int_2^4 (4 - 2x)\,dx = \Bigl[4x - x^2\Bigr]_2^4 = (4 \times 4 - 4^2) - (4 \times 2 - 2^2) = (16 - 16) - (8 - 4) = -4, \]
which is minus the answer we would expect. As such, we take the magnitude of
this answer and so the area is, again, four.
Consequently, if f (x) ≤ 0 between the vertical lines, the definite integral gives us
minus the area and so we take the magnitude of the definite integral to find the area.
Figure 8.4: Negative integrands and their relationship to area. The region between the
line y = 4 − 2x, the x-axis and the vertical lines (a) x = 2 and x = 4 for Example 8.9,
and (b) x = 0 and x = 4 for Example 8.10.
What definite integrals with general integrands represent
We now consider what happens to the relationship between definite integrals and areas
when we cannot guarantee that the integrand is non-positive or non-negative. That is,
what happens if f (x) ≥ 0 for some x such that a ≤ x ≤ b but not others? Let’s start by
considering the simple case where we have an integrand which is neither non-positive
nor non-negative because there is some number c such that a ≤ c ≤ b where
f (x) ≥ 0 for all x such that a ≤ x ≤ c, and
f (x) ≤ 0 for all x such that c ≤ x ≤ b,

as illustrated in Figure 8.5. Indeed, following on from what we saw above, we see that
the area, say $A_1$, of the hatched region between the vertical lines x = a and x = c is
given by the definite integral
\[ \int_a^c f(x)\,dx, \]
i.e. $A_1 = \int_a^c f(x)\,dx$, and
the area, say $A_2$, of the hatched region between the vertical lines x = c and x = b is
given by minus the definite integral
\[ \int_c^b f(x)\,dx, \]
i.e. $A_2 = -\int_c^b f(x)\,dx$.
As such, the area, say A, of the hatched region between the lines x = a and x = b is
now given by
\[ A = A_1 + A_2 = \int_a^c f(x)\,dx + \left|\int_c^b f(x)\,dx\right|. \]
In particular, note that in this case we will need to find two different definite integrals
to find the area and not one like we did in the earlier cases!
Figure 8.5: The hatched region is between the curve y = f(x), the x-axis and the vertical
lines x = a and x = b. In cases like this, where we have a non-negative integrand for a ≤ x ≤ c
and a non-positive integrand for c ≤ x ≤ b, we need to find two different definite integrals
to find the area of the region.
Thus, for general integrands, the procedure for finding the area of the region bounded
by the curve y = f (x), the x-axis and the vertical lines x = a and x = b is as follows:
Firstly, determine all the points where the curve crosses the x-axis with
x-coordinates between x = a and x = b.
Secondly, use these points to determine (possibly via a sketch) where the curve is
positive and where the curve is negative.
Thirdly, use this information to determine the areas by finding the appropriate
definite integrals (bearing in mind that the integrands will now be either
non-negative or non-positive).
Fourthly, add up all the areas to find the total area.
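The four steps above can be sketched in code as follows. This is an illustration only: it assumes that an antiderivative F and the crossing points have already been found by hand, and the function name is hypothetical. The two calls use the line y = 4 − 2x on 0 ≤ x ≤ 4 and the parabola y = 1 − x² on −2 ≤ x ≤ 2, which are the regions treated in Examples 8.10 and 8.11.

```python
def area_between_curve_and_axis(F, a, b, crossings):
    """Total area between y = f(x) and the x-axis for a <= x <= b.

    F is any antiderivative of f, and crossings lists the x-values
    strictly between a and b where the curve crosses the x-axis.
    """
    points = [a] + sorted(crossings) + [b]
    total = 0.0
    for left, right in zip(points, points[1:]):
        signed = F(right) - F(left)  # definite integral on this sub-interval
        total += abs(signed)         # its magnitude is the area there
    return total

# f(x) = 4 - 2x on 0 <= x <= 4, crossing the x-axis at x = 2.
area_line = area_between_curve_and_axis(lambda x: 4 * x - x**2, 0, 4, [2])

# f(x) = 1 - x^2 on -2 <= x <= 2, crossing the x-axis at x = -1 and x = 1.
area_parabola = area_between_curve_and_axis(lambda x: x - x**3 / 3, -2, 2, [-1, 1])

print(area_line, area_parabola)
```

Taking the magnitude on each sub-interval handles the non-negative and non-positive pieces in the same way, so the sign of the integrand never has to be checked explicitly once the crossing points are known.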

To see how this works, let’s consider a couple of examples.
Example 8.10 Find the area of the region between the line y = 4 − 2x, the x-axis
and the vertical lines x = 0 and x = 4 which is illustrated in Figure 8.4(b).
As indicated in Figure 8.4(b), the line y = 4 − 2x crosses the x-axis when x = 2 and
this lies between x = 0 and x = 4. We can also see that the function is non-negative
for 0 ≤ x ≤ 2 and non-positive for 2 ≤ x ≤ 4. As such, using our earlier workings in
Examples 8.7 and 8.9, we split the total region into two sub-regions to see that:
Between x = 0 and x = 2 we evaluate the definite integral,
\[ \int_0^2 (4 - 2x)\,dx, \]
which gives us 4 as we saw in Example 8.7. Thus, the area is four here as we
have a non-negative integrand.
Between x = 2 and x = 4 we evaluate the definite integral,
\[ \int_2^4 (4 - 2x)\,dx, \]
which gives us −4 as we saw in Example 8.9. Thus, the area is four here as we
have a non-positive integrand.
Consequently, the total area is eight.
We also note, in passing, that the definite integral

\[ \int_0^4 (4 - 2x)\,dx = \Bigl[4x - x^2\Bigr]_0^4 = (4 \times 4 - 4^2) - (4 \times 0 - 0^2) = (16 - 16) - 0 = 0, \]
and, as this is zero, it most definitely is not giving us the area we seek!
Activity 8.9 Verify that the answer to the previous example is correct by finding
the areas of the triangles involved.
Example 8.11 Find the area of the region between the parabola $y = 1 - x^2$, the
x-axis and the vertical lines x = −2 and x = 2 which is illustrated in Figure 8.6.
As indicated in Figure 8.6, the parabola $y = 1 - x^2$ crosses the x-axis when x = ±1
and these points lie between x = −2 and x = 2. We can also see that the function is
non-negative for −1 ≤ x ≤ 1 and non-positive for −2 ≤ x ≤ −1 and 1 ≤ x ≤ 2. As
such, we split the total region into three sub-regions to see that:
Between x = −2 and x = −1 we evaluate the definite integral,
\[ \int_{-2}^{-1} (1 - x^2)\,dx = \left[x - \frac{x^3}{3}\right]_{-2}^{-1} = \left(-1 - \frac{(-1)^3}{3}\right) - \left(-2 - \frac{(-2)^3}{3}\right) = \left(-1 + \frac{1}{3}\right) - \left(-2 + \frac{8}{3}\right) = -\frac{4}{3}. \]
Thus, the area is $\frac{4}{3}$ here as we have a non-positive integrand.
Between x = −1 and x = 1 we evaluate the definite integral,
\[ \int_{-1}^{1} (1 - x^2)\,dx = \left[x - \frac{x^3}{3}\right]_{-1}^{1} = \left(1 - \frac{1^3}{3}\right) - \left(-1 - \frac{(-1)^3}{3}\right) = \left(1 - \frac{1}{3}\right) - \left(-1 + \frac{1}{3}\right) = \frac{4}{3}. \]
Thus, the area is $\frac{4}{3}$ here as we have a non-negative integrand.
Between x = 1 and x = 2 we evaluate the definite integral,
\[ \int_{1}^{2} (1 - x^2)\,dx = \left[x - \frac{x^3}{3}\right]_{1}^{2} = \left(2 - \frac{2^3}{3}\right) - \left(1 - \frac{1^3}{3}\right) = \left(2 - \frac{8}{3}\right) - \left(1 - \frac{1}{3}\right) = -\frac{4}{3}. \]
Thus, the area is $\frac{4}{3}$ here as we have a non-positive integrand.
Consequently, the total area is $\frac{4}{3} + \frac{4}{3} + \frac{4}{3}$ which is four.
We also note, in passing, that the definite integral,
\[ \int_{-2}^{2} (1 - x^2)\,dx = \left[x - \frac{x^3}{3}\right]_{-2}^{2} = \left(2 - \frac{2^3}{3}\right) - \left((-2) - \frac{(-2)^3}{3}\right) = \left(2 - \frac{8}{3}\right) - \left(-2 + \frac{8}{3}\right) = -\frac{4}{3}, \]
and this is most definitely not giving us the area we seek!
Figure 8.6: Negative integrands and their relationship to area (continued). For
Example 8.11, the region between the parabola $y = 1 - x^2$, the x-axis and the vertical
lines x = −2 and x = 2.
Learning outcomes
explain the relationship between differentiation and indefinite integration;
find simple indefinite integrals by using standard integrals and the rules of
integration;
explain the relationship between a definite integral and an area;

find areas using definite integration.
Exercises
Exercise 8.1
Find the following indefinite integrals and use differentiation to verify your answer.
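i. $\int -17\,dx$;
ii. $\int 27x\,dx$;
iii. $\int 2x^3\,dx$;
iv. $\int 5\sqrt{x}\,dx$;
v. $\int \frac{4}{x}\,dx$;
vi. $\int 5e^x\,dx$;
vii. $\int (3x^2 - 5x + 7)\,dx$;
viii. $\int (3x^{10} + 8x^5 + 4e^x)\,dx$;
ix. $\int \left(3\sqrt{x^3} - 2x^{-1/2}\right)dx$;
x. $\int \left(\frac{2}{x^3} + \frac{3}{2x}\right)dx$.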
Exercise 8.2
Differentiate the function F (x) = x ln(x) − x. (You will need to use the product rule!)
Hence find
\[ \int \ln(x)\,dx, \]
by thinking of the indefinite integral in terms of antiderivatives.
Exercise 8.3
Use the ‘adding powers’ power law and the constant multiple rule to show that
\[ \int e^{x+k}\,dx = e^{x+k} + c, \]
where c is an arbitrary constant.
Exercise 8.4
Find the indefinite integral
\[ \int (2x - 1)^2\,dx, \]
by multiplying out the brackets and integrating term-by-term.
Exercise 8.5
Evaluate the following definite integrals.
i. $\int_{7}^{15} 2\,dx$;
ii. $\int_{0}^{1} x^5\,dx$;
iii. $\int_{2}^{8} \frac{1}{2x}\,dx$;
iv. $\int_{-1}^{0} 2e^x\,dx$;
v. $\int_{1}^{4} 3\sqrt{x}\,dx$;
vi. $\int_{-1}^{2} (4x^3 + 3)\,dx$;
vii. $\int_{0}^{9} x\sqrt{x}\,dx$;
viii. $\int_{0}^{\pi} \sin(x)\,dx$.
Exercise 8.6
Find the areas between the following curves, the x-axis and the vertical lines x = 1 and
x = 3. (You may find it useful to sketch the curve in each case.)
i. $y = x^2 - x - 2$, ii. $y = x^2 - 3$.
Unit 9: Financial Mathematics I

Compound interest and its uses
Overview
In this unit we look at some of the basic ideas behind financial mathematics. The key
concept here is compound interest and how this adds value to our savings. We will also
look at different compounding intervals and see how we can use Annual Percentage
Rates (or APRs) to compare investments with different interest rates and compounding
intervals. Lastly, we will see how these ideas also allow us to model the depreciation of
assets over time.
Aims
To see how different kinds of interest work.
To see how certain investments can be compared using APRs.
To see how we can model depreciation using compounding.

9.1 Interest
Suppose you deposit a certain amount, called the principal, in a savings account that
offers you a certain rate of interest. Let’s say, for example, that you want to invest £500
in a savings account which pays 12% interest annually (i.e. every year). This means
that, after a year, you will receive 12% of £500 in interest. How much will this be?
Well, we recall that
\[ 12\% = \frac{12}{100} = 0.12, \]
and so we can see that 12% of £500 is given by
500 × 0.12 = 60.
That is, you will accrue (or receive) £60 in interest from investing this principal in this
account for a year and the amount in your account, called the balance, will now be £560.
If we were to leave this money in the account for a second year, it then becomes
necessary to know how the interest is being calculated and there are two types of
interest that we may wish to consider:
Simple interest is where the bank always pays you interest on your principal.
That is, even though the balance is £560 at the end of the first year, you still only
accrue 12% of £500 in interest. As such, under simple interest, your balance at the
end of the second year will be £620 as you will have your original deposit of £500
plus two lots of £60 in interest.
Compound interest is where the bank always pays you interest on your balance.
That is, at the end of the second year you will accrue
£560 × 0.12 = £67.20,
in interest. As such, under compound interest, your balance at the end of the
second year will be £627.20 as you will now have an additional £67.20 to add to
your previous balance of £560.
Notice, in particular, that compound interest gives us a higher balance at the end of the
second year than simple interest because we also get interest on the interest from the
previous year. That is, at the end of the first year our balance is
principal + first year’s interest = £500 + £60,
and, after the second year, we get 12% interest on both of these amounts which gives us
an additional £60 from the principal and an additional
£60 × 0.12 = £7.20,
from the first year’s interest yielding, as expected from above, a total of £67.20 in
interest. In this course, we will mainly focus on compound interest as that is most
widely used, but we will occasionally mention simple interest in the activities or when it
provides a useful application.
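To see the difference between the two types of interest year by year, here is a short Python sketch. It is only an illustration under our own naming (the two function names are not from the guide): simple interest always adds a fixed fraction of the principal, whereas compound interest adds a fraction of the current balance.

```python
def simple_interest_balances(principal, rate, years):
    """End-of-year balances when interest is always paid on the principal."""
    balances = []
    balance = principal
    for _ in range(years):
        balance += principal * rate   # interest on the original deposit only
        balances.append(round(balance, 2))
    return balances

def compound_interest_balances(principal, rate, years):
    """End-of-year balances when interest is always paid on the balance."""
    balances = []
    balance = principal
    for _ in range(years):
        balance += balance * rate     # interest on everything in the account
        balances.append(round(balance, 2))
    return balances

print(simple_interest_balances(500, 0.12, 2))    # [560.0, 620.0]
print(compound_interest_balances(500, 0.12, 2))  # [560.0, 627.2]
```

The two lists agree after one year and differ by £7.20 after two years, which is exactly the interest earned on the first year's interest.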
Of course, this process of calculating simple or compound interest can continue for any
number of years and so, instead of working these things out year-by-year, we want to be
able to work with a formula that will tell us the balance of the account after any given
number of years. In particular, to find these formulae, we need to generalise our
discussion so that we are now dealing with the following variables.
P , the principal, i.e. the amount that is initially invested.
r, the annual interest rate written in decimal form.[1]
n, the number of years in which we are interested.

In what follows we will find a formula that will allow us to calculate the balance of an
account in terms of these variables under compound interest. We will leave you to find
the corresponding formula for simple interest in Activity 9.1.
[1] In particular, even though interest rates will usually be given as a percentage, we want to work
with the decimal. That is, when speaking generally, we will specify an interest rate of 100r% as this
corresponds to the decimal r. (For example, we had 12% above and, as 100(0.12) = 12, this gives us the
decimal r = 0.12 that we used in our calculations.)
Activity 9.1 Suppose that you invest P in an account that pays simple interest at
a rate of 100r% per year. Explain why the balance of this account will be P (1 + nr)
after n years.
9.1.1 A formula for balances under annually compounded interest
We have seen how to calculate the compound interest accrued on £500 over two years
at 12% interest per year and we have used this to calculate the balance of the account
at the end of this time period. However, this method of calculating the balance is quite
tricky to generalise and so, instead of using the method above, we want to look at
another way of finding the balance at the end of each year. In particular, observe that
the balance at the end of the first year can be written as
500 + 12% of 500 = 500 + 500 × 0.12 meaning of ‘12% of’
= 500(1 + 0.12) common factor of 500
= 500(1.12) simplifying the bracket
which is, again, £560. That is, writing 12% as the decimal 0.12, we can see that the
effect of applying interest at 12% per year is the same as multiplying our principal by
1.12. Similarly, since the balance at the end of the first year is £500(1.12), at the end of
the second year, we have
500(1.12) + 12% of 500(1.12) = 500(1.12) + 500(1.12) × 0.12 meaning of ‘12% of’
= 500(1.12)(1 + 0.12) common factor of 500(1.12)
= 500(1.12)(1.12) simplifying the bracket
= $500(1.12)^2$ combining the brackets
which is, again, £627.20. Then, to see how much money will be in the account at the
end of three years, we just multiply again by (1.12) to get $500(1.12)^3$, i.e. £702.46 (to
the nearest penny) and, clearly, this generalises.
The key, then, is to think of our interest rate of 100r% per year as the decimal number
r so that, given a principal P , we can see that:
After one year, the balance of the account will be P from the initial investment plus
P r from the interest accrued on this investment, i.e. after one year we will have
P + P r = P (1 + r).
After two years, the balance of the account will be P (1 + r) from the balance at the
end of the first year plus P (1 + r)r from the interest accrued on this balance, i.e.
after two years we will have
$P(1 + r) + P(1 + r)r = P(1 + r)(1 + r) = P(1 + r)^2$.
After three years, the balance of the account will be $P(1 + r)^2$ from the balance at
the end of the second year plus $P(1 + r)^2 r$ from the interest accrued on this
balance, i.e. after three years we will have
$P(1 + r)^2 + P(1 + r)^2 r = P(1 + r)^2 (1 + r) = P(1 + r)^3$.
and so on, until . . .
After n years, the balance of the account will be $P(1 + r)^{n-1}$ from the balance at
the end of the (n − 1)th year plus $P(1 + r)^{n-1} r$ from the interest accrued on this
balance, i.e. after n years we will have
$P(1 + r)^{n-1} + P(1 + r)^{n-1} r = P(1 + r)^{n-1}(1 + r) = P(1 + r)^n$.
Thus, we have the following result.
Annually compounded interest
A principal, P , invested in an account that pays 100r% interest per year under annual
compounding will give a balance of
$P(1 + r)^n$,
after n years.
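As a quick check of this result, the formula can be written as a one-line Python function. This is just a sketch, and the name `annual_compound_balance` is our own.

```python
def annual_compound_balance(principal, rate, years):
    """Balance after `years` years of annual compounding: P(1 + r)^n."""
    return principal * (1 + rate) ** years

# The running example: 500 pounds at 12% per year.
for n in (1, 2, 3):
    print(n, round(annual_compound_balance(500, 0.12, n), 2))
# 1 560.0
# 2 627.2
# 3 702.46
```

These are the balances found above by working year by year, so the formula and the step-by-step method agree.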
9.1.2 Other compounding intervals

We have seen how to calculate the balance of an account where interest is compounded
annually, but often, the interest is calculated more frequently than this. For example,
the interest may be calculated on a quarterly basis and we call this quarterly
compounding. To see how this works, let’s consider a variation on our earlier example.
Let’s say that we invest £500 in an account which pays 12% interest per year
compounded quarterly. To find the balance of this account after a year, we start by
dividing the annual interest rate by four to get the quarterly rate, i.e.
\[ \text{quarterly rate} = \frac{\text{annual rate}}{4} = \frac{0.12}{4} = 0.03, \]
as there are four quarters in a year. Now, working this out as before, this means that
after the first quarter the balance is 500 × (1.03) = 515,
after the second quarter the balance is $500 \times (1.03)^2 = 530.45$,
after the third quarter the balance is $500 \times (1.03)^3 = 546.36$,
after the fourth quarter the balance is $500 \times (1.03)^4 = 562.75$,

and so £562.75 is the balance of the account after one year (to the nearest penny).
Indeed, carrying on with this argument, we see that the balance of this account after n
years, given that we are using quarterly compounding, is
$500 \times (1.03)^{4n}$,
since n years is the same as 4n quarters and in each quarter we compound at the
quarterly interest rate. Thus, thinking of this more generally, we get the following result.
Quarterly compounded interest
A principal, P , invested in an account that pays 100r% interest per year under
quarterly compounding will give a balance of
\[ P\left(1 + \frac{r}{4}\right)^{4n}, \]
after n years. Note that r/4 is the quarterly interest rate and 4n is the number of
quarters in n years.
Activity 9.2 Explain why quarterly compounded interest works in this way.
There are, of course, other periods over which compounding can occur, for example:
monthly compounding uses a monthly rate of r/12 and, after the first year, the
balance is
\[ P\left(1 + \frac{r}{12}\right)^{12}, \]
due to the twelve monthly compoundings. After n years this yields a balance of
\[ P\left(1 + \frac{r}{12}\right)^{12n}, \]
as there are 12n months in n years.
weekly compounding uses a weekly rate of r/52 and, after the first year, the balance
is
\[ P\left(1 + \frac{r}{52}\right)^{52}, \]
due to the fifty-two weekly compoundings. After n years this yields a balance of
\[ P\left(1 + \frac{r}{52}\right)^{52n}, \]
as there are 52n weeks in n years.
Activity 9.3 Explain why monthly and weekly compounded interest work in this
way.
Activity 9.4 Say I invest £500 at 12% interest per year. What is the balance after
one year if the interest is compounded annually? Quarterly? Monthly? Weekly?
What do you notice about these numbers?
In each case, what is the balance after three years?
In each case, how much interest will you have received after six months?
[Note that, to 5dp, $(1.01)^6 = 1.06152$ and $\left(\frac{1303}{1300}\right)^{26} = 1.06176$.]
Thinking about these examples more generally leads us to the following general result
for compounding over a given interval.
Compound interest over a given interval
A principal, P , in an account that pays 100r% interest per year where interest is
compounded over m equal intervals in each year will give a balance of
\[ P\left(1 + \frac{r}{m}\right)^{mn}, \]
after n years. Note that r/m is the interest rate for each compounding and mn is
the number of compoundings in n years.
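The general result translates directly into a small Python function. The sketch below uses our own parameter names (`periods_per_year` plays the role of m) and checks it against the quarterly example worked through earlier.

```python
def compound_balance(principal, rate, periods_per_year, years):
    """Balance after `years` years: P(1 + r/m)^(mn)."""
    rate_per_period = rate / periods_per_year       # r/m
    number_of_periods = periods_per_year * years    # mn
    return principal * (1 + rate_per_period) ** number_of_periods

# Quarterly compounding (m = 4) of 500 pounds at 12% per year for one year.
print(round(compound_balance(500, 0.12, 4, 1), 2))  # 562.75

# Taking m = 1 recovers the annual compounding formula P(1 + r)^n.
print(round(compound_balance(500, 0.12, 1, 1), 2))  # 560.0
```

Changing `periods_per_year` to 12 or 52 gives the monthly and weekly cases in the same way.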
And, using this result, we can easily recover all of the compounding results that we have
seen so far.
Activity 9.5 Explain why the general result for compound interest over a given
interval works.
Daily compounded interest is always calculated on the assumption that a year has 365
days. But, as we know, in reality, every four years we have a leap year that has 366
days. In the next activity, just for fun, we consider how this would affect the calculation
of daily compounded interest if we were to take this into account.
Activity 9.6 Say I invest £1,000,000 at 12% interest per year at the beginning of
a common (i.e. non-leap) year. If the interest is compounded daily, what is the
balance at the end of the year?
Say I invest £1,000,000 at 12% interest per year at the beginning of a leap year. If
the interest is compounded daily, what is the balance at the end of the year?
What is the balance after four years?
[Note that, to 8dp, $\left(\frac{9128}{9125}\right)^{365} = 1.12747462$ and $\left(\frac{3051}{3050}\right)^{366} = 1.12747468$.]
9.1.3 Continuous compounding

As we saw in Activity 9.4, given a fixed principal and a fixed annual interest rate, we
get a larger balance if we have a larger number of compoundings in a year. In
particular, we have seen that after one year, if we have an interest rate of 100r% per
year and we compound m times in a year, the balance of the account will be
\[ P\left(1 + \frac{r}{m}\right)^m, \]
where r/m is the interest applied during each compounding. Indeed, as m increases,
even though r/m, the rate at which interest is earned in each period, decreases, the
number of periods increases and, overall, the effect of these two changes is an increase in
the balance at the end of that year. So, one might wonder what the balance after one
year would be if we were to make m, the number of compoundings in a year, arbitrarily
large. Would the balance continue to increase? Or would it level off at some maximum
value? In fact, it turns out that we get the latter and the maximum value we get
involves the number e that we first saw in Section 4.1.2. In particular, we get the
following result.
The exponential constant
As m gets larger and larger, the value of

\[ \left(1 + \frac{r}{m}\right)^m \]
gets closer and closer to
\[ e^r, \]
where the number e, which we call the exponential constant, is approximately 2.71828
(5dp).
Thus we can see that if the bank was to compound continuously (or, speaking loosely, if
the value of m was ‘infinitely large’ so that the interest was effectively being
compounded at ‘every instant’), the balance of the account at the end of
one year would be $P e^r$,
two years would be $(P e^r) e^r = P e^{2r}$,
three years would be $(P e^{2r}) e^r = P e^{3r}$,
and so on until, at the end of
n years we would have $(P e^{(n-1)r}) e^r = P e^{nr}$ in the account.

Thus, in general, we have the following result.
Continuously compounded interest
A principal, P , in an account that pays 100r% interest per year under continuous
compounding will give a balance of
$P e^{nr}$,
after n years. Here e is the exponential constant.
Clearly, this means that if I invest £500 at 12% interest per year with continuous
compounding, then given that $e^{0.12} = 1.127497$ to 6dp, we can see that the balance of
the account after one year will be given by,
\[ 500\,e^{0.12} = 500(1.127497) = 563.7485, \]
or £563.75 (to the nearest penny) which is, we note, more than we would get from any
finite number of compoundings.
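The levelling-off described above can be seen numerically. The following Python sketch (with names of our own choosing) prints the one-year growth factor for increasingly many compoundings and compares it with the limiting value from the standard library's `math.exp`.

```python
import math

def growth_factor(rate, periods_per_year):
    """One-year growth factor (1 + r/m)^m for m compoundings per year."""
    return (1 + rate / periods_per_year) ** periods_per_year

rate = 0.12
for m in (1, 4, 12, 52, 365, 1_000_000):
    print(m, round(growth_factor(rate, m), 6))

print("limit", round(math.exp(rate), 6))   # limit 1.127497

# Balance of 500 pounds after one year of continuous compounding.
continuous_balance = 500 * math.exp(rate * 1)
print(round(continuous_balance, 2))        # 563.75
```

The printed factors increase with m but never reach the limiting value, which is why continuous compounding gives the largest balance.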
Activity 9.7 If I invest £500 at 12% interest per year with continuous
compounding, what will be the balance of the account after (i) two years, (ii) six
years and (iii) n years?
[Note that, to 6dp, $e^{0.24} = 1.271249$.]
9.2 Problems involving interest rates

The balance of a bank account is specified by four pieces of information:
How much do I invest? This is the principal, P .

What is the interest rate? This is the annual interest rate, 100r%.
How long do I invest for? This is the number of years the investment is going to
last for, n.
How often is the interest compounded? This is the number of compoundings in a
year, m, if we are compounding a finite number of times every year or, if we are
continuously compounding, this is telling us to use $e^r$.
Often, mathematical problems concerning such investments supply you with two of the
first three bits of information (together with information about how often the interest is
compounded) and ask you to find the third. Let’s look at some examples.
9.2.1 How much do I need to invest to get...?

For example, consider that you are investing in an account which pays 12% interest per
year compounded annually and you want to get £10,000 after five years. How much do
you need to invest to get this return? In this case, we seek the smallest principal, P,
that will satisfy the inequality
\[ P(1.12)^5 \geq 10{,}000 \implies P \geq \frac{10{,}000}{(1.12)^5} = 5{,}674.268, \]
to 3dp if we use the fact that $(1.12)^5 = 1.762342$ to 6dp. This means that, if we invest
£5,674.27, we will meet, or rather just exceed, our target.
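The same rearrangement can be sketched in Python. The function name is our own, and rounding up to the next whole penny is our assumption about how 'just exceed' should be handled.

```python
import math

def required_principal(target, rate, years):
    """Smallest principal, rounded up to the penny, with P(1 + r)^n >= target."""
    exact = target / (1 + rate) ** years
    return math.ceil(exact * 100) / 100   # round up so the target is not missed

principal = required_principal(10_000, 0.12, 5)
print(principal)                             # 5674.27
print(round(principal * 1.12 ** 5, 2))       # 10000.0
```

Rounding down to £5,674.26 instead would leave the final balance just short of the target.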
9.2.2 What interest rate do I need to get...?

For example, consider that you are investing £5,000 and you want to get £6,000 after
a five year period. If the interest is compounded annually, what interest rate do you
require the bank to have? In this case, we need to find the smallest interest rate 100r%
per year that will satisfy the inequality
\[ 5{,}000(1 + r)^5 \geq 6{,}000, \]
which we can rearrange to get
\[ (1 + r)^5 \geq \frac{6{,}000}{5{,}000} \implies 1 + r \geq \left(\frac{6}{5}\right)^{1/5} \implies r \geq \left(\frac{6}{5}\right)^{1/5} - 1 = 0.0371, \]
to 4dp if we use the fact that $(6/5)^{1/5} = 1.0371$ to 4dp. This means that the interest
rate needs to be at least 3.8% (rounding up to 1dp) if we want to ensure that we meet, or rather just
exceed, our target.
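As a check, here is a minimal Python sketch of this calculation. The function name is our own, and the fifth root is taken by raising to the power 1/5.

```python
def required_annual_rate(principal, target, years):
    """Smallest r with principal * (1 + r)^years >= target, as a decimal."""
    return (target / principal) ** (1 / years) - 1

rate = required_annual_rate(5_000, 6_000, 5)
print(round(rate, 4))                    # 0.0371

# A rate of 3.7% falls just short of the target, whereas 3.8% exceeds it.
print(5_000 * 1.037 ** 5 >= 6_000)       # False
print(5_000 * 1.038 ** 5 >= 6_000)       # True
```

This is why the rate has to be rounded up, rather than to the nearest 0.1%, when quoting it to 1dp.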
9.2.3 How long do I need to invest to get...?

For example, consider that you are investing £500 at 12% interest per year compounded
annually. How long do you need to invest for in order to get a balance of £1,000? In
this case, we need to find the smallest number of years, n, that will satisfy the inequality
\[ 500(1.12)^n \geq 1{,}000 \implies (1.12)^n \geq \frac{1{,}000}{500} = 2. \]
That is, we need to find the value of n that makes $(1.12)^n$ greater than or equal to 2, a
problem that is easily solved using logarithms. For instance, if we take common
logarithms of both sides of this inequality, we get
\[ \log[(1.12)^n] \geq \log(2), \]
and so, using the power law for logarithms on the left-hand side of this inequality, we get
\[ n \log(1.12) \geq \log(2) \implies n \geq \frac{\log(2)}{\log(1.12)}, \]
as log(1.12) > 0. Now, if we were given that log(2) = 0.301 and log(1.12) = 0.049, both
to 3dp, this gives us
\[ \frac{\log(2)}{\log(1.12)} = \frac{0.301}{0.049} = 6.14, \]
to 2dp. Thus, as this gives us n ≥ 6.14, we need to invest for at least 6.14 years if we
want to ensure that we meet, or rather just exceed, our target. Consequently, as interest
is calculated at the end of each year, this means that we must invest for seven years to
get the desired return.
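The logarithm step can also be sketched in Python using the standard `math.log10` function for common logarithms. The function name is our own. Note that full-precision logarithms give about 6.12 rather than the 6.14 obtained from the 3dp values, but the conclusion is the same.

```python
import math

def years_needed(principal, target, rate):
    """Return (exact n, whole years) with principal * (1 + rate)^n >= target."""
    exact = math.log10(target / principal) / math.log10(1 + rate)
    # Interest is only added at the end of each year, so round up.
    return exact, math.ceil(exact)

exact, whole_years = years_needed(500, 1_000, 0.12)
print(round(exact, 2), whole_years)   # 6.12 7

# Six years is not quite enough, but seven years is.
print(round(500 * 1.12 ** 6, 2))      # 986.91
print(round(500 * 1.12 ** 7, 2))      # 1105.34
```

Because of floating-point rounding, a sketch like this should be treated with care when the exact answer is itself a whole number of years.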
9.2.4 Annual percentage rates

Suppose that we are given a choice between two bank accounts. One offers an interest
rate of 10% per year with daily compounding and the other offers an interest rate of
10.1% per year with quarterly compounding. The question is, which of these accounts
will give you the best return on your money? The one with the higher interest rate or
the one where the interest is compounded more often?
One way of comparing these accounts is to calculate the Annual Percentage Rate (or
APR). This gives us a way of comparing the returns by asking, if I invested £1 in the
account for one year, what interest rate with annual compounding would I need to get
the same return? That is, by converting the returns into an equivalent interest rate that
uses a standard number of compoundings (in this case, one) over one year, we can use
the APRs to decide which account gives the higher return.
So, if we let $100r^*\%$ be the APR, in the case of the account where we have an interest
rate of 10% per year with daily compounding, investing £1 would give us
\[ \text{return from account} \longrightarrow \left(1 + \frac{0.1}{365}\right)^{365} = 1 + r^* \longleftarrow \text{return from annual compounding}, \]
given that we are comparing the investments over one year. Then, if we were given the
relevant information, say that
\[ \left(1 + \frac{0.1}{365}\right)^{365} = 1.1052 \]
to 4dp, we would find that
\[ 1 + r^* = 1.1052 \implies r^* = 0.1052 = 10.52\%, \]
is the APR. And, similarly, in the case of the account where we have an interest rate of
10.1% per year with quarterly compounding, investing £1 would give us
\[ \text{return from account} \longrightarrow \left(1 + \frac{0.101}{4}\right)^{4} = 1 + r^* \longleftarrow \text{return from annual compounding}, \]
given that we are comparing the investments over one year. Then, if we were given the
relevant information, say that
\[ \left(1 + \frac{0.101}{4}\right)^{4} = 1.1049 \]
to 4dp, we would find that
\[ 1 + r^* = 1.1049 \implies r^* = 0.1049 = 10.49\%, \]
is the APR. Thus, as we get a better return (i.e. a higher APR) from the account where
we have an interest rate of 10% per year with daily compounding, we should opt for this
one. In particular, notice that here, the higher number of compoundings
more than compensates for the fact that this account has a slightly lower interest rate! To
summarise then, we have the following result.
Annual percentage rate (APR)
An account that pays 100r% interest per year where interest is compounded over m
equal intervals in each year has an APR of
\[ \left(1 + \frac{r}{m}\right)^m - 1, \]
as a decimal.
If the interest is continuously compounded, then the APR is
\[ e^r - 1, \]
as a decimal. Here e is the exponential constant.
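Both APR formulae fit into one small Python function. This is only a sketch: the name `apr` and the convention that `periods_per_year=None` means continuous compounding are our own choices.

```python
import math

def apr(rate, periods_per_year=None):
    """APR as a decimal; periods_per_year=None means continuous compounding."""
    if periods_per_year is None:
        return math.exp(rate) - 1                                # e^r - 1
    return (1 + rate / periods_per_year) ** periods_per_year - 1  # (1 + r/m)^m - 1

daily = apr(0.10, 365)      # 10% per year compounded daily
quarterly = apr(0.101, 4)   # 10.1% per year compounded quarterly

print(f"daily: {daily:.4%}, quarterly: {quarterly:.4%}")
print("daily account is better:", daily > quarterly)   # daily account is better: True
```

Comparing the two decimals directly, rather than their rounded percentages, avoids any doubt when two APRs are very close.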
Activity 9.8 You want to invest some money for a year and are given the choice
between two accounts that use monthly compounding. Given that one account offers
an interest rate of 5% per year and the other account offers an interest rate of 6%
per year for the first three months and 4% per year for the rest of the year, find their
APRs and decide which gives the best return.
[Note that, to 5dp, $\left(\frac{241}{240}\right)^{12} = 1.05116$, $\left(\frac{201}{200}\right)^{3} = 1.01508$ and $\left(\frac{301}{300}\right)^{9} = 1.03040$.]
9.3 Depreciation
Often, when you buy an asset, e.g. a car, its value depreciates (or goes down) over time.
For example, if you buy a car for £10,000 and you know that a car depreciates at a
rate of 5% per year, its value after one year is given by
10,000 − 5% of 10,000 = 10,000 − 10,000 × 0.05 = 10,000(1 − 0.05) = 10,000 × 0.95,
which is £9,500. To find the value after two years we follow a similar procedure to get
$10{,}000 \times (0.95)^2$,
which is £9,025. Clearly, this generalises, so that after n years the value of the car is
given by $10{,}000(0.95)^n$ pounds.
Thus, the idea behind depreciation is that the rate of depreciation acts like a compound
interest rate, but whereas with compound interest we add the effect of the interest rate,
when we look at depreciation we need to subtract to get the effect of the rate of
depreciation as the value is decreasing over time. And, as we saw above, this means that
we can use the same formulae, but now the rate of depreciation which is the positive
number, r, needs to be replaced by the negative number −r. As such, we have the
following result.
Compound depreciation
If the initial value of an asset is V and it depreciates at a rate of 100r% per year
where depreciation is compounded over m equal intervals in each year, then its value
will be
\[ V\left(1 - \frac{r}{m}\right)^{mn}, \]
after n years. Here r/m is the rate of depreciation for each compounding and mn is
the number of compoundings in n years.
If this asset depreciates continuously, its value after n years is
\[ V e^{-nr}, \]
where e is the exponential constant.
Example 9.1 A computer is bought for £1,000 and its value depreciates
continuously at a rate of 40% per year. How much will the computer be worth after
six months?
As the value of the computer is depreciating continuously at a rate of 40% per year for six
months, which is half of a year, its value after that time is given by
\[ 1{,}000\,e^{-(0.5)(0.4)} = 1{,}000\,e^{-0.2} = 1{,}000(0.81873) = 818.73, \]
where we have used the fact that $e^{-0.2} = 0.81873$ to 5dp. That is, the computer will
be worth £818.73 after six months.
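The depreciation formulae can be sketched in Python in the same way as the interest formulae, with the rate subtracted rather than added. The function name and the use of `periods_per_year=None` for continuous depreciation are our own conventions.

```python
import math

def depreciated_value(initial_value, rate, years, periods_per_year=None):
    """Value after `years` years; periods_per_year=None means continuous depreciation."""
    if periods_per_year is None:
        return initial_value * math.exp(-rate * years)              # V e^(-nr)
    periods = periods_per_year * years
    return initial_value * (1 - rate / periods_per_year) ** periods  # V(1 - r/m)^(mn)

# The car: 10,000 pounds depreciating at 5% per year, compounded annually, for two years.
print(round(depreciated_value(10_000, 0.05, 2, 1), 2))   # 9025.0

# Example 9.1: 1,000 pounds depreciating continuously at 40% per year for half a year.
print(round(depreciated_value(1_000, 0.40, 0.5), 2))     # 818.73
```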
Learning outcomes
solve problems that involve simple and compound interest;
use APRs to compare investments;
solve problems that involve depreciation.
Exercises
Exercise 9.1
You invest P pounds in a savings account that pays 5% interest per year using annual
compounding.
(i) Write down, in terms of P , the amount that will be in the account after one, two
and three years.
(ii) If, after two years, the amount in the account is £1,764, how much did you
initially invest?
Exercise 9.2
Find the value of a principal sum of £10,000 invested at an interest rate of 12% per
year for three years when the interest is compounded (i) annually, (ii) quarterly, (iii)
monthly, and (iv) continuously.
What is the APR of each of these investments?
[Note that, to 7dp, $(1.03)^4 = 1.1255088$, $(1.01)^{12} = 1.1268250$ and $e^{0.12} = 1.1274969$.]
Exercise 9.3
Two investments are made and it is given that the principal of one is 80% of the other.
If the smaller principal is put into an account where interest is paid at 5% per year
using continuous compounding and the larger principal is put into an account where
interest is paid at 2% per year using continuous compounding, how long will it take for
the two accounts to have the same balance?
[Note that, to 5dp, ln(0.8) = −0.22314.]
Exercise 9.4
A car is worth £20,000 brand-new, but its value depreciates continuously at a rate of
20% per year.
(i) How much will the car be worth in three years?
(ii) When will it be worth half of its initial value?

[Note that, to 7dp, $e^{-0.2} = 0.8187308$ and $\ln(2) = 0.6931472$.]
Unit 10: Financial mathematics II

Applications of series
Overview
In this final Mathematics unit, we look at some more complicated ideas in financial
mathematics. The key concept here is a geometric series and how this allows us to deal
with regular savings plans and annuities. We will also see how to compare the value of
different investment strategies using the idea of present values.
Aims
To see how arithmetic and geometric series can be summed.
To see how geometric series can be used to model certain kinds of investment.
To see how certain investments can be compared using present values.

10.1 Sequences and series

In general, a sequence is an ordered list of numbers such as
2, 5, 8, 11, . . .
where here, the list is ordered because we consider 2 to be the first term, 5 to be the
second term and so on. We could think of this sequence of numbers as what we
get when we start with two and then add three to the previous term to get each
successive term. As we could continue to do this indefinitely, we use the ‘. . .’ to
indicate that this list of numbers goes on forever. In this course, we will be interested in
two special types of sequence and what we get when we add up some (or all) of the
terms in such a sequence.
10.1.1 Arithmetic sequences and series

An arithmetic sequence is a sequence where each term is found by adding a common
difference to the previous term. That is, if the first term is a and the common difference
is d, then we get the arithmetic sequence given by
a, a + d, a + 2d, a + 3d, . . .
which is generated by adding the number d to each term to get the next term. Observe
that we call d the common difference because we move from one term of the sequence to
the next by adding d.
Of course, we have seen this kind of thing before since taking the first term to be P and
the common difference to be P r, we get the arithmetic sequence
P, P + P r, P + 2P r, P + 3P r, . . .
which is, for principal P and an interest rate of 100r% per year, the initial balance
followed by the balance after one year, two years, three years, . . . under simple interest.
Of course, this means that the balance after n years will be given by P + nP r, or
P (1 + nr), as we saw in Activity 9.1.
Summing an arithmetic series
An arithmetic series is what we get when we add up a certain number of successive
terms from an arithmetic sequence. For instance, if we were to add up the first three
terms of the arithmetic sequence
a, a + d, a + 2d, a + 3d, . . .
we would want to find the sum of the arithmetic series
a + (a + d) + (a + 2d).
We can easily find this sum, let’s call it S, by writing it as
S = a + (a + d) + (a + 2d),
and then rewriting the series in reverse order to get
S = (a + 2d) + (a + d) + a,
so that, adding the corresponding terms in these two expressions for S together we get
2S = [a + (a + 2d)] + [(a + d) + (a + d)] + [(a + 2d) + a],
which gives us
2S = [2a + 2d] + [2a + 2d] + [2a + 2d].
Now, since there are three occurrences of (2a + 2d) on the right-hand side of this
expression, we get
\[ 2S = 3[2a + 2d] \implies S = \frac{3}{2}[2a + 2d] = 3[a + d], \]
as the sum of this arithmetic series.
In fact, this procedure can be used to sum any arithmetic series and if we apply it to
the first n terms of the arithmetic sequence
a, a + d, a + 2d, a + 3d, . . .
we can find a formula for the sum of any arithmetic series. If we do this, we get the
following result.
Sum of an arithmetic series
The sum of the arithmetic series
a + (a + d) + (a + 2d) + · · · + (a + [n − 1]d),
where a is the first term of the series, d is the common difference and n is the number
of terms is
\[ \frac{n}{2}\bigl(2a + [n - 1]d\bigr). \]
A useful way of remembering this formula is to note that we can write
2a + [n − 1]d as a + (a + [n − 1]d),
and so we have
\[ a + (a + d) + (a + 2d) + \cdots + (a + [n - 1]d) = \frac{n}{2}\bigl(a + (a + [n - 1]d)\bigr) = n\left(\frac{a + (a + [n - 1]d)}{2}\right). \]
So, noting that n is the number of terms, a is the first term of the series and a + [n − 1]d
is the last term in the series, this means that the sum of the arithmetic series
a + (a + d) + (a + 2d) + (a + 3d) + · · · + (a + [n − 1]d),
can be thought of as ‘the number of terms multiplied by the average of the first and last
terms of the series’.
Example 10.1 Find the sum of the arithmetic series 1 + 4 + 7 + 10 + 13.
Here the first term is 1 and we have to add three to get each successive term and so
that is the common difference. So, as there are five terms in the series, we can use
the formula to see that
\[ 1 + 4 + 7 + 10 + 13 = \frac{5}{2}\bigl(2(1) + [5 - 1](3)\bigr) = \frac{5}{2} \times 14 = 35, \]
as you can easily verify with your calculator.

Alternatively, we see that the first term is 1, the last term is 13 and so their average
is 7. Multiplying this by the number of terms, i.e. 5, we again get a sum of 35.
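Instead of a calculator, the formula and the 'average of first and last terms' idea can both be checked with a few lines of Python. This is a sketch with function names of our own choosing.

```python
def arithmetic_series_sum(first_term, common_difference, number_of_terms):
    """Sum of a + (a + d) + ... + (a + [n - 1]d) using (n/2)(2a + [n - 1]d)."""
    n = number_of_terms
    return n * (2 * first_term + (n - 1) * common_difference) / 2

def sum_by_average(first_term, last_term, number_of_terms):
    """Number of terms multiplied by the average of the first and last terms."""
    return number_of_terms * (first_term + last_term) / 2

terms = [1 + 3 * k for k in range(5)]    # the series 1, 4, 7, 10, 13
print(sum(terms))                        # 35
print(arithmetic_series_sum(1, 3, 5))    # 35.0
print(sum_by_average(1, 13, 5))          # 35.0
```

All three approaches agree, as they should for any arithmetic series.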
Activity 10.1 Find the sum of the whole numbers from 1 to 100.
If n is a whole number, what is the sum of the whole numbers from 1 to n?
Activity 10.2 Suppose that you have an eccentric aunt who, starting in 2000, gives
you a cash gift every year and the amount you get (in pounds) is given by the year.
(So, in 2000 you get a gift of £2,000 and in 2001 you get a gift of £2,001, etc.) If
you save all of these gifts in your money box, how much will you have after you have
received the gift in 2013?
How much will you have in your money box after you have received n of these gifts?

Use the procedure described above to derive the formula for the sum of an
arithmetic series.
10.1.2 Geometric sequences and series

A geometric sequence is a sequence where each term is found by multiplying the
previous term by a common ratio. That is, if the first term is a and the common ratio is
r, then we get the geometric sequence given by
$a, ar, ar^2, ar^3, ar^4, \ldots$
which is generated by multiplying each term by the common ratio to get the next term.
Observe that we call r the common ratio because we move from one term of the
sequence to the next by multiplying by r.
Of course, we have seen this kind of thing before since taking the first term to be P and
the common ratio to be 1 + r, we get the sequence
$P, P(1 + r), P(1 + r)^2, P(1 + r)^3, \ldots$
which is, for a principal P and a 100r% per year interest rate, the initial balance
followed by the balance after one year, two years, three years, . . . under annual
compounding.
Summing a geometric series with a finite number of terms
A geometric series is what we get when we ‘add up’ a certain number of successive
terms from a geometric sequence. For instance, if we were to add up the first three
terms of the geometric sequence
$a, ar, ar^2, ar^3, ar^4, \ldots$
we would want to find the sum of the geometric series
$a + ar + ar^2$.
We can easily find this sum, let’s call it S, by writing
$S = a + ar + ar^2$,
and then multiplying this whole expression by the common ratio, r, to get
$rS = ar + ar^2 + ar^3$,
so that, subtracting the second expression from the first we get
$S - rS = a - ar^3$,
as all the intermediate terms cancel. Taking out the common factor of S on the
left-hand side and a on the right-hand side then gives
$S(1 - r) = a(1 - r^3)$.
So, assuming that $r \neq 1$,[1] we have
\[ S = a\,\frac{1 - r^3}{1 - r}, \]
as the sum of this geometric series.
But, what happens if r = 1? This case is not covered by the formula that we have just
found and so we must treat it separately. To do this, consider what happens to this
geometric series if r = 1 and notice that, in this case, all of the terms just become a, i.e.
we just have
S = a + a + a = 3a,
and so, this is the sum of this geometric series if r = 1.
In fact, this procedure can be used to sum any geometric series and if we apply it to the
first n terms of the geometric sequence
\[ a, ar, ar^2, ar^3, ar^4, \ldots \]
we can find a formula for the sum of any geometric series with a finite number of terms.
If we do this we get the following result.
Sum of a finite geometric series
The sum of the finite geometric series,
\[ a + ar + ar^2 + \cdots + ar^{n-2} + ar^{n-1}, \]
where a is the first term, r is the common ratio and n is the number of terms, is
\[ a\,\frac{1 - r^n}{1 - r}, \]
provided that r ≠ 1. If r = 1, the sum is $an$ instead.
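As a quick check on this result, the following minimal Python sketch compares the formula with a term-by-term sum. The function names `geometric_sum` and `geometric_sum_direct` are our own choices for illustration and are not part of the guide.

```python
def geometric_sum(a, r, n):
    """Sum of the n-term geometric series a + a*r + ... + a*r**(n-1)."""
    if r == 1:
        # The formula would divide by zero here; every term is just a.
        return a * n
    return a * (1 - r**n) / (1 - r)


def geometric_sum_direct(a, r, n):
    """The same sum, found by adding up the terms one at a time."""
    return sum(a * r**k for k in range(n))


# Compare the two approaches for a few choices of a, r and n.
for a, r, n in [(1, 2, 6), (3, 2, 5), (0.5, -0.5, 4), (5, 1, 4)]:
    print(a, r, n, geometric_sum(a, r, n), geometric_sum_direct(a, r, n))
```

The r = 1 case is handled separately in the code for the same reason it is handled separately in the text.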
Example 10.2 Sum the geometric series $1 + 2 + 2^2 + 2^3 + 2^4 + 2^5$.
This geometric series has six terms where the first term is 1 and the common ratio is
2. Thus, using the formula, we have
\[ 1 + 2 + 2^2 + 2^3 + 2^4 + 2^5 = 1 \times \frac{1 - 2^6}{1 - 2} = \frac{1 - 2^6}{-1} = 2^6 - 1 = 63, \]
as the sum of this series. This can be verified by adding up the terms on your
calculator.
¹ So that we are not dividing by zero!
Example 10.3 Sum the geometric series 3 + 6 + 12 + 24 + 48.
We notice that each term of this series is twice the previous one and so this
geometric series, which has five terms, can be written as
\[ 3 + 6 + 12 + 24 + 48 = 3 + 3 \times 2 + 3 \times 2^2 + 3 \times 2^3 + 3 \times 2^4, \]
which means that the first term is 3 and the common ratio is 2. Thus, using the
formula, we have
\[ 3 + 6 + 12 + 24 + 48 = 3 \times \frac{1 - 2^5}{1 - 2} = 3 \times \frac{-31}{-1} = 93, \]
as the sum of this series. This can be verified by adding up the terms on your
calculator.
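Example 10.4 Sum the geometric series $\frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16}$.
We note that this geometric series has four terms where the first term is $\frac{1}{2}$ and, as
we can write it as
\[ \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} = \frac{1}{2} + \frac{1}{2}\left(-\frac{1}{2}\right) + \frac{1}{2}\left(-\frac{1}{2}\right)^2 + \frac{1}{2}\left(-\frac{1}{2}\right)^3, \]
we can see that the common ratio is $-\frac{1}{2}$. Thus, using the formula, we have
\[ \frac{1}{2} \times \frac{1 - \left(-\frac{1}{2}\right)^4}{1 - \left(-\frac{1}{2}\right)} = \frac{1}{2} \times \frac{1 - \frac{1}{16}}{\frac{3}{2}} = \frac{1}{3}\left(1 - \frac{1}{16}\right) = \frac{1}{3} \times \frac{15}{16} = \frac{5}{16}, \]
as the sum of this series. This can be verified by adding up the terms on your
calculator.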

Use the procedure described above to derive the formula for the sum of a geometric
series with a finite number of terms.
Summing a geometric series with an infinite number of terms
Sometimes, we can make sense of what happens when we have an infinite number of
terms in our geometric series. In such cases, we want to find the value of
\[ a + ar + ar^2 + ar^3 + \cdots, \]
and here, the absence of a last term in the series is supposed to indicate that it ‘goes on
forever’ or that it has an infinite number of terms. To see what the sum of such an
infinite geometric series would be, we recall that if we just took the first n terms of this
series, the sum would be given by
\[ a\,\frac{1 - r^n}{1 - r}, \]
and we want to see what happens to this formula if we let n go off to infinity.
In particular, if |r| < 1, then $r^n$ gets smaller [in magnitude] as n gets larger. This means that, as n goes
to infinity, $r^n$ will go to zero, and so our formula will give us
\[ a\,\frac{1 - 0}{1 - r} = \frac{a}{1 - r}, \]
as the sum of our geometric series with an infinite number of terms.
However, if |r| > 1, then $r^n$ gets larger [in magnitude] as n gets larger. This means that,
as n goes to infinity, the magnitude of $r^n$ will go to infinity too and so we will not be able to make any
sense of the formula. In such cases, we say that the sum of the infinite geometric series
‘does not exist’.²
To summarise, we have the following formula which allows us to sum an infinite
geometric series when |r| < 1.
Sum of an infinite geometric series
The sum of the infinite geometric series,
\[ a + ar + ar^2 + ar^3 + \cdots, \]
where a is the first term and r is the common ratio, is
\[ \frac{a}{1 - r}, \]
provided that |r| < 1. If |r| ≥ 1, the sum of this series does not exist.
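Example 10.5 Sum the infinite geometric series $\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \cdots$.
This geometric series has an infinite number of terms, the first term is $\frac{1}{2}$ and we can
write it as
\[ \frac{1}{2} + \frac{1}{2}\left(\frac{1}{2}\right) + \frac{1}{2}\left(\frac{1}{2}\right)^2 + \frac{1}{2}\left(\frac{1}{2}\right)^3 + \cdots, \]
so the common ratio is $\frac{1}{2}$. As $\left|\frac{1}{2}\right| < 1$, we can use the formula to see that
\[ \frac{\frac{1}{2}}{1 - \left(\frac{1}{2}\right)} = \frac{\frac{1}{2}}{\frac{1}{2}} = 1 \]
is the sum of this infinite geometric series.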
² For reasons we won’t go into here, the sum of an infinite geometric series doesn’t exist when |r| = 1
either. The r = 1 case is obvious, but the r = −1 case is harder to understand.
Example 10.6 Sum the infinite geometric series $\frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \cdots$.
This geometric series has an infinite number of terms, the first term is $\frac{1}{2}$ and we can
write it as
\[ \frac{1}{2} + \frac{1}{2}\left(-\frac{1}{2}\right) + \frac{1}{2}\left(-\frac{1}{2}\right)^2 + \frac{1}{2}\left(-\frac{1}{2}\right)^3 + \cdots, \]
so the common ratio is $-\frac{1}{2}$. As $\left|-\frac{1}{2}\right| < 1$, we can use the formula to see that
\[ \frac{\frac{1}{2}}{1 - \left(-\frac{1}{2}\right)} = \frac{\frac{1}{2}}{\frac{3}{2}} = \frac{1}{3} \]
is the sum of this infinite geometric series.
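To see the idea of ‘letting n go off to infinity’ numerically, the following Python sketch prints partial sums of the two series above (first term 1/2, common ratios 1/2 and −1/2) next to the value a/(1 − r). The function names are hypothetical, chosen only for this illustration.

```python
def partial_sum(a, r, n):
    """Sum of the first n terms a + a*r + ... + a*r**(n-1)."""
    return sum(a * r**k for k in range(n))


def infinite_sum(a, r):
    """The value a/(1 - r), which is only meaningful when |r| < 1."""
    if abs(r) >= 1:
        raise ValueError("the sum does not exist when |r| >= 1")
    return a / (1 - r)


for r in (0.5, -0.5):
    for n in (5, 10, 20, 40):
        print(f"r = {r}, n = {n}: partial sum = {partial_sum(0.5, r, n):.10f}")
    print(f"r = {r}: a/(1 - r) = {infinite_sum(0.5, r):.10f}")
```

As n increases, the partial sums settle down towards a/(1 − r), which is what the formula for the infinite sum captures.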
10.2 Financial applications of geometric series

We now look at how geometric series can be used to model the value of different
investment schemes. We start with a regular saving plan where, after an initial deposit,
a saver chooses to invest a certain additional amount every year and we ask, what is his
balance after a certain number of years? We then look at annuities. These involve an
initial lump sum investment which provides the investor with a certain income at the
end of each year. In this case, we are interested in how large this annual income can be
for a certain number of years given the size of the initial investment. Lastly, we look at
present values which, given a choice of several different investment opportunities, allow
us to determine which one is the best.
10.2.1 Regular saving plans

When one invests in a bank account, it is common to invest a certain amount
at regular intervals instead of just making one lump sum payment. For example,
if you decide to invest £600 at the beginning of each year in an account which pays
annually compounded interest at a rate of 12% per year, what would the balance of the
account be at the beginning of the fourth year of the investment (just after that year’s
£600 has been invested)? We can work this out as follows.
At the end of the first year, the balance of the account is 600(1.12).
At the beginning of the second year, another £600 is added to the account making
the balance 600 + 600(1.12) and so, at the end of the second year, the balance is
\[ [600 + 600(1.12)](1.12) = 600(1.12) + 600(1.12)^2. \]
At the beginning of the third year, another £600 is added to the account making
the balance $600 + 600(1.12) + 600(1.12)^2$ and so, at the end of the third year, the
balance is
\[ [600 + 600(1.12) + 600(1.12)^2](1.12) = 600(1.12) + 600(1.12)^2 + 600(1.12)^3. \]
Now, if we add another £600 at the beginning of the fourth year, this means that the
balance of the account is now
\[ 600 + 600(1.12) + 600(1.12)^2 + 600(1.12)^3. \]
This is a geometric series of four terms with a first term of 600 and a common ratio of 1.12
which means that, using the formula above, the balance we seek is given by
\[ 600 \times \frac{1 - (1.12)^4}{1 - 1.12} = 600 \times \frac{1 - (1.12)^4}{-0.12} = 5{,}000[(1.12)^4 - 1] = 2{,}867.595, \]
or £2,867.60 (to the nearest penny) if we use the fact that, to 6dp, $1.12^4 = 1.573519$.
Similarly, if we wanted to follow this investment scheme for a longer period of time, for
example if we wanted to calculate the balance at the beginning of the twenty-sixth year
(just after that year’s £600 has been invested), we would need to sum the geometric
series
\[ 600 + 600(1.12) + 600(1.12)^2 + 600(1.12)^3 + \cdots + 600(1.12)^{25}, \]
which has 26 terms. So, again using our formula, the balance is given by
\[ 600 \times \frac{1 - (1.12)^{26}}{1 - 1.12} = 5{,}000[(1.12)^{26} - 1] = 90{,}200.36, \]
pounds (to the nearest penny) if we use the fact that, to 6dp, $1.12^{26} = 19.040072$.
Indeed, more generally, we can see that if we wanted to calculate the balance at the
beginning of the nth year (just after that year’s £600 has been invested), we would
need to sum the geometric series
\[ 600 + 600(1.12) + 600(1.12)^2 + 600(1.12)^3 + \cdots + 600(1.12)^{n-1}, \]
which has n terms. And so, using the formula again, we see that
\[ 600 \times \frac{1 - (1.12)^n}{1 - 1.12} = 5{,}000[(1.12)^n - 1], \]
is the balance of the account at the beginning of the nth year.
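The year-by-year reasoning used above can also be carried out mechanically. The following Python sketch, whose function names are our own and purely illustrative, builds the balance one year at a time and compares it with the geometric series formula for a general deposit and interest rate.

```python
def balance_by_simulation(deposit, rate, n):
    """Balance just after the deposit made at the beginning of year n."""
    balance = deposit              # deposit at the beginning of the first year
    for _ in range(n - 1):
        balance *= 1 + rate        # a year's interest is added
        balance += deposit         # the next year's deposit is made
    return balance


def balance_by_formula(deposit, rate, n):
    """The same balance from the finite geometric series formula."""
    ratio = 1 + rate
    return deposit * (1 - ratio**n) / (1 - ratio)


# The two cases worked in the text: 600 pounds a year at 12% per year.
for n in (4, 26):
    print(n,
          round(balance_by_simulation(600, 0.12, n), 2),
          round(balance_by_formula(600, 0.12, n), 2))
```

The code uses the unrounded value of each power of 1.12, so intermediate figures can differ slightly from those obtained with 6dp values, although the answers agree to the nearest penny here.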
10.2.2 Annuities
If we invest a certain amount of money, P , in a bank account that pays annually
compounded interest at a rate of 100r% per year, we may want to set up an annuity.
This is where, at the end of each of the next n years, we receive a payment of I from
the account. The question then is, under these circumstances, how much can we afford
to withdraw each year? If we withdraw too much or for too long a time, the money in
the account will run out. But, if we withdraw too little or for too short a time, we have
put too much money in the account. How can we model an annuity so that we can be
sure that we are investing in a wise and sustainable way?
For example, suppose that we decide to invest £10,000 in an account which pays
annually compounded interest at a rate of 5% per year in order to set up an annuity
that will pay £I at the end of each year for the next ten years. What, we may ask, is
the balance of the account after this annuity’s last payment?
Well, consider that the balance in the account can be modelled as follows. Given an
initial investment of £10,000, we can see that:
At the end of the first year, the balance in the account is $10{,}000(1.05)$ and so, if we
make our first withdrawal of I, the balance is now $10{,}000(1.05) - I$.
At the end of the second year, the balance of the account is
\[ [10{,}000(1.05) - I](1.05) - I = 10{,}000(1.05)^2 - I(1.05) - I, \]
after we have made our second withdrawal of I.
At the end of the third year, the balance of the account is
\[ [10{,}000(1.05)^2 - I(1.05) - I](1.05) - I = 10{,}000(1.05)^3 - I(1.05)^2 - I(1.05) - I, \]
after we have made our third withdrawal of I.
And so on until. . .
At the end of the tenth year, the balance of the account is
\[ 10{,}000(1.05)^{10} - I(1.05)^9 - I(1.05)^8 - \cdots - I(1.05) - I, \]
after we have made our tenth withdrawal of I.
Now, this is the balance of the account after the annuity’s last payment and so, if we
call this B, we have

\[ B = 10{,}000(1.05)^{10} - I\left[1.05^9 + 1.05^8 + \cdots + 1.05 + 1\right], \]
and, in particular, if we consider the series in the big square brackets we see that we
have
\[ 1 + 1.05 + \cdots + 1.05^8 + 1.05^9, \]
which is a geometric series with first term one, common ratio 1.05 and ten terms. So,
using the formula above, we see that this gives us
\[ 1 \times \frac{1 - 1.05^{10}}{1 - 1.05} = \frac{1 - 1.05^{10}}{-0.05} = -20(1 - 1.05^{10}), \]
and so the balance we seek is given by
\[ B = 10{,}000(1.05)^{10} - I\left[-20(1 - 1.05^{10})\right] = 10{,}000(1.05)^{10} - 20I[1.05^{10} - 1]. \]
We can now ask, with this annuity, how big can the withdrawals be? The key to
answering this question is to note that if our annual withdrawal, I, is too big, then at
some point before this ten year period has elapsed, the account will run out of money
and the balance will become negative. That is, if I is too big, the bank will stop
allowing us to make the withdrawals and the annuity will fail to achieve its purpose. So,
we need to see what values of I give us a balance, B, which is still non-negative after
ten years. But, if we need B ≥ 0, this means that we must have
\[ 10{,}000(1.05)^{10} - 20I[1.05^{10} - 1] \geq 0, \]
and this can be rearranged to give us
\[ 10{,}000(1.05)^{10} \geq 20I[1.05^{10} - 1] \implies \frac{10{,}000(1.05)^{10}}{20[1.05^{10} - 1]} \geq I, \]
as $1.05^{10} - 1 > 0$. This means that we have
\[ I \leq \frac{500(1.05)^{10}}{1.05^{10} - 1} = 1{,}295.0453, \]
if we use the fact that, to 6dp, $1.05^{10} = 1.628895$. That is, rounding down to a whole number
of pence so that B stays non-negative, the maximum withdrawal we can make each year is £1,295.04.
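The same model can be explored numerically. In the Python sketch below, `annuity_balance` follows the account year by year exactly as in the steps above, while `max_withdrawal` uses the bound on I written for a general principal P, rate r and number of years n, namely $I \leq Pr(1 + r)^n / [(1 + r)^n - 1]$. This general form is our own extension of the worked case (it reduces to the expression above when P = 10,000, r = 0.05 and n = 10), and both function names are hypothetical.

```python
def annuity_balance(principal, rate, withdrawal, years):
    """Balance after `years` end-of-year withdrawals of a fixed amount."""
    balance = principal
    for _ in range(years):
        # Interest is added for the year, then the withdrawal is made.
        balance = balance * (1 + rate) - withdrawal
    return balance


def max_withdrawal(principal, rate, years):
    """Largest withdrawal for which the final balance is non-negative."""
    growth = (1 + rate) ** years
    return principal * rate * growth / (growth - 1)


limit = max_withdrawal(10_000, 0.05, 10)
print(f"upper bound on I: {limit:.4f}")

# A withdrawal just below the bound leaves a small non-negative balance,
# whereas one just above it makes the final balance negative.
for withdrawal in (1295.04, 1295.05):
    print(withdrawal, annuity_balance(10_000, 0.05, withdrawal, 10))
```

Because the code uses the unrounded value of $1.05^{10}$, the bound it prints can differ in the later decimal places from the figure obtained with the 6dp value.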
Activity 10.5 Assuming that we make this maximum withdrawal at the end of
each year, what is the balance of the account after the last of these withdrawals?
Activity 10.6 Alternatively, suppose that we want this annuity to pay out £1,500
at the end of each year. How many of these withdrawals will we be able to make?
[Note that, to 2dp, $\log_{1.05}\left(\frac{3}{2}\right) = 8.31$.]

10.2.3 Future and present values

The last thing we want to consider about investments is how to compare them. In
particular, we want to be able to compare investments which give us different returns at
different times, i.e. different future values, by considering what we shall call their
present value. This is the value of an investment at the present time, i.e. now, and the
general idea is that the investment with the largest present value is the one that is
giving us the best return. Let’s consider how this works.
Suppose that we have a principal, P , to invest for n years and the interest rate available
to us is 100r% per year compounded annually. In this case, we have
\[ V = P(1 + r)^n, \]
and V is the future value of this investment, i.e. how much it will be worth after n
years. We can also see that P is the present value of this investment since that is what
it is worth to us now.
But, what if we have been promised an amount V at some point in the future? What,
we may ask, is this worth to us now? To be more specific, let’s assume that we will get
the money after n years and that an interest rate of 100r% per year compounded
annually is available to us. To see what it is worth to us now, i.e. its present value, we
ask how much we would have to invest now in order to get V at that time in the future.
In this case, we would need to invest an amount, P , such that
\[ V = P(1 + r)^n \quad \text{which means that} \quad P = \frac{V}{(1 + r)^n}, \]
is the present value of an amount V available to us after n years assuming that the
interest rate is 100r% per year compounded annually.
This is useful to us because present values allow us to compare different amounts of
money which we may get at different times in the future by considering what they are
worth to us at some specific time, i.e. now. Let’s look at an example of how this works.
Example 10.7 Suppose that you have to choose between a gift of £20,000 in ten
years’ time or a gift of £30,000 in twenty years’ time. Which should you choose
given that an interest rate of 10% per year compounded annually is available to you?
Given that an interest rate of 10% per year compounded annually is available to
you, the present value of £20,000 in ten years’ time is
\[ \frac{20{,}000}{\left(1 + \frac{10}{100}\right)^{10}} = \frac{20{,}000}{(1.1)^{10}} = 7{,}710.8672 \]
or £7,710.87 (to the nearest penny) if we use the fact that, to 6dp, $1.1^{10} = 2.593742$
whereas the present value of £30,000 in twenty years’ time is
\[ \frac{30{,}000}{\left(1 + \frac{10}{100}\right)^{20}} = \frac{30{,}000}{(1.1)^{20}} = 4{,}459.3088 \]
or £4,459.31 (to the nearest penny) if we use the fact that, to 6dp, $1.1^{20} = 6.727500$.
Thus, you should choose the £20,000 in ten years’ time as it is worth more to you
now.³
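The comparison in this example can be reproduced with a few lines of Python. This is only a sketch of the two formulae $P = V/(1 + r)^n$ and $V = P(1 + r)^n$; the function names `present_value` and `future_value` are our own.

```python
def present_value(amount, rate, years):
    """What an amount received after `years` years is worth now."""
    return amount / (1 + rate) ** years


def future_value(principal, rate, years):
    """What a principal invested now is worth after `years` years."""
    return principal * (1 + rate) ** years


# The two gifts from the example, with 10% per year compounded annually.
gift_a = present_value(20_000, 0.10, 10)
gift_b = present_value(30_000, 0.10, 20)
print(round(gift_a, 2), round(gift_b, 2))
print("choose the first gift" if gift_a > gift_b else "choose the second gift")
```

Since discounting and compounding undo each other, `future_value(present_value(V, r, n), r, n)` returns V again (up to rounding), which is why comparing present values and comparing future values over a common period lead to the same choice.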
Present values can also be used to see what an annuity is worth as we can find the
present value of each payment and hence the present value of the annuity as a whole.
Let’s look at an example.
Example 10.8 You win a competition and you can claim a prize of £10,000 now
or an annuity which pays £1,000 at the end of each year for ten years. Which
should you choose given that an interest rate of 5% per year compounded annually is
available to you?
The present value of the first annuity payment is $1{,}000/1.05$, the second is
$1{,}000/1.05^2$, and so on until the tenth which has a present value of $1{,}000/1.05^{10}$.
Thus, the present value of all the annuity payments is
\[ \frac{1{,}000}{1.05} + \frac{1{,}000}{1.05^2} + \cdots + \frac{1{,}000}{1.05^{10}}. \]
This is a geometric series with a first term of 1,000/1.05, a common ratio of 1/1.05
and ten terms which means that, using the formula for the sum of a geometric series,
we see that the present value of this annuity is
\[
\begin{aligned}
\frac{1{,}000}{1.05} \times \frac{1 - \left(\frac{1}{1.05}\right)^{10}}{1 - \frac{1}{1.05}}
&= \frac{1{,}000}{1.05} \times \frac{1 - \frac{1}{1.05^{10}}}{\frac{0.05}{1.05}} \\
&= \frac{1{,}000}{0.05}\left(1 - \frac{1}{1.05^{10}}\right) \\
&= 20{,}000\left(1 - \frac{1}{1.05^{10}}\right) \\
&= 7{,}721.74
\end{aligned}
\]
pounds (to the nearest penny) if we use the fact that, to 6dp, $1.05^{10} = 1.628895$. As
such, when choosing your prize, you should opt for the £10,000 lump sum as that is
worth more to you now.
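As a check on this calculation, the Python sketch below finds the present value of the annuity in two ways: by discounting each payment separately and adding the results, and by using the geometric series formula with first term payment/(1 + r) and common ratio 1/(1 + r). The function names are hypothetical and chosen only for this illustration.

```python
def annuity_pv_direct(payment, rate, years):
    """Add up the present values of the individual end-of-year payments."""
    return sum(payment / (1 + rate) ** k for k in range(1, years + 1))


def annuity_pv_formula(payment, rate, years):
    """The same total from the finite geometric series formula."""
    ratio = 1 / (1 + rate)            # common ratio of the series
    first_term = payment * ratio      # present value of the first payment
    return first_term * (1 - ratio**years) / (1 - ratio)


direct = annuity_pv_direct(1_000, 0.05, 10)
formula = annuity_pv_formula(1_000, 0.05, 10)
print(round(direct, 2), round(formula, 2))
print("lump sum is better" if 10_000 > formula else "annuity is better")
```

The code works with the unrounded value of $1.05^{10}$, so its answer can differ by a penny from one obtained using the 6dp value; the conclusion of the example is unaffected.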
³ For example, you could take the £20,000 in ten years’ time and invest it for the following
ten years to get a return of $20{,}000\left(1 + \frac{10}{100}\right)^{10} = 20{,}000(1.1)^{10} = 51{,}874.85$ pounds (to the nearest
penny) in twenty years’ time. This is far better than just receiving £30,000 after the same amount of time!
We also observe, in passing, that £51,874.85 is the future value, in twenty years’ time, of getting
£20,000 in ten years’ time and investing it. So, in terms of future values over a common period of time,
we should, again, opt for the £20,000 in ten years’ time!
Activity 10.7 Use present values to determine how many years they would have to
pay the annuity for in order for it to be a better prize than the lump sum.
[Note that, to 2dp, $\log_{1.05}(2) = 14.21$.]
Activity 10.8 Suppose that the annuity was a perpetuity, i.e. you would get
£1,000 at the end of each year forever. What is the present value of this perpetuity?
Activity 10.9 Why is your answer to the previous activity not a surprise?
Learning outcomes
identify an arithmetic sequence and sum an arithmetic series;
identify a geometric sequence and sum a finite geometric series;
find the sum of an infinite geometric series when it exists;
solve problems that involve regular savings plans and annuities;
use present values to compare investments.
Exercises
Exercise 10.1
Find the sums of the following arithmetic series.
i. $1 + 2 + 3 + \cdots + 10$; ii. $1 + 2 + 3 + \cdots + n$;
iii. $5 + 0 - 5 - 10$; iv. $5 + 0 - 5 - \cdots - 5n$.
Exercise 10.2
Find the sums of the following geometric series.
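i. $1 + \frac{1}{2} + \frac{1}{2^2} + \frac{1}{2^3}$;
ii. $3 - 6 + 12 - 24 + 48 - 96$;
iii. $3 - 6 + 12 - 24 + \cdots + 3(-2)^n$;
iv. $1 + \frac{1}{3} + \frac{1}{9} + \frac{1}{27} + \cdots$;
v. $1 - \frac{1}{2} + \frac{1}{4} - \frac{1}{8} + \cdots$;
vi. $\frac{1}{4} - \frac{1}{16} + \frac{1}{64} - \frac{1}{256} + \cdots$.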
Exercise 10.3
Suppose that, at the beginning of each year, you pay £500 into a savings account paying
7% interest per year. How much will be in the account at the end of the eighth year?
[Note that, to 6dp, $1.07^8 = 1.718186$.]
Exercise 10.4
Suppose that you invest £10,000 in a bank account that pays 5% interest per year. If
you want to withdraw £900 at the end of each year, how many years will you be able to
do this for?
[Note that, to 2dp, $\log_{1.05}\left(\frac{9}{4}\right) = 16.62$.]
Exercise 10.5
You win a competition and can choose between the following prizes.
(i) £50,000 now.
(ii) £10,000 at the end of each year for seven years.
(iii) £100,000 in ten years’ time.

Given that an interest rate of 8% per year compounded annually is available to you,
which one should you choose?
[Note that, to 6dp, $1.08^7 = 1.713824$ and $1.08^{10} = 2.158925$.]
Exercise 10.6
You borrow £1,200 from your bank which requires that you repay the loan in monthly
instalments over two years. If interest is charged at 12% per annum using monthly
compounding, how much will you have to pay back each month?
[Note that, to 6dp, $1.01^{24} = 1.269735$.]
Part 2
Statistics
Introduction to Statistics
Syllabus
This half of the course introduces some of the basic ideas of theoretical statistics,
emphasising the applications of these methods and the interpretation of tables and
results. The Statistics part of this course has the following syllabus.
Data exploration. The statistics part of the course begins with basic data
analysis through the interpretation of graphical displays of data. Univariate,
bivariate and categorical situations are considered, including time series plots.
Distributions are summarised and compared and their patterns discussed.
Descriptive statistics are introduced to explore measures of location and dispersion.
Probability. The world is an uncertain place and probability allows this
uncertainty to be modelled. Probability distributions are explored to describe how
likely different values of a random variable are expected to be. The normal
distribution is introduced and its importance in statistics is discussed. The concept
of a sampling distribution is explored.
Sampling and experimentation. An overview of data collection methods is
followed by how to design and conduct surveys and experiments in the social
sciences. Particular attention is given to sources of bias and conclusions which can
be drawn from observational studies and experiments.
Fundamentals of regression. An introduction to modelling a linear relationship
between variables. Interpretation of computer output to assess model adequacy.
Aims of the course

The aims of the Statistics part of this course are to provide:
a basic knowledge of how to summarise, analyse and interpret data
an insight into the concepts of probability and the normal distribution
an overview of sampling and experimentation in the social sciences
an introduction to modelling a linear relationship between variables.
Treatment is at an elementary mathematical level throughout, so you should be
comfortable with the material covered in ‘Unit 1: Review I – A review of some basic
mathematics’ of the Mathematics part of the subject guide.
Learning outcomes for the course (Statistics)
At the end of the Statistics part of the course, you should be able to:
interpret and summarise raw data on social science variables graphically and
numerically
appreciate the concepts of a probability distribution, modelling uncertainty and the
normal distribution
design and conduct surveys and experiments in a social science context
model a linear relationship between variables and interpret computer output to
assess model adequacy.
Textbook
As previously mentioned in the main introduction, this subject guide has been designed
to act as your principal resource. The following textbook is referenced throughout the
Statistics part of the course.
Swift, L. and S. Piff Quantitative methods for business, management and finance.
This has been indicated as ‘background reading’ meaning it is not essential, but you
could benefit from reading it if you find any of the material in the subject guide difficult
to follow.
11. Data exploration I – The nature of statistics
Unit 11: Data exploration I

The nature of statistics
Overview
We begin the Statistics section of the course with data exploration, arguably the single
most important part of any data analysis. To make sense of any data, we must first
‘understand’ the basic features of each variable under consideration. Visualising data
communicates a wealth of information to even non-technical audiences. Data
exploration offers different ways of presenting data graphically depending on the type
of variable(s) being explored. We then move on to descriptive statistics (measures of
location and measures of dispersion) which are commonly-used statistics in the social
sciences whose roles are to ‘describe’ or ‘summarise’ data numerically.
Aims
This unit explains the nature of statistics providing a gentle introduction to the
discipline. The concept of ‘data’ is explored including the different types of data which
may be obtained. The role of statistics in the research process is also discussed.
Particular aims are:
to demonstrate how social scientists familiarise themselves with datasets prior to
further analysis
to introduce the different types of data which can occur
to explain how statistics can be used to conduct social research.
Background reading
Swift, L. and S. Piff Quantitative methods for business, management and finance.
(Palgrave, 2014) fourth edition [ISBN 9781137376558] ‘Statistics’ Chapter 8.
11.1 Introduction
So just what is ‘Statistics’? Well, there are several possible definitions. A good working
one is:
‘the study of data, involving the collection, classification, summary, display,
analysis and interpretation of numerical information.’
We consider each of these briefly.
Statistics is largely concerned with data. This is a plural noun meaning ‘given things’
or more loosely ‘information’ or ‘facts’.
Sometimes we look at non-numerical data such as sex (‘gender’) or social class, but
usually we are concerned with numerical information. The primary objective is to
determine what the data tell us about the underlying context in business, economics,
society, medicine etc.
11.1.1 Data collection

We can collect data in several ways.
Direct observation. For example, driver behaviour on a motorway.
Simulation of data, by computer, using certain assumptions. For example, what is
the likely effect on traffic flow if the speed limit is changed?
An experiment. For example, some patients are given an active drug and others a
placebo.
A survey. For example, to find out more about consumers or voters (or computers
or cars).
The main distinction between an experiment and a survey is that, in the former case,
there is some sort of intervention by the researcher. Most, although not all, of the
statistics you may go on to carry out (in finance, politics etc.) are likely to be based on
survey data.
11.1.2 Data classification

We mention the types of data shortly – this will have an important impact on how they
should be analysed. It is a very good idea to check and ‘clean’ data in practice to make
sure there are no obvious outliers (anomalous values – more on these later in the unit)
which may need to be excluded, and to ensure there are no recording errors. Of course,
computers play a vital role in all areas of statistics, although they are not used
explicitly in this course.
11.1.3 Data summary

This is discussed in more detail later on in this unit. The idea is to get a quick picture
of what the ‘typical’ data value is, as measured by an average such as the mean; to
assess the spread of the data, as measured typically by the variance; and to see if the
data are symmetric, as measured by the skewness.
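Although computers are not used explicitly in this course, a short Python sketch shows how these three summaries might be computed in practice. The dataset is made up for illustration, `variance` from Python’s standard `statistics` module uses the sample (n − 1) divisor, and the `skewness` function is our own helper based on one common moment-based definition; the definitions used later in the guide may differ in detail.

```python
from statistics import mean, pstdev, variance


def skewness(data):
    """Moment-based skewness: the average cubed standardised deviation."""
    centre = mean(data)
    spread = pstdev(data)          # population standard deviation
    return sum(((x - centre) / spread) ** 3 for x in data) / len(data)


# A small made-up dataset with one unusually large value.
values = [2, 3, 3, 4, 4, 4, 5, 9]

print("mean:", mean(values))
print("variance:", variance(values))   # sample variance (divisor n - 1)
print("skewness:", skewness(values))   # positive: a longer right-hand tail
```

A skewness close to zero suggests roughly symmetric data, while a clearly positive or negative value suggests a longer tail on the right or left respectively.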
11.1.4 Data display

This refers to tables, graphs and charts. The purpose is not to produce a ‘pretty’
picture but to gain insight into the data and their context. A simple display is often
clearest and best. In some cases, the display alone is sufficient – there is no need for any
formal mathematical or statistical study.
11.1.5 Analysis
This is the heavy part of Statistics. Most of the time, the methods used are
well-established, so it is only necessary to learn the relevant technique. It is important
to understand that most methods depend on certain assumptions about the data. If
these assumptions fail to hold, the conclusions are likely to be invalid.
11.1.6 Interpretation
Outside a few universities and research institutes, clear interpretation is vital!
Interpretation should be understandable by managers and others without formal
statistical training. For example, do not say ‘the p-value of 0.02 shows that the result of
the t test is significant’, but rather ‘there is evidence that men and women differ in their
attitudes to a policy of lowering taxes’.¹
11.1.7 Uncertainty
In general, what is being measured is subject to uncertainty, or random variation. For
example, two randomly-chosen groups of 100 voters will typically not give exactly the
same outcomes.
We often wish to establish whether a change or a difference (between men and women,
left-wing and right-wing voters etc.) can be attributed to chance, or whether it is the
result of some real effect. We study probability largely in order to measure this
uncertainty.
11.1.8 Descriptive and inferential statistics

It is convenient to distinguish two approaches.
Descriptive statistics comprises those methods concerned with describing a set
of data so as to yield meaningful interpretations.
Statistical inference comprises those methods concerned with analysing a subset
of data so as to draw conclusions about the entire set of data.
While we are defining things, let us formalise a little. The population is the collection
of all individuals or items under consideration. A sample is that part of the population
from which information is obtained when inference is used.
¹ Note that p-values occur in hypothesis testing, which is a form of statistical inference which you will
meet briefly in the final unit of this course.
Example 11.1 Consider the following scenarios.
A manufacturer of tyres wants to estimate the average life of a tyre. This is an
inferential study – the population consists of all tyres produced, the sample
consists of 100 (or 50, or 500, or 5,000) tyres which are examined.
A sports writer wants to list the times taken to run 100 metres in Olympic
Games over 60 years. This is a descriptive study.
A politician wants to know how many votes were cast for her party in her region
at a recent election. This is a descriptive study.
An economist estimates the average income of all residents of California. This is
an inferential study – the population consists of all residents, the sample
consists of the subset examined.
Notice that in an inferential study it is the properties of the population which we wish
to determine. You could argue that it would be better to examine all population
members. This is known as conducting a census. However, this will usually be slower,
more costly and may sometimes be impossible. Consider a census of all the trees in the
UK, or all the fish in the Atlantic Ocean!
The main thing to ensure is that the sample is representative of the population. This is
most easily done using a ‘simple random sample’, where each population member has
an equal chance of inclusion in the sample, although there are alternatives. We will
explore this further in the ‘Sampling and experimentation’ section of the course.
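To make the idea of a simple random sample concrete, here is a minimal Python sketch using the standard library’s `random` module. The population of 1,000 numbered members, the sample size of 50 and the seed are all made up for illustration.

```python
import random

# A hypothetical population whose members are labelled 1 to 1,000.
population = list(range(1, 1001))

# A fixed seed is used only so that the sketch gives the same sample each run.
rng = random.Random(2013)

# Draw 50 distinct members; every member has the same chance of inclusion.
sample = rng.sample(population, 50)

print(sorted(sample))
print("population mean:", sum(population) / len(population))
print("sample mean:", sum(sample) / len(sample))
```

Running the sketch with a different seed gives a different sample and, typically, a slightly different sample mean, which is the kind of random variation described earlier under ‘Uncertainty’.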
It may not come as a surprise that, generally speaking, descriptive statistics are more
easily carried out than inferential statistics. Sadly, descriptive statistics are often poorly
done, or even omitted completely in practical contexts, as well as student work. This is
a shame because they can tell us a great deal about the data, and can even render
inferential statistics unnecessary. As a rule, any data analysis should start with
descriptive statistics.
11.2 Types of data

There are several types of data, and it is important to know which one we are dealing
with, so that the correct statistical procedure is used.
Categorical data (also known as qualitative data) give information about the
discrete groups into which a population, or sample, is divided. These may be
nominal or ordinal.
• Nominal data are unranked. For example, a group of individuals may be
classified by gender, eye colour or blood type (A, B, AB or O).
• Ordinal data are ranked. They give information about order or rank on a
scale. For example, a group of students may be classified by the letter grades
they receive in an examination (A, B, C etc.). So-called Likert scales are
ordinal (this course is ‘very interesting’, ‘interesting’, ‘quite interesting’, ‘not
very interesting’, ‘boring’). Investments can be graded by risk on an ordinal
scale (for example, ‘high risk’, ‘moderate risk’ or ‘low risk’).
Metric data are numerical values on some continuous scale. They may be interval
or ratio data.
• Interval data are measured on a continuous scale and have the property that
the differences between numbers have a meaning. For example, centigrade
temperatures are interval data – the difference between 15° and 16° is the
same as the difference between 25° and 26°, but both are different from the
difference between 15° and 20°. The current time (for example, 19:34) is also
measured on an interval scale.
• Ratio data are similar to interval data but now there is an absolute zero, and
hence the ratio of two numbers can be given a meaning. For example, height,
weight and the length of time an individual has been alive all constitute ratio
data. In each of these cases, there is a fixed zero – nobody can have a negative
height or weight, or have lived a negative amount of time. In contrast, the zero
for centigrade temperatures or the current time is merely a matter of
convention. (Note that Kelvin temperatures do have an absolute zero and are,
therefore, measured on a ratio scale.)
Example 11.2 In a household survey the following data are collected.
Sex (gender) of the head of household – nominal.
Age of the head of household – ratio.
Thermostat setting in winter – interval.
Household income – ratio.
Time when heating is switched on – interval.
Rating of electricity providers on a 10-point scale – ordinal.
Finally, we mention that many datasets considered are on a single attribute, such as
weight. Such data are called univariate data. Sometimes we wish to consider two
variables together, say the height and weight of a group of individuals. Such data are
called bivariate data. Multivariate data arise when we consider three or more
variables together – perhaps height, weight, age and pulse rate.
There are other ways to classify data and the classification is not always precise.
However, in most cases it is fairly clear and is sufficient for most applications – in
particular, the choice of correct statistical method.
11.3 The role of statistics in the research process

First some definitions:
Research: trying to answer questions about the world in a systematic (scientific) way.
Empirical research: doing research by first collecting relevant information (data)
about the world.
Research may be about almost any topic – physics, biology, medicine, economics,
history, literature etc. Most of our examples will be from the social sciences, i.e. from
economics, management, finance, sociology, political science, psychology etc. Research
in this sense is not just what universities do. Government, business, and all of us as
individuals do it too. Statistics is used in essentially the same way for all of these.
Example 11.3 It all starts with a question.
Can labour regulation hinder economic performance?
Understanding the gender pay gap – what has competition got to do with it?
Does racism affect health?
Children and online risk – powerless victims or resourceful participants?
Refugee protection as a collective action problem – is the European Union
shirking its responsibilities?
Do directors perform for pay?
Heeding the push from below – how do social movements persuade the rich to
listen to the poor?
Does devolution lead to regional inequalities in welfare activity?
The childhood origins of adult socio-economic disadvantage – do cohort and
gender matter?
Parent care as unpaid family labour – how do spouses share?
We can think of the empirical research process as having five key stages.
1. Formulating the research question.
2. Research design – deciding which kinds of data to collect, how and from where.
3. Collecting the data.

4. Analysis of the data to answer the research question.
5. Reporting the answer and how it was obtained.
We conclude this section with an example of how statistics can be used to help answer a
research question.
Example 11.4 CCTV, crime and fear of crime

Our research question is ‘what is the effect of closed-circuit television (CCTV)
surveillance on:
the number of recorded crimes?
the fear of crime felt by individuals?’
We illustrate this using part of the following study:
Gill and Spriggs (2005): Assessing the impact of CCTV. Home Office Research
Study 292.
The research design of the study comprised the following.

Target area: a housing estate in northern England.
Control area: a second, comparable housing estate.
Intervention: CCTV cameras installed in the target area but not in the
control area.
Compare measures of crime and fear of crime in the target and control areas,
in the twelve months before and twelve months after the intervention.
The data and data collection were as follows.
Level of crime: the number of crimes recorded by the police in the twelve
months before and twelve months after the intervention.
Fear of crime: a survey of residents of the areas.

• Respondents: random samples of residents in each of the areas.
• In each area, one sample before the intervention date and one about twelve
months after.
• Sample sizes:
                   Before   After
    Target area      172     168
    Control area     215     242
• The question considered here is ‘in general, how much, if at all, do you
worry that you or other people in your household will be victims of crime?’
(from 1 = ‘worry all the time’ to 5 = ‘never worry’).
Statistical analysis of the data:

% of respondents who worry ‘sometimes’, ‘often’ or ‘all the time’:
            Target                        Control
  [a]       [b]                 [c]       [d]                           Confidence
  Before    After     Change    Before    After     Change    RES       interval
  26        23        −3        53        46        −7        0.98      (0.55, 1.74)
It is possible to calculate various statistics such as the Relative Effect Size
(RES = ([d]/[c])/([b]/[a]) = 0.98), which is a summary measure for comparing
the changes in the two areas.
Here RES < 1, which means that the observed change in the reported fear of
crime has been slightly less favourable in the target area than in the control area.
However, there is uncertainty because of sampling – only between 168 and 242
individuals were interviewed in each area at each time.
The confidence interval for RES includes 1, which means that changes in the
self-reported fear of crime in the two areas are not statistically significantly
different from each other.
The number of (any kind of) recorded crimes:
            Target area                   Control area
  [a]       [b]                 [c]       [d]                           Confidence
  Before    After     Change    Before    After     Change    RES       interval
  112       101       −11       73        88        15        1.34      (0.79, 1.89)
Now RES = 1.34 > 1 which means that the observed change in the number of
crimes has been worse in the control area than in the target area.
However, the numbers of crimes in each area are fairly small which means that
these estimates of the changes in crime rates are fairly uncertain.
The confidence interval for RES again includes 1 which means that the changes
in crime rates in the two areas are not statistically significantly different from
each other.
In summary, this study did not support the claim that the introduction of CCTV
reduces crime or the fear of crime.
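The two RES values quoted in this example can be reproduced from the ‘before’ and ‘after’ figures alone. The short Python sketch below is only an arithmetic check of the formula RES = ([d]/[c])/([b]/[a]); the function name relative_effect_size is our own label, and the confidence intervals are not recomputed because the method used to obtain them is not given here.

```python
def relative_effect_size(a, b, c, d):
    """RES = ([d]/[c]) / ([b]/[a]).

    a, b: 'before' and 'after' values in the target area
    c, d: 'before' and 'after' values in the control area
    """
    return (d / c) / (b / a)


# Fear of crime: % of respondents who worry (target 26 -> 23, control 53 -> 46).
res_fear = relative_effect_size(26, 23, 53, 46)

# Recorded crimes (target 112 -> 101, control 73 -> 88).
res_crime = relative_effect_size(112, 101, 73, 88)

print(round(res_fear, 2))   # 0.98
print(round(res_crime, 2))  # 1.34
```

A value of exactly 1 would mean that the proportional change was the same in both areas.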
If you want to read more about research regarding this question, see:
Welsh and Farrington (2008) Effects of closed circuit television surveillance on
crime. Campbell Systematic Reviews 2008:17.
See also http://www.campbellcollaboration.org/library.php.
Many of the statistical terms and concepts mentioned above were not explained.
However, the example illustrates how statistics can be employed in the
social sciences to investigate research questions.
Activities 11.1, 11.2 and 11.3 are not concerned with any technicalities of statistics, and
they do not ask you to do any calculations yourself (except, perhaps, a little bit in
Activity 11.2). Instead these exercises invite you to think about various topics related to
the use of statistics, and to research design more generally. These include such issues as
the definition and measurement of variables, the selection of subjects for studies and the
justifiability of claims about causes and effects.
You are asked to think of answers to the questions, using your own reasoning and
common sense. You are welcome to discuss the questions with friends.
You do not need to worry about getting the answers right or wrong – the only point is
to start thinking!
Activity 11.1 Consider the following statements. Do you think the conclusions are
valid? If so, say why. If not, indicate why not – because the logic used is faulty,
because any assumptions made are dubious, because the data collection method is
inappropriate, or for any other reason.
(a) ‘10% of drivers involved in 100 car accidents had previously taken substance X.
A parallel study of drivers not involved in accidents showed that only 1% had
taken substance X. Therefore, substance X is a contributory cause of car
accidents.’
(b) ‘Five years ago, the average stay of patients in this hospital was 21 days. Now it
is 16 days. We now cure our patients more quickly.’
(c) ‘We wanted to see if the public approved of our plans to transfer resources to
elderly patients. We carried out a large-scale survey based on 800 daytime city
centre interviews. We found 79% of respondents approved of our plans.
Therefore, we have public backing.’
(d) ‘Nugro is the revolutionary hair restorer for men. A sample of 100 men with
thinning hair was selected to apply Nugro lotion every day for a month. Of
these, 77 reported new hair growth. Nugro is proven to be effective in the
treatment of male baldness.’
Activity 11.2 The following cross-tabulation shows data on the 3,593 people who
applied to graduate study at the University of California, Berkeley, in 1973. The
table classifies the applicants according to their sex, and whether or not they were
admitted to the university.
Admitted
Sex No Yes % Yes Total
Male 1,180 686 36.8 1,866
Female 1,259 468 27.1 1,727
Total 2,439 1,154 32.1 3,593
The table shows that 36.8% of male applicants, but only 27.1% of female applicants,
were admitted.
Bob observes this and concludes that in that year Berkeley practised discrimination
against female applicants.
Amy, however, decides to take another look at the statistics. She adds one more piece
of data, the department to which each person applied, and creates cross-tabulations
separately for each department (which are labelled A, B, C, D and E). These tables
are shown below. For example, the first table cross-classifies the sex and admission
status of just those 585 people who applied to Department A, and so on.
Amy examines her tables and states that she disagrees with Bob – there is no
evidence of discrimination. Why does she conclude this? Why do Amy and Bob
come to different conclusions? Which one do you agree with?
                        Admitted
Department  Sex         No      Yes     % Yes   Total
A           Male        207     353     63.0      560
            Female        8      17     68.0       25
            Total       215     370     63.2      585
B           Male        205     120     36.9      325
            Female      391     202     34.1      593
            Total       596     322     35.1      918
C           Male        279     138     33.1      417
            Female      244     131     34.9      375
            Total       523     269     34.0      792
D           Male        138      53     27.7      191
            Female      299      94     23.9      393
            Total       437     147     25.2      584
E           Male        351      22      5.9      373
            Female      317      24      7.0      341
            Total       668      46      6.4      714
Total                 2,439   1,154     32.1    3,593
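The ‘% Yes’ columns in Activity 11.2 can be recomputed from the counts, which is a useful check before comparing the overall table with the departmental ones. The Python sketch below is illustrative only: the dictionary layout and the helper percent_yes are our own choices, and the code reproduces the percentages without interpreting them.

```python
# (admitted, total applicants) by department and sex, copied from the tables.
applications = {
    "A": {"Male": (353, 560), "Female": (17, 25)},
    "B": {"Male": (120, 325), "Female": (202, 593)},
    "C": {"Male": (138, 417), "Female": (131, 375)},
    "D": {"Male": (53, 191), "Female": (94, 393)},
    "E": {"Male": (22, 373), "Female": (24, 341)},
}


def percent_yes(admitted, total):
    """Percentage admitted, rounded to one decimal place."""
    return round(100 * admitted / total, 1)


# Department-level percentages.
for dept, by_sex in applications.items():
    for sex, (admitted, total) in by_sex.items():
        print(dept, sex, percent_yes(admitted, total))

# Pool the five departments to recover the overall table.
overall = {}
for sex in ("Male", "Female"):
    admitted = sum(applications[d][sex][0] for d in applications)
    total = sum(applications[d][sex][1] for d in applications)
    overall[sex] = (admitted, total, percent_yes(admitted, total))

print(overall)
```

Note how unevenly the male and female applicants are spread across the departments when you compare the totals in each row.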
Activity 11.3 Each of the statements below mentions a piece of statistical
evidence, and a claim based on it. Do you agree with the claims? Why or why not?
Are there any fallacies in the claims, or complications which are being glossed over?
The questions marked with (†) are a bit more subtle and complex than the rest.
(a) A public consultation exercise on attitudes to genetically modified (GM) food
was carried out in the UK in 2002–03. This involved various events where
interested members of the public could come and take part in discussions about
GM food. After the events, the participants were asked to complete a
questionnaire, which was also available on a website. Around 37,000 people
completed the questionnaire, and 90% of those expressed opposition to GM
foods. Therefore, a very large majority of the people in the UK oppose GM
foods.
(b) In a study of the ages and professions of people who had died, it was found that
the profession with the lowest average age of death was ‘student’. Therefore,
being a student is the most dangerous of professions.
(c) In 2007, the official suicide rate in Sweden was 15.8 per 100,000 people per year.
This was much higher than in many other countries, some of which even had a
rate of 0.0. This indicates that suicide is a much more serious problem in
Sweden than in those other countries.
(d) Data over the past 10 years in a country show that the number of deaths from
drowning tends to be higher in months when the total consumption of ice cream
is high. Therefore, eating ice cream before going swimming increases the risk of
drowning.
(e)† A country has two kinds of secondary schools – private schools and state-owned
schools. Statistics show that 40% of those graduating from private schools, but
only 20% of those graduating from state schools, go on to study at a university.
Therefore, private schools are twice as good as state schools.
(f)† Sociologists conduct a study where they select a random sample of people and
ask these people for a list of their close friends. A random sample of the people
named as friends is then contacted and the survey is repeated. The people
sampled at the second stage have, on average, many more friends than do the
people in the original sample. Therefore, your friends have more friends than
you do.
11.4 Summary
This introductory unit has outlined the purpose of statistics and the role the discipline
plays in the research process. Preliminary considerations of issues relating to data
collection and analysis were discussed, as well as the different types of data which exist.
Having spent some time thinking about the nature of statistics, you are now ready to
start doing statistics, beginning with data visualisation in the next unit.
11.5 Key terms and concepts

Bivariate data Categorical data
Census Data
Descriptive statistics Direct observation
Experiment Interval data
Metric data Multivariate data
Nominal data Ordinal data
Population Probability
Ratio data Research
Sample Simulation
Statistical inference Survey
Univariate data
Learning outcomes
outline issues relating to data collection and analysis
describe the different types of data
explain the role of statistics in the research process
discuss the key terms and concepts introduced in this unit.
Exercises
Exercise 11.1
The given working definition of ‘Statistics’ was:
‘the study of data, involving the collection, classification, summary, display,
analysis and interpretation of numerical information.’
What does this mean?
Exercise 11.2
Briefly discuss the distinction between descriptive statistics and inferential statistics.
Exercise 11.3
Explain the different types of data which can occur.
Exercise 11.4
What is the measurement level for each of the following variables?
(a) The quality ranking of a newspaper.
(b) The classification of an examination result as ‘Distinction’, ‘Merit’, ‘Pass’ or ‘Fail’.
(c) Country of birth.
(d) Favourite music.
(e) Income measured by percentiles (for example, if someone’s income is above the
20th percentile, this means 20% of the population earn less).
Exercise 11.5
In 2009 the UK government reclassified cannabis from a Class C drug to a Class B
drug, thereby introducing the threat of arrest for possession of the drug. The following
table cross-classifies age and agreement with the reclassification.
                 Agree with the reclassification?
Age              Disagree   Unsure   Agree   Total
18–39              50%       30%      20%    100%
40–59               ?         ?        ?     100%
60 and over         ?         ?        ?     100%
Complete the table in such a way that there is a weak positive association between age
and agreement. (Assume the measurement scale of agreement as given in the table is an
ordinal one.)
12. Data exploration II – Data visualisation
Unit 12: Data exploration II

Data visualisation
Overview
Graphical representations of data provide us with a useful view of the distribution of
variables. In this unit, we shall cover a selection of approaches for displaying data
visually – each being appropriate in certain situations. In the next unit we consider
descriptive statistics, whose main objective is to interpret key features of a dataset
numerically. Graphs and charts have little intrinsic value per se; rather, their main
function is to bring out interesting features of a dataset. For this reason, simple
descriptions should be preferred to complicated graphics.
Aims
This unit explains the importance of data visualisation and its role in communicating
the underlying distribution of data. Particular aims are:
to provide a basic knowledge of how to summarise, analyse and interpret data
visually
to recommend appropriate graphical methods for different types of variables.
Background reading
(Palgrave, 2014) fourth edition [ISBN 9781137376558] ‘Describing data’ Chapter 1.
12.1 Grouping data

Consider the monthly expenditure, in pounds, on credit cards by 300 individuals.
141.24   −25.00   82.23   233.90   0.00
79.50   0.00   6.41   59.63   102.71   etc.
The second observation of −£25 indicates negative expenditure – presumably a refund
on a previously purchased item. It is difficult to interpret the data when they are
presented simply as a long list of numbers. However, we can first group the data into
classes and then find out how many data points are in each class.
Expenditure Number of individuals Expenditure Number of individuals

[−25, 25) 87 [575, 625) 3
[25, 75) 55 [625, 675) 2
[75, 125) 30 [675, 725) 3
[125, 175) 24 [725, 775) 1
[175, 225) 23 [775, 825) 4
[225, 275) 22 [825, 875) 2
[275, 325) 8 [875, 925) 0
[325, 375) 10 [925, 975) 0
[375, 425) 7 [975, 1025) 0
[425, 475) 6 [1025, 1075) 3
[475, 525) 3 [1075, 1125) 0
[525, 575) 6 [1125, 1175) 1
This is much better! We can see, for example, that 172 individuals (slightly over half
those surveyed) spend less than £125.
There is some arbitrariness in the grouping used and the choice of classes is often down
to common sense, but as a guide:
there should be between 5 and 25 classes
each piece of data should belong to one and only one class
in general, all classes should have the same width (but we can sometimes have
open-ended classes at the extreme, such as < 0 or > 1000).
Some terms associated with grouping data are the following.
Classes are categories for grouping data.
The frequency is the number of data values in a class.
The frequency distribution is a listing of classes and their frequencies.
The lower class limit is the smallest value which can go in a class.
The upper class limit is the largest value which can go in a class.
The class mark is the midpoint of a class.
The class width is the difference between the upper and lower class limits for a
class.
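To make the terminology concrete, the following Python sketch groups the ten observations quoted at the start of this section into classes of width 50 starting at −25, matching the classes in the table. It is a minimal illustration, not the procedure used to produce the table; the names class_of, FIRST_LOWER_LIMIT and CLASS_WIDTH are our own.

```python
import math
from collections import Counter

# The ten observations quoted at the start of this section (in pounds).
expenditure = [141.24, -25.00, 82.23, 233.90, 0.00,
               79.50, 0.00, 6.41, 59.63, 102.71]

FIRST_LOWER_LIMIT = -25  # lower end of the first class
CLASS_WIDTH = 50         # every class has the same width


def class_of(value):
    """Return (lower, upper) for the class [lower, upper) which contains value."""
    index = math.floor((value - FIRST_LOWER_LIMIT) / CLASS_WIDTH)
    lower = FIRST_LOWER_LIMIT + index * CLASS_WIDTH
    return (lower, lower + CLASS_WIDTH)


# Frequency distribution: each value falls in one and only one class.
frequencies = Counter(class_of(x) for x in expenditure)

for (lower, upper), count in sorted(frequencies.items()):
    class_mark = (lower + upper) / 2  # midpoint of the class
    print(f"[{lower}, {upper})  frequency = {count}  class mark = {class_mark}")
```

Because each class includes its lower end but not its upper end, a value such as 25.00 would be counted in [25, 75) and not in [−25, 25).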
12.2 Histograms
Diagrams are a particularly useful way of illustrating data as ‘a picture is worth a
thousand words’. We can illustrate the frequency data for the credit cards as shown in
Figure 12.1. From the histogram it is clear that most credit card holders (in the sample)
spend moderate amounts each month, with a few spending large amounts.
[Figure: ‘Histogram of Monthly Credit Card Expenditure’; horizontal axis: Expenditure in pounds (0 to 1200); vertical axis: Frequency.]
Figure 12.1: Histogram of credit card data.
Some points to note are the following.

The height of each bar equals the frequency of the class it represents.
Each ‘bar’ extends from the lower class limit of its class to the lower class limit of
the next class.
The axes are labelled.
The histogram has an informative title.
Histograms are only used for continuous (i.e. interval or ratio) data.
It is also often useful to calculate and tabulate cumulative frequencies, that is,
counting frequencies up to and including a given class, as follows.
Cumulative Cumulative
Expenditure Frequency frequency Expenditure Frequency frequency
[−25, 25) 87 87 [575, 625) 3 284
[25, 75) 55 142 [625, 675) 2 286
[75, 125) 30 172 [675, 725) 3 289
[125, 175) 24 196 [725, 775) 1 290
[175, 225) 23 219 [775, 825) 4 294
[225, 275) 22 241 [825, 875) 2 296
[275, 325) 8 249 [875, 925) 0 296
[325, 375) 10 259 [925, 975) 0 296
[375, 425) 7 266 [975, 1025) 0 296
[425, 475) 6 272 [1025, 1075) 3 299
[475, 525) 3 275 [1075, 1125) 0 299
[525, 575) 6 281 [1125, 1175) 1 300
Now we can quickly see, say, that just under two-thirds of credit card holders spend less
than £175 on a monthly basis.
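The cumulative frequencies are simply running totals of the class frequencies. As a minimal sketch (the variable names are our own), they can be reproduced in Python as follows.

```python
from itertools import accumulate

# Class frequencies for the credit card data, in class order
# ([-25, 25), [25, 75), ..., [1125, 1175)).
frequencies = [87, 55, 30, 24, 23, 22, 8, 10, 7, 6, 3, 6,
               3, 2, 3, 1, 4, 2, 0, 0, 0, 3, 0, 1]

# Running total of the frequencies up to and including each class.
cumulative = list(accumulate(frequencies))

print(cumulative[:4])  # [87, 142, 172, 196]
print(cumulative[-1])  # 300

# Proportion of individuals spending less than 175 pounds (first four classes).
print(cumulative[3] / cumulative[-1])
```

The final cumulative frequency must always equal the total number of observations, which is a quick check on the arithmetic.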
Having determined the cumulative frequencies, we can construct a cumulative
frequency polygon. The horizontal axis is labelled with the class endpoints and the
vertical axis with the cumulative frequencies. A point of zero frequency is placed at the
beginning of the first class and a point is plotted at the end of each class interval for the
cumulative value. The points are then joined up and, for the credit card data, we get
Figure 12.2.
[Figure: ‘Cumulative Frequency Polygon of Monthly Credit Card Expenditure’; horizontal axis: Expenditure in pounds (0 to 1200); vertical axis: Cumulative frequency (0 to 300).]
Figure 12.2: Cumulative frequency polygon of credit card data.
So, for example, from the graph, we can see that only about 16 of the 300 credit card
holders spend more than £600 in a month.
Next we look at some other types of graphical display. Recall that the type of diagram
used will depend on the type of data, and the objective of any diagram is to illustrate
the key features of the dataset.
Histograms (and some other forms of diagrams) are suitable for (univariate) interval or
ratio data. For categorical data, other alternatives are more appropriate.
Activity 12.1 At a university computing centre, the daily numbers of computer
stoppages due to machine malfunctions were recorded for a period of 70 successive
working days and the following data obtained.
0 1 0 8 2 5 0 0 0 0
0 4 3 3 3 0 0 6 0 2
0 3 1 1 0 1 0 1 1 0
2 2 0 0 0 17 1 2 1 2
0 1 6 4 3 3 1 2 4 0
0 3 15 2 0 0 0 0 0 1
1 0 2 0 2 4 4 0 2 2
(a) Produce a cumulative frequency polygon for these data.
(b) Assuming a year consists of 255 working days, on how many days would you
expect 5 or more stoppages to occur?
(c) Discuss in a couple of sentences what the data tell you. What recommendations
would you make?
12.3 Pie charts and bar graphs

These two familiar diagrams will often be the methods of choice for categorical data.
Both will quickly give the observer essential features of the data in a way the raw data
cannot. Consider information on toothpaste sales in $000s for the 10 top brands in the
US in a recent year.
Brand Sales Brand Sales

Crest 370,437 Rembrandt 52,067
Colgate 321,084 Sensodyne 50,133
Aquafresh 177,989 Listerin 40,107
Mentadent 170,630 Closeup 32,009
Arm & Hammer 109,512 Ultrabrite 25,358
We can represent this dataset using a pie chart, as in Figure 12.3.
[Figure: ‘Pie chart of Toothpaste Sales in $000s in the US’; slices: Crest 27%, Colgate 24%, Aquafresh 13%, Mentadent 13%, Arm & Hammer 8%, Rembrandt 4%, Sensodyne 4%, Listerin 3%, Closeup 2%, Ultrabrite 2%.]
Figure 12.3: Pie chart of toothpaste sales data.
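The size of each slice in a pie chart is the category's share of the total. The Python sketch below (with variable names of our own choosing) computes these shares from the sales table, rounded to the nearest whole percentage as in the chart labels.

```python
# Sales in $000s, taken from the toothpaste sales table.
sales = {
    "Crest": 370_437, "Colgate": 321_084, "Aquafresh": 177_989,
    "Mentadent": 170_630, "Arm & Hammer": 109_512, "Rembrandt": 52_067,
    "Sensodyne": 50_133, "Listerin": 40_107, "Closeup": 32_009,
    "Ultrabrite": 25_358,
}

total = sum(sales.values())

# Each brand's share of the total determines the size of its slice.
shares = {brand: round(100 * value / total) for brand, value in sales.items()}

for brand, share in shares.items():
    print(f"{brand}: {share}%")
```

Rounded percentages need not add up to exactly 100 in general, although they happen to do so for these data.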
Alternatively, we can construct a bar chart, as in Figure 12.4. It is similar to a
histogram except that the bars are separated.
[Figure: ‘Bar chart of Toothpaste Sales in $000s in the US’; one bar per brand (Crest, Colgate, Aquafresh, Mentadent, A&H, Sensodyne, Rembrandt, Listerin, Closeup, Ultrabrite); vertical axis: sales from 50000 to 350000.]
Figure 12.4: Bar chart of toothpaste sales data.
12.4 Line graphs

Histograms, pie charts and bar charts are only a few of the many ways to portray data
visually. Fortunately, most displays are based on common sense and are easy to
understand. Line graphs are an additional method, but they should generally only be
used for time series data, where the horizontal axis represents time. Let us consider
the sales of a commodity recorded at three-monthly (seasonal) intervals as follows.
Year Season Sales Year Season Sales

1 Spring 8.3 3 Spring 9.5
1 Summer 13.1 3 Summer 14.3
1 Autumn 9.2 3 Autumn 10.4
1 Winter 6.1 3 Winter 7.1
2 Spring 8.9 4 Spring 10.1
2 Summer 13.7 4 Summer 14.9
2 Autumn 9.8 4 Autumn 11.1
2 Winter 6.6 4 Winter 7.4
The line graph of this dataset is shown in Figure 12.5. Note the clear ‘seasonal variation’
and small, but probably significant, upward trend. (What might the commodity be?)
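The two features noted here can also be checked numerically. The Python sketch below is an informal check of our own, not a method introduced in this unit: it totals the sales within each year (to look at the trend) and averages each season across the four years (to look at the seasonal variation).

```python
# Quarterly sales from the table, in time order (Spring of year 1 first).
seasons = ["Spring", "Summer", "Autumn", "Winter"]
sales = [8.3, 13.1, 9.2, 6.1,    # year 1
         8.9, 13.7, 9.8, 6.6,    # year 2
         9.5, 14.3, 10.4, 7.1,   # year 3
         10.1, 14.9, 11.1, 7.4]  # year 4

# Trend: total sales in each year.
annual_totals = [round(sum(sales[i:i + 4]), 1) for i in range(0, len(sales), 4)]

# Seasonal variation: average sales for each season across the four years.
seasonal_means = {
    season: round(sum(sales[k::4]) / 4, 2) for k, season in enumerate(seasons)
}

print(annual_totals)  # [36.7, 39.0, 41.3, 43.5]
print(seasonal_means)
```

The annual totals rise every year, while the seasonal averages are highest in summer and lowest in winter.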
[Figure: ‘Commodity Sales by Season’; horizontal axis: Season number; vertical axis: Sales.]
Figure 12.5: Line graph of commodity sales data.
12.5 Scatter plots

Scatter plots are used to illustrate the association between bivariate data points. For
example, we might have data on the salary and age of a number of employees of a
company, as depicted in Figure 12.6. Think about what this scatter plot tells us about
the relationship between salary and age. (Note the anomalous point is called an outlier
– more on this later in the unit.)
[Figure: ‘Scatter plot of Salary against Age’; horizontal axis: Age (20 to 60); vertical axis: Salary (in £000s).]
Figure 12.6: Scatter plot of ‘Salary’ against ‘Age’.
We now consider a slightly more elaborate example which illustrates the potential
power of relatively simple descriptive statistics. Assume we have data for advertising
and sales (both in £ millions) for 60 companies of similar size in a given year. Each
company is in one of three sectors: A, B or C. How is advertising related to sales? First,
let us look at the data.
Advertising Sales Sector Advertising Sales Sector Advertising Sales Sector
38 77 A 66 77 B 93 67 C
10 57 A 43 71 B 86 68 C
60 65 A 54 73 B 20 47 C
80 77 A 46 74 B 10 43 C
68 73 A 6 29 B 37 49 C
86 55 A 25 64 B 91 87 C
1 63 A 87 30 B 89 88 C
41 77 A 59 53 B 68 66 C
86 70 A 80 26 B 7 32 C
14 76 A 31 49 B 35 44 C
25 54 A 10 18 B 42 50 C
5 49 A 94 26 B 21 40 C
3 72 A 68 68 B 28 42 C
16 84 A 41 67 B 77 77 C
22 76 A 69 72 B 53 60 C
2 63 A 6 20 B 30 39 C
29 76 A 93 24 B 24 37 C
34 77 A 3 19 B 95 91 C
55 71 A 34 47 B 84 80 C
36 92 A 100 20 B 66 75 C
[Figure: ‘Scatter plot of Sales against Advertising for 60 companies’; horizontal axis: Advertising (in £ millions); vertical axis: Sales (in £ millions).]
Figure 12.7: Scatter plot of ‘Sales’ against ‘Advertising’ for 60 companies.
Clearly, it is very difficult to say anything interesting about the dataset by looking at
the raw data in a table. So, first we plot sales against advertising while ignoring the
sector. The scatter plot is shown in Figure 12.7 and this suggests increasing advertising
may lead to higher sales, but it is not very clear.
Suppose we produce scatter plots for each sector separately. These are shown in Figure
12.8. Advertising appears to have no effect on sales in Sector A. In Sector B, advertising
appears to have an increasing effect on sales up to a point, after which it has a
decreasing effect. This quite often happens – the market has become saturated, or the
advertising campaign becomes less effective. Finally, advertising appears to have a
steadily increasing effect on sales in Sector C.
[Figure: three scatter plots of Sales against Advertising (in £ millions), with one panel each for Sector A, Sector B and Sector C.]
Figure 12.8: Scatter plot of ‘Sales’ against ‘Advertising’ for 60 companies, by sector.
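The sector-by-sector patterns can also be seen in a rough numerical summary. In the Python sketch below, the companies in each sector are split into three advertising bands and the average sales in each band are compared. The cut-offs (below 34, 34 to 66, above 66) and the function names are our own illustrative assumptions, not part of the analysis described in this unit.

```python
# (advertising, sales) pairs in £ millions for each sector, from the data table.
data = {
    "A": [(38, 77), (10, 57), (60, 65), (80, 77), (68, 73), (86, 55), (1, 63),
          (41, 77), (86, 70), (14, 76), (25, 54), (5, 49), (3, 72), (16, 84),
          (22, 76), (2, 63), (29, 76), (34, 77), (55, 71), (36, 92)],
    "B": [(66, 77), (43, 71), (54, 73), (46, 74), (6, 29), (25, 64), (87, 30),
          (59, 53), (80, 26), (31, 49), (10, 18), (94, 26), (68, 68), (41, 67),
          (69, 72), (6, 20), (93, 24), (3, 19), (34, 47), (100, 20)],
    "C": [(93, 67), (86, 68), (20, 47), (10, 43), (37, 49), (91, 87), (89, 88),
          (68, 66), (7, 32), (35, 44), (42, 50), (21, 40), (28, 42), (77, 77),
          (53, 60), (30, 39), (24, 37), (95, 91), (84, 80), (66, 75)],
}


def band(advertising):
    """Illustrative three-way split of advertising spend (our own cut-offs)."""
    if advertising < 34:
        return "low"
    if advertising <= 66:
        return "middle"
    return "high"


def mean_sales_by_band(pairs):
    """Average sales within each advertising band."""
    grouped = {"low": [], "middle": [], "high": []}
    for advertising, sales in pairs:
        grouped[band(advertising)].append(sales)
    return {name: round(sum(v) / len(v), 1) for name, v in grouped.items()}


summary = {sector: mean_sales_by_band(pairs) for sector, pairs in data.items()}
for sector, means in summary.items():
    print(sector, means)
```

With these cut-offs, average sales in Sector B are highest in the middle band, while in Sector C they rise from band to band. A different choice of bands would give different numbers, so the scatter plots remain the better guide.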
Now consider data on sales, in thousands of units, of a small electronics firm over 10
years.
Year 1 2 3 4 5 6 7 8 9 10
Sales 2.51 2.72 3.22 3.19 4.09 4.76 5.23 6.36 7.28 9.28
What can we deduce? First, we plot the data as shown in Figure 12.9.
The data appear to be increasing exponentially (literally, i.e. according to a law of the
general form $y = a + b e^{cx}$ for some constants $a$, $b$ and $c$). Note the precise use of the
word ‘exponentially’!
However, perhaps the data points are increasing according to a quadratic, rather than
an exponential, law, so we would be better looking for a relation of the general form
$y = a + bx + cx^2$. Statistical modelling can be used to determine the curve best fitting a
set of data, according to some criterion. In Unit 19 we consider how to find the best
fitting line using a technique called ‘linear regression’.
[Figure: ‘Annual sales for a small electronics firm’; horizontal axis: Year (2 to 10); vertical axis: Sales (in 000s).]
Figure 12.9: Scatter plot of ‘Sales’ against ‘Time’ for a small electronics firm.
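Before fitting any curve, a quick informal check (our own suggestion, not a method from this unit) is to look at year-on-year differences and ratios of the sales figures. Under exactly linear growth the differences would be constant; under exactly exponential growth of the form $y = b e^{cx}$ (that is, with $a = 0$) the ratios would be constant. The Python sketch below computes both.

```python
# Annual sales (thousands of units) for years 1 to 10, from the table.
sales = [2.51, 2.72, 3.22, 3.19, 4.09, 4.76, 5.23, 6.36, 7.28, 9.28]

# Year-on-year differences: roughly constant under linear growth.
differences = [round(b - a, 2) for a, b in zip(sales, sales[1:])]

# Year-on-year ratios: roughly constant under exponential growth with a = 0.
ratios = [round(b / a, 2) for a, b in zip(sales, sales[1:])]

print(differences)
print(ratios)
```

The differences tend to grow over time, which argues against a straight line, but a check of this kind cannot by itself distinguish an exponential law from a quadratic one.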
12.6 Summary
This unit has looked at different ways of presenting data visually. Which type of
diagram is most appropriate will depend on the type of data being analysed. You
should be able to interpret any important features which are apparent from a diagram.
12.7 Key terms and concepts

Bar chart Classes
Cumulative frequencies Cumulative frequency polygon
Distribution Frequency
Frequency distribution Line graph
Outlier Pie chart
Scatter plot Time series
Learning outcomes
interpret and summarise raw data on social science variables graphically
distinguish between univariate and bivariate situations
distinguish between categorical and continuous (including time series) variables
Exercises
Exercise 12.1
A pie chart is most suitable for a variable measured using which of the following scales:
(a) nominal scale, (b) ordinal scale, or (c) interval scale? What about a bar chart?
What about a histogram?
Exercise 12.2
Name one possible advantage and one possible disadvantage of histograms.
Exercise 12.3
The table below gives the numbers of people killed or seriously injured in the UK for
different categories of road user during 1982 and 1984. These two years, 1982 and 1984,
represent a complete year before and a complete year after the introduction of the seat
belt law.
1982 1984
Car drivers 19,460 16,421
Front seat passengers 9,458 7,047
Rear seat passengers 4,706 5,062
Pedestrians 18,963 19,168
Cyclists 5,967 6,506
(a) What is the percentage change in the number of people killed or seriously injured
for each category of road user between 1982 and 1984?
(b) What was the percentage of car drivers and car front seat passengers killed or
seriously injured, out of all cases, each year?
(c) Write a brief commentary on your findings (a few sentences), with any suggestions
as to additional information you would require for a fuller investigation as to why
there were percentage changes.
Exercise 12.4
The following table shows the weekly visits for five health, fitness or nutrition websites.
Display the data using a suitable graph and comment on the results, giving possible
reasons for any trends which you notice.
Site               Visitors in April 2017   Visitors in April 2018
eDiets                    472,000                  936,000
Weight Watchers           445,000                  876,000
WebMD                     524,000                  853,000
AOL Health                448,000                  713,000
Yahoo! Health             396,000                  590,000
13. Data exploration III – Descriptive statistics: measures of location, dispersion and skewness
Unit 13: Data exploration III

Descriptive statistics: measures of
location, dispersion and skewness
Overview
Although data visualisation is useful to get a ‘feel’ for the data, in practice we also need
to be able to summarise data numerically. This unit introduces descriptive statistics and
distinguishes between measures of location, measures of dispersion and skewness. All
these statistics provide useful summaries of raw datasets.
Aims
This unit introduces and explains the importance of descriptive statistics. Particular
aims are:
to calculate simple numbers which will summarise the most important
characteristics of a dataset
to explain the use and limitations of various descriptive statistics.
Background reading
(Palgrave, 2014) fourth edition [ISBN 9781137376558] ‘Describing data’ Chapter 2.
13.1 Summation notation

Very often in statistics we need to add up a set of numbers. Here we introduce the
notation which statisticians use to describe the sum of some numbers. By using this
notation we are able to write many things more concisely, and hence they become easier
to read. Let us begin with $N$ numbers, denoted as:
\[ x_1, x_2, \ldots, x_N. \]
Here, $x_1$ is the first number, $x_2$ is the second number, and so on with $x_N$ being the last
number in the dataset. For example, if the numbers are 7, 4, 12 and 6, we write:
\[ x_1 = 7, \quad x_2 = 4, \quad x_3 = 12 \quad \text{and} \quad x_4 = 6. \]
Suppose we want to add up these $N$ numbers, i.e. we want to find:
\[ x_1 + x_2 + x_3 + \cdots + x_N. \]
Summation operator
To shorten this we use the symbol $\sum$ (known as the summation operator),
writing:
\[ \sum_{i=1}^{N} x_i = x_1 + x_2 + x_3 + \cdots + x_N. \]
We can ‘translate’ this notation as follows – ‘the sum of the values, whose typical
member is $x_i$, beginning with the number $x_1$ and ending with the number $x_N$’. So, using
the above example, if $x_1 = 7$, $x_2 = 4$, $x_3 = 12$ and $x_4 = 6$, we have:
\[ \sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4 = 7 + 4 + 12 + 6 = 29. \]
As you might expect, it is possible to write down other expressions involving $\sum$. For
example, we might be interested in the sum of the squares of the values
$x_1, x_2, x_3, \ldots, x_N$, which would be written as:
\[ \sum_{i=1}^{N} x_i^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_N^2. \]
Quite often the value of $N$ will be clear and in such cases it is common to write simply
$\sum x_i$ instead of $\sum_{i=1}^{N} x_i$. With practice, using the summation operator should not pose
any difficulties. However, it is essential that you properly understand its interpretation
since the summation operator is used extensively in many areas of statistics.
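Example 13.1 Suppose $x_1 = 1$, $x_2 = 2$, $x_3 = 3$, $y_1 = 4$, $y_2 = 5$ and $y_3 = 6$. We then
have the following.
\[ \sum_{i=1}^{3} x_i = 1 + 2 + 3 = 6 \]
\[ \sum_{i=1}^{3} y_i = 4 + 5 + 6 = 15 \]
\[ 3 \sum_{i=1}^{3} x_i = 3 \times 6 = 18 \]
\[ \sum_{i=1}^{3} 3x_i = 3 + 6 + 9 = 18 \]
\[ \sum_{i=1}^{3} x_i + \sum_{i=1}^{3} y_i = 6 + 15 = 21 \]
\[ \sum_{i=1}^{3} (x_i + y_i) = (1 + 4) + (2 + 5) + (3 + 6) = 21 \]
\[ \sum_{i=1}^{3} x_i^2 = 1^2 + 2^2 + 3^2 = 14 \]
\[ \left( \sum_{i=1}^{3} x_i \right)^2 = 6^2 = 36 \]
\[ \sum_{i=1}^{3} x_i y_i = (1 \times 4) + (2 \times 5) + (3 \times 6) = 32 \]
\[ \sum_{i=1}^{3} x_i \sum_{i=1}^{3} y_i = 6 \times 15 = 90 \]
\[ \sum_{i=1}^{3} \left( x_i^3 + \frac{60}{y_i} \right) = \left( 1^3 + \frac{60}{4} \right) + \left( 2^3 + \frac{60}{5} \right) + \left( 3^3 + \frac{60}{6} \right) = 16 + 20 + 37 = 73 \]
\[ \sum_{i=1}^{3} 8 = 8 + 8 + 8 = 24. \]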
Some key points to note:
We saw that $\sum_{i=1}^{3} x_i + \sum_{i=1}^{3} y_i = 21$ and $\sum_{i=1}^{3} (x_i + y_i) = 21$. It is true, in general, that:
\[ \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} y_i = \sum_{i=1}^{N} (x_i + y_i). \]
We also saw that $3 \sum_{i=1}^{3} x_i = 18$ and $\sum_{i=1}^{3} 3x_i = 18$. It is true, in general, that:
\[ c \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} c x_i \]
whatever the value of the constant c.
We also saw that $\sum_{i=1}^{3} x_i^2 = 14$ and $\left( \sum_{i=1}^{3} x_i \right)^2 = 36$. In general:
\[ \sum_{i=1}^{N} x_i^2 \neq \left( \sum_{i=1}^{N} x_i \right)^2 . \]
We also saw that $\sum_{i=1}^{3} x_i y_i = 32$ and $\sum_{i=1}^{3} x_i \sum_{i=1}^{3} y_i = 90$. In general:
\[ \sum_{i=1}^{N} x_i y_i \neq \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i . \]
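The key points above can be checked numerically. The following is a minimal Python sketch using the values from Example 13.1; the variable names are our own, and a check on one dataset illustrates the rules rather than proving them.

```python
x = [1, 2, 3]
y = [4, 5, 6]
c = 3  # any constant would do

# Rule 1: the sum of x plus the sum of y equals the sum of (x + y).
sum_then_add = sum(x) + sum(y)
add_then_sum = sum(xi + yi for xi, yi in zip(x, y))

# Rule 2: a constant can be taken outside the summation.
constant_outside = c * sum(x)
constant_inside = sum(c * xi for xi in x)

# Not rules: squaring and multiplying do not pass through the summation.
sum_of_squares = sum(xi ** 2 for xi in x)
square_of_sum = sum(x) ** 2
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))
product_of_sums = sum(x) * sum(y)

print(sum_then_add, add_then_sum)          # 21 21
print(constant_outside, constant_inside)   # 18 18
print(sum_of_squares, square_of_sum)       # 14 36
print(sum_of_products, product_of_sums)    # 32 90
```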
Activity 13.1 A dataset contains the observations 1, 1, 1, 2, 4, 8, 9 (so here,
N = 7). Find:
(a) $\sum_{i=1}^{N} 2x_i$
(b) $\sum_{i=1}^{N} x_i^2$
(c) $\sum_{i=1}^{N} (x_i - 2)$
(d) $\sum_{i=1}^{N} (x_i - 2)^2$
(e) $\left( \sum_{i=1}^{N} x_i \right)^2$
(f) $\sum_{i=1}^{N} 2$.
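Activity 13.2 Can you explain why:
\[ \sum_{i=1}^{N} x_i^2 \neq \left( \sum_{i=1}^{N} x_i \right)^2 \quad \text{and} \quad \sum_{i=1}^{N} x_i y_i \neq \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i \, ? \]
What are $\left( \sum_{i=1}^{N} x_i \right)^2$ and $\sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i$? (Consider the case where N = 2.)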
13.2 Measures of location

We now consider some ways of summarising a dataset numerically, rather than visually.
Although the graphical methods presented earlier are extremely useful for getting a
‘feel’ for and organising the data, they lack precision. Measures of location (also
called measures of central tendency) are statistics which provide a typical value for a
collection of numbers. Clearly, such a typical, or average, value is an important property
of a dataset. We shall encounter three ways of defining the ‘average’. In general, the
average is a value typical, or representative, of a dataset.
The (arithmetic) mean is the sum of all the members of a dataset divided by the
number of values in the dataset. Sometimes the mean is denoted by µ (pronounced
‘mew’) and sometimes it is denoted by x̄ (pronounced ‘x-bar’). The distinction between
µ and x̄ is very important. µ refers to the mean of a population, whereas x̄ refers to the
mean of a sample. For now we shall not concern ourselves too much about this
distinction – we shall return to it when we cover ‘sampling distributions’ in Unit 16.
(Arithmetic) sample mean
Suppose we have a sample of n values, x1 , x2 , . . . , xn . Using the summation operator,
we write the sample mean mathematically as:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}. \]
So, for example, if a student scored 62, 74, 49, 37 and 58 in a sample of five tests, the
mean mark achieved is:
\[ \frac{62 + 74 + 49 + 37 + 58}{5} = \frac{280}{5} = 56. \]
Another measure of location is the median. This is the central value of the dataset
when the numbers are arranged in ascending order. If no single such central value exists
(this occurs when there is an even number of values), then the mean of the two middle
numbers is taken.
For 1, 3, 5, 8, 12, the median is 5.
For 2, 5, 10, 3, 7, 6, we first arrange the values in order to give 2, 3, 5, 6, 7, 10,

which gives a median of (5 + 6)/2 = 5.5.
The mode of a set of numbers is the most frequently-occurring value. In some cases it
may not exist, or indeed it may not be unique.
The mode of 3, 3, 3, 3, 3, 7, 7, 8, 9, 9 is 3.
The set of values 2, 5, 10, 13, 16 has no mode.
The set of values 4, 4, 30, 50, 50, 90 is bimodal – there are two modes, 4 and 50.
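For readers who wish to check such calculations by computer, the three measures of location can be sketched with Python's standard `statistics` module. This is only an illustration using the examples above. Note one difference in convention: when every value occurs exactly once, `statistics.multimode` returns all of the values, whereas the guide says that such a dataset has no mode.

```python
import statistics

# Mean of the five test marks.
marks = [62, 74, 49, 37, 58]
mean_mark = statistics.mean(marks)                        # 280 / 5 = 56

# Median: the data need not be sorted beforehand.
median_odd = statistics.median([1, 3, 5, 8, 12])          # 5
median_even = statistics.median([2, 5, 10, 3, 7, 6])      # (5 + 6) / 2 = 5.5

# Mode(s): multimode returns a list of the most frequent values.
one_mode = statistics.multimode([3, 3, 3, 3, 3, 7, 7, 8, 9, 9])
two_modes = statistics.multimode([4, 4, 30, 50, 50, 90])
all_tied = statistics.multimode([2, 5, 10, 13, 16])

print(mean_mark, median_odd, median_even)
print(one_mode, two_modes, all_tied)
```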
Sometimes datasets have extreme observations, called outliers (a more formal
definition is provided shortly). By construction, the median and mode are ‘insensitive’
or resistant to outliers: outliers lie at the end(s) of the ordered dataset (hence do not
affect the median) and typically occur only once (hence are not modes). However,
outliers may have a large influence on the mean and hence give a misleading value.
One possible remedy is to calculate the trimmed mean. This involves dropping k
observations (where k is a number, typically 1 or 2) from each end of the ordered
dataset and calculating the mean of the remaining observations.
Trimmed (sample) mean
With ordered values x(1) , x(2) , . . . , x(n) , where x(i) indicates the ith ordered value, the
trimmed (sample) mean, denoted $\bar{x}_{tr}$, is:
\[ \bar{x}_{tr} = \frac{\sum_{i=k+1}^{n-k} x_{(i)}}{n - 2k}. \]
So the trimmed mean may be useful if we are concerned about extreme values being
present in the dataset.
For example, suppose the ordered dataset is 1, 32, 37, 38, 41, 192. Clearly, the largest
value is extreme and to a lesser extent the smallest value is too, so we set k = 1 and
compute the trimmed mean to be:
\[ \bar{x}_{tr} = \frac{\sum_{i=k+1}^{n-k} x_{(i)}}{n - 2k} = \frac{32 + 37 + 38 + 41}{6 - 2} = 37. \]
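The trimmed mean calculation can be sketched in Python as follows. The function name `trimmed_mean` is our own (hypothetical) choice; the function sorts the data, drops k observations from each end and averages what remains, exactly as in the definition above.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    n = len(values)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n / 2")
    ordered = sorted(values)
    kept = ordered[k:n - k]          # x(k+1), ..., x(n-k)
    return sum(kept) / (n - 2 * k)


data = [1, 32, 37, 38, 41, 192]
trimmed = trimmed_mean(data, 1)      # (32 + 37 + 38 + 41) / 4
untrimmed = sum(data) / len(data)    # 341 / 6, pulled upwards by 192

print(trimmed)                       # 37.0
print(round(untrimmed, 2))           # 56.83
```

Comparing the two results shows how strongly the single value 192 affects the ordinary mean.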
13.2.1 Which ‘average’ should be used?

Given these three measures of location (mean, median and mode), a natural question to
ask is ‘which one should we use?’. The mean is usually the preferred choice but, due to
its sensitivity to outliers, it is not always appropriate. If a cricketer scored 15, 4, 0, 9,
148, 2, 0, 3, 6 runs over nine innings the cricketer is probably not very good – the high
score of 148 may be attributable to a weak opposition. In this case, the median, 4, may
be more representative than the mean, 20.8. Similarly, if 25 employees at a small
company have annual salaries between £20,000 and £50,000, with a single salary of
£200,000 for the managing director, again the median may better reflect typical salaries
than the mean.
In most cases the mean is the most common ‘average’ and is frequently used in many
statistical applications. One positive feature of the mean is that it uses all the data
points. However, as we have already seen, the mean is sensitive to extreme values,
unlike the median and mode. Therefore, whenever extreme points exist, the median or
trimmed mean should be considered instead.
The mode is particularly useful when the data values represent categories. For example,
if the values 6, 7, 8, . . . are the sizes of shoes sold in a shop, the mode gives the typical
size of shoe sold.
Activity 13.3 Consider again the data on computer stoppages in Activity 12.1.
(a) Compute the mean, median, mode and a suitable trimmed mean for these data.
(b) Discuss the relative advantages of each of these measures.
13.2.2 Frequency tables

Sometimes we may have data in the form of a frequency table, such as:
Observation, xi    Frequency, fi
2                  4
3                  2
4                  3
5                  1
This corresponds to the ordered dataset: 2, 2, 2, 2, 3, 3, 4, 4, 4, 5. So the mean is:
\[ \frac{(4 \times 2) + (2 \times 3) + (3 \times 4) + (1 \times 5)}{4 + 2 + 3 + 1} = 3.1. \]
This leads to the following more general result. If the numbers x1 , x2 , x3 , . . . , xk occur
with respective frequencies f1 , f2 , f3 , . . . , fk , then:
\[ \bar{x} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots + f_k x_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}. \tag{13.1} \]
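Formula (13.1) can be sketched in Python as below, using the small frequency table above. The function name `frequency_mean` is our own, and the second calculation simply confirms that expanding the table back into the raw observations gives the same mean.

```python
def frequency_mean(values, frequencies):
    """Mean from a frequency table: sum of f_i * x_i divided by sum of f_i."""
    weighted_total = sum(f * x for x, f in zip(values, frequencies))
    return weighted_total / sum(frequencies)


values = [2, 3, 4, 5]
frequencies = [4, 2, 3, 1]

table_mean = frequency_mean(values, frequencies)

# Expand the table into the dataset 2, 2, 2, 2, 3, 3, 4, 4, 4, 5.
raw = [x for x, f in zip(values, frequencies) for _ in range(f)]
raw_mean = sum(raw) / len(raw)

print(table_mean, raw_mean)  # 3.1 3.1
```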
Let us now re-visit the credit card data from Unit 12. The classes are −25 to 25, 25 to
75 etc. and we can use the frequency table to estimate the mean. You should appreciate
that some information is lost when the table is constructed, as we no longer have the
raw (original) data. Consequently, if we calculate the mean using (13.1) we should
expect to lose some precision, but we still expect a reasonable estimate. So we
face a trade-off – although we lose some precision (a disadvantage), we have the
convenience of summarising the data in a frequency table (an advantage). Our
frequency table is:
Expenditure    Number of individuals    Midpoint
[−25, 25)      87                       0
[25, 75)       55                       50
[75, 125)      30                       100
[125, 175)     24                       150
...            ...                      ...
In the table above, the ‘Midpoint’ column is simply the centre value of the interval in
the ‘Expenditure’ column and we take this to be the expenditure value for each class.
Now we are able to estimate the mean using (13.1). The estimate is:
\[ \frac{(87 \times 0) + (55 \times 50) + (30 \times 100) + (24 \times 150) + \cdots + (1 \times 1150)}{87 + 55 + 30 + 24 + \cdots + 1} = £166.94. \]
Compare this with the ‘true’ mean calculated from the ungrouped data (not provided),
which is £168.30. As expected, grouping the data loses some precision, but nevertheless
we see the mean estimate is close to the true mean.
13.3 Measures of dispersion

Consider the two sets of numbers:
0, 1, 5, 8, 9, 19 and 6.8, 6.9, 6.9, 7.0, 7.1, 7.3.
Both have a mean of 7, but the datasets are clearly very different. The second dataset is
more ‘compact’ while the first dataset is more ‘spread out’. Knowing that two datasets
have the same mean is not sufficient to fully describe them, as the mean cannot
distinguish between differences in the spread of the data. So we seek precise ways of
measuring spread, or dispersion. Just as there are several measures of location, there are
also several measures of dispersion.
We begin with the range, which is defined as the difference between the maximum and
minimum observations.
For the dataset 0, 1, 5, 8, 9, 19, the range is 19 − 0 = 19.
For the dataset 6.8, 6.9, 6.9, 7.0, 7.1, 7.3, the range is 7.3 − 6.8 = 0.5.
The first dataset has a much larger range owing to the greater dispersion.
An alternative is to divide up a dataset into quartiles. In fact, we have already met

one quartile – the median. Recall the median splits up a dataset into the bottom 50% of
values and the top 50%. However, we can also divide a dataset into four equal parts. The
first quartile, denoted Q1 , is the value which splits the bottom 25% of observations from
the top 75%; the second quartile, denoted Q2 , is simply the median; the third quartile,
denoted Q3 , is the value which splits the bottom 75% of observations from the top 25%.
Example 13.2 The marks for 20 students in an introductory statistics class are:
88 67 64 76 86 85 82 39 75 34
90 63 89 89 84 81 96 100 70 96
We first arrange the data in ascending order:
34 39 63 64 67 | 70 75 76 81 82 |
84 85 86 88 89 | 89 90 96 96 100
Hence Q1 = 68.5, Q2 = 83 and Q3 = 89, obtained by taking the average of the
numbers either side of each ‘|’.
Finding these quartiles posed no great difficulties here because the number of
observations, 20, is divisible by 4. When the number of observations is not divisible by 4
things can become slightly more complicated, although for our purposes it will suffice to
take the average of the values either side of where each quartile is located. However, in
practice most datasets are large which means the differences between alternative
methods which exist become negligible.
Analogously to quartiles, it is possible to divide datasets into deciles (10 equal parts) or
even percentiles (100 equal parts). For example, we can express the median as Q2 , the
5th decile or even the 50th percentile. However, we shall not consider deciles and
percentiles any further.
Having introduced quartiles, we are now in a position to discuss another measure of
dispersion – the interquartile range (IQR).
Interquartile range
We define the IQR as the difference between the third and first quartiles, that is:
IQR = Q3 − Q1 .
So for the dataset 0, 1, 5, 8, 9, 19 the median is 6.5. Q1 lies somewhere between 0

and 1 which, taking the midpoint (average) for simplicity, is 0.5. Similarly Q3 lies
somewhere between 8 and 9 which, again taking the midpoint (average) for
simplicity, is 8.5. So the IQR is 8.5 − 0.5 = 8.
For the dataset 6.8, 6.9, 6.9, 7.0, 7.1, 7.3, the quartiles are similarly estimated to be
Q1 = 6.85, Q2 = 6.95 and Q3 = 7.05. Hence the IQR is 7.05 − 6.85 = 0.2.
As we found with the range, the first dataset has a greater IQR reflecting the
greater dispersion in the dataset.
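The quartile and IQR calculations above can be sketched in Python. The helper `simple_quartile` is hypothetical: it encodes our reading of the guide's simplified rule (average the two ordered values either side of position n/4 for Q1, or 3n/4 for Q3), which reproduces the worked examples here. Statistical software generally uses other conventions, so results from, say, `statistics.quantiles` may differ slightly.

```python
import statistics


def simple_quartile(values, k):
    """Q1 (k = 1) or Q3 (k = 3): average of the two ordered values either
    side of position k * n / 4 (an assumed reading of the guide's rule)."""
    if k not in (1, 3):
        raise ValueError("k must be 1 or 3; use statistics.median for Q2")
    ordered = sorted(values)
    n = len(ordered)
    if n < 4:
        raise ValueError("need at least four observations")
    lower = (k * n) // 4   # 1-based position of the lower of the two values
    return (ordered[lower - 1] + ordered[lower]) / 2


marks = [88, 67, 64, 76, 86, 85, 82, 39, 75, 34,
         90, 63, 89, 89, 84, 81, 96, 100, 70, 96]
q1 = simple_quartile(marks, 1)   # average of 67 and 70
q2 = statistics.median(marks)    # average of 82 and 84
q3 = simple_quartile(marks, 3)   # average of 89 and 89
print(q1, q2, q3, q3 - q1)       # 68.5 83.0 89.0 20.5

small = [0, 1, 5, 8, 9, 19]
small_iqr = simple_quartile(small, 3) - simple_quartile(small, 1)
print(small_iqr)                 # 8.5 - 0.5 = 8.0
```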
5-number summary
The 5-number summary provides a useful set of features for a distribution. It is

given by:
x(1) , Q1 , Q2 , Q3 , x(n)
where x(1) and x(n) denote the smallest and largest observations, respectively (which
are not necessarily the first and last observations, denoted x1 and xn , respectively).
We are now able to provide a more formal definition of an outlier. Earlier we described
it as an ‘extreme observation’. Now we define an outlier to be a data value which is
more than 1.5 times the interquartile range above Q3 or below Q1 , that is less than
Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR. Extreme outliers are more than 3 times
the interquartile range above Q3 or below Q1 .
For example, for the dataset 6.8, 6.9, 6.9, 7.0, 7.1, 7.3, we have found Q1 = 6.85,
Q3 = 7.05 and IQR = 0.2. So outliers are any data points which are either less than
6.85 − 1.5 × 0.2 = 6.55, or greater than 7.05 + 1.5 × 0.2 = 7.35. Hence there are no
outliers.
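The outlier rule can be sketched in Python as follows. The function name `outlier_fences` and the `multiplier` parameter are our own; the quartiles are taken as given from the example above rather than recomputed.

```python
def outlier_fences(q1, q3, multiplier=1.5):
    """Return the (lower, upper) limits beyond which values are outliers."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


data = [6.8, 6.9, 6.9, 7.0, 7.1, 7.3]
low, high = outlier_fences(6.85, 7.05)          # Q1 and Q3 from the text
outliers = [x for x in data if x < low or x > high]

print(round(low, 2), round(high, 2))            # 6.55 7.35
print(outliers)                                 # []
```

Setting `multiplier=3` gives the corresponding limits for extreme outliers.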
Another, less often-used, measure of dispersion is the mean absolute deviation
(MAD). For a dataset containing the points xi , for i = 1, 2, . . . , n, we define it to be:
\[ \text{MAD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \]
that is, we use the absolute value of the differences between the observations and the
(sample) mean. Using the absolute value sign gives equal weight to values either side of
the mean. Although it is easy to calculate the MAD, it is used far less in practice than
the more common, and important, measures of dispersion known as the variance and
standard deviation.
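As a brief illustration, the MAD can be sketched in Python and applied to the two datasets introduced at the start of this section. The function name is our own.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the observations from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


first = [0, 1, 5, 8, 9, 19]
second = [6.8, 6.9, 6.9, 7.0, 7.1, 7.3]

mad_first = mean_absolute_deviation(first)     # (7 + 6 + 2 + 1 + 2 + 12) / 6
mad_second = mean_absolute_deviation(second)   # 0.8 / 6

print(mad_first)                # 5.0
print(round(mad_second, 3))     # 0.133
```

As with the range and IQR, the more spread-out first dataset has the larger value.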
13.3.1 Variance and standard deviation

Variance, and its square root the standard deviation, are the most popular measures of
dispersion. Given (population) data values x1 , x2 , . . . , xN , with (population) mean µ,
the variance is defined as:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}. \tag{13.2} \]
We can think of this as the average squared deviation from the mean. Due to the
squared term, data values which are distant from the mean have correspondingly large
values of xi − µ and, therefore, contribute a great deal to the variance, regardless of
whether values lie far above or far below the mean. Similarly, data values which lie close
to the mean (above or below) contribute comparatively little to the variance.
Note the notation for the variance is $\sigma^2$, which is pronounced ‘sigma-squared’. In fact $\sigma^2$
– as defined in (13.2) – is the notation used for the population variance, i.e. when the
data values cover the entire population under consideration. If, instead, the dataset
represents a sample drawn from an underlying population, we refer to the sample
variance, which we denote by $s^2$. Clearly, if we only have sample data, not only is the
population variance, $\sigma^2$, unknown, but so is the population mean, µ.
Sample variance
For a sample of size n, with data values x1 , x2 , . . . , xn , the sample variance is
calculated as:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}. \tag{13.3} \]
Notice that (13.3) is similar to (13.2) except we replace the population mean, µ, with
the sample mean, x̄, and divide by n − 1 instead of N . It should be intuitively clear why
we use x̄ (µ is, of course, unknown). The n − 1 in the denominator is present for reasons
which are beyond the scope of this course.
Consider again the datasets 0, 1, 5, 8, 9, 19 and 6.8, 6.9, 6.9, 7.0, 7.1, 7.3, each with a
mean of 7. We shall treat these as population datasets, hence µ = 7 for each dataset.
The first dataset has deviations about µ of −7, −6, −2, 1, 2 and 12. Therefore, the
squared deviations are 49, 36, 4, 1, 4 and 144, with a sum of 238. The variance,
using (13.2), is then $\sigma^2 = 238/6 = 39.67$ and the standard deviation is
$\sigma = \sqrt{39.67} = 6.30$.
The second dataset has deviations about µ of −0.2, −0.1, −0.1, 0, 0.1 and 0.3.
Therefore, the squared deviations are 0.04, 0.01, 0.01, 0, 0.01 and 0.09, with a sum
of 0.16. The variance, again using (13.2), is then $\sigma^2 = 0.16/6 = 0.027$ and the
standard deviation is $\sigma = \sqrt{0.027} = 0.16$.
As before, the first dataset has a greater variance (and hence standard deviation)
due to the greater dispersion in the dataset.
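These calculations can be checked with Python's standard `statistics` module, as sketched below. `pvariance` and `pstdev` divide by N as in (13.2), whereas `variance` divides by n − 1 as in (13.3); the last line shows what we would obtain if the first dataset were instead treated as a sample.

```python
import statistics

first = [0, 1, 5, 8, 9, 19]
second = [6.8, 6.9, 6.9, 7.0, 7.1, 7.3]

# Population variance and standard deviation, as in (13.2).
pop_var_first = statistics.pvariance(first)     # 238 / 6
pop_sd_first = statistics.pstdev(first)
pop_var_second = statistics.pvariance(second)   # 0.16 / 6
pop_sd_second = statistics.pstdev(second)

# Sample variance, as in (13.3): same numerator, divided by n - 1.
sample_var_first = statistics.variance(first)   # 238 / 5

print(round(pop_var_first, 2), round(pop_sd_first, 2))     # 39.67 6.3
print(round(pop_var_second, 3), round(pop_sd_second, 2))   # 0.027 0.16
print(round(sample_var_first, 1))                          # 47.6
```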
Clearly, using (13.2) for population datasets and (13.3) for sample datasets becomes
onerous when working them out by hand. It can be shown that (13.2) and (13.3) can be
equivalently expressed, respectively, as:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2 \tag{13.4} \]
and:
\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}. \tag{13.5} \]
For example, using the dataset 6.8, 6.9, 6.9, 7.0, 7.1, 7.3 we have:
\[ \sum_{i=1}^{N} x_i^2 = (6.8)^2 + (6.9)^2 + \cdots + (7.3)^2 = 294.16 \]
so the variance, using (13.4), is:
\[ \sigma^2 = \frac{294.16}{6} - 7^2 = 0.027 \]
as before.
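The agreement between the definitional and computational forms can also be checked numerically. The sketch below is a check on one dataset, not a proof; variable names are our own, and a small tolerance is used because computer arithmetic with decimals is not exact.

```python
import math

data = [6.8, 6.9, 6.9, 7.0, 7.1, 7.3]
n = len(data)
mean = sum(data) / n
sum_of_squares = sum(x ** 2 for x in data)                # 294.16

# Population variance: definition (13.2) against shortcut (13.4).
pop_definition = sum((x - mean) ** 2 for x in data) / n
pop_shortcut = sum_of_squares / n - mean ** 2

# Sample variance: definition (13.3) against shortcut (13.5).
sample_definition = sum((x - mean) ** 2 for x in data) / (n - 1)
sample_shortcut = (sum_of_squares - n * mean ** 2) / (n - 1)

print(math.isclose(pop_definition, pop_shortcut, abs_tol=1e-9))        # True
print(math.isclose(sample_definition, sample_shortcut, abs_tol=1e-9))  # True
print(round(pop_shortcut, 3), round(sample_shortcut, 3))               # 0.027 0.032
```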

Activity 13.4 Show that (13.2) and (13.3) can be equivalently expressed as (13.4) and
(13.5), respectively.
13.3.2 Variance using frequency distributions

If data are presented as a frequency distribution with k classes, then the equivalent
forms of (13.2) and (13.4) are:
\[ \sigma^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{k} f_i x_i^2}{N} - \mu^2 . \tag{13.6} \]
For example, suppose we have the following frequency distribution for ages of students.
xi (age)          18   19   20   21   22   23   24   25   26
fi (frequency)     1    5    8   12   10    7    4    1    2
For these data we first find, using (13.1), the (population) mean:
\[ \mu = \frac{(1 \times 18) + (5 \times 19) + (8 \times 20) + \cdots + (2 \times 26)}{1 + 5 + 8 + \cdots + 2} = 21.58. \]
Using the first method in (13.6), the variance is:
\[ \sigma^2 = \frac{1 \times (-3.58)^2 + 5 \times (-2.58)^2 + 8 \times (-1.58)^2 + \cdots + 2 \times (4.42)^2}{1 + 5 + 8 + \cdots + 2} = 3.20 \]
giving a standard deviation of $\sigma = \sqrt{3.20} = 1.79$. Alternatively, we could use the second
expression in (13.6) which gives us:
\[ \frac{(1 \times 18^2) + (5 \times 19^2) + (8 \times 20^2) + \cdots + (2 \times 26^2)}{50} - (21.58)^2 = 3.20. \]
The standard deviation would then be $\sigma = \sqrt{3.20} = 1.79$, as before.
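Both forms of (13.6) can be sketched in Python using the age data above. The variable names are our own, and the two variance calculations agree up to rounding error.

```python
import math

ages = [18, 19, 20, 21, 22, 23, 24, 25, 26]
freqs = [1, 5, 8, 12, 10, 7, 4, 1, 2]

N = sum(freqs)                                                   # 50
mu = sum(f * x for x, f in zip(ages, freqs)) / N                 # 1079 / 50

# First form of (13.6): frequency-weighted squared deviations.
var_deviations = sum(f * (x - mu) ** 2 for x, f in zip(ages, freqs)) / N

# Second form of (13.6): frequency-weighted squares minus the squared mean.
var_shortcut = sum(f * x ** 2 for x, f in zip(ages, freqs)) / N - mu ** 2

sd = math.sqrt(var_deviations)

print(N, round(mu, 2))                                   # 50 21.58
print(round(var_deviations, 2), round(var_shortcut, 2))  # 3.2 3.2
print(round(sd, 2))                                      # 1.79
```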
13.4 Skewness
We conclude our look at descriptive statistics with one further quantity since the mean
and variance, while very useful, do not provide complete information about a dataset.
Skewness of a distribution quantifies the departure from symmetry. By definition a
symmetric distribution has zero skewness. Although various methods exist to quantify
skewness, for this course we shall only be concerned with describing skewness
qualitatively, that is whether the skewness is positive (to the right) or negative (to the
left). This can be achieved by either comparing the mean and median, or visually by
consulting a distribution plot of a dataset, such as a histogram.
Skewness
If the mean is greater than the median, then this indicates a positively-skewed
distribution (also referred to as ‘right-skewed’).
If the mean is less than the median, then this indicates a negatively-skewed
distribution (also referred to as ‘left-skewed’).
If the mean equals the median, then this indicates a symmetric distribution.
In the case of skewed distributions, we have already said that the mean is sensitive to
outliers and so the mean is ‘pulled’ in the direction of the long tail, leading to the above
relationships between the mean and median.
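The mean-versus-median comparison can be sketched in Python as below. The function name `skew_direction` is our own, and the comparison is only the rough qualitative indicator described above, not a formal measure of skewness. The first example reuses the cricketer's scores from Section 13.2.1; the other two datasets are invented purely for illustration.

```python
import statistics


def skew_direction(values):
    """Describe skewness qualitatively by comparing the mean and median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positively-skewed"
    if mean < median:
        return "negatively-skewed"
    return "symmetric"


runs = [15, 4, 0, 9, 148, 2, 0, 3, 6]        # mean 20.8, median 4
print(skew_direction(runs))                  # positively-skewed
print(skew_direction([1, 8, 9, 10]))         # mean 7, median 8.5
print(skew_direction([1, 2, 3, 4, 5]))       # mean 3, median 3
```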
Graphically, skewness can be determined by identifying where the long ‘tail’ of the
distribution lies. If the long tail is heading toward increasingly positive values on the
horizontal axis (i.e. on the right-hand side), then this indicates a positively-skewed
(right-skewed) distribution. Similarly, if the long tail is heading toward increasingly
negative values (i.e. on the left-hand side) then this indicates a negatively-skewed
(left-skewed) distribution, as illustrated in Figure 13.1.
Finally, a boxplot (sometimes called a box-and-whisker plot) is a graph which shows
the 5-number summary as well as any outliers and extreme outliers. Boxplots are useful
for displaying a dataset’s distribution. Unlike histograms, these explicitly depict the
quartiles. From a boxplot it is easy to obtain the following: median, quartiles, IQR,
range, skewness and outliers. An example of a (not-to-scale) boxplot can be seen in
Figure 13.2.
Figure 13.1: Different types of skewed distributions (panels: positively-skewed distribution; negatively-skewed distribution).
Figure 13.2: An example of a boxplot (not to scale). The annotations on the figure, from top to bottom, are:
x  Values more than 3 boxlengths from Q3 (extreme outlier)
o  Values more than 1.5 boxlengths from Q3 (outlier)
Largest observed value that is not an outlier
Q3
Q2 = Median (50% of cases have values within the box)
Q1
Smallest observed value that is not an outlier
o  Values more than 1.5 boxlengths from Q1 (outlier)
x  Values more than 3 boxlengths from Q1 (extreme outlier)
The key features of a boxplot are the following.
The median, Q2 , is the middle line of the ‘box’.
The lower and upper quartiles, Q1 and Q3 , respectively, are represented as the ends
of the box.
‘Whiskers’ are drawn from Q1 and Q3 to the observations furthest from the median
which are not more than 1.5 times the IQR beyond the box (i.e. excluding outliers).
The whiskers are terminated by small lines.
Any points beyond the whiskers (i.e. outliers) are plotted individually.
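The features listed above can be worked out numerically before drawing a boxplot. The following Python sketch uses the marks from Example 13.2 and takes Q1 = 68.5 and Q3 = 89 as given from that example. The function name `boxplot_parts` and the layout of its result are our own assumptions.

```python
def boxplot_parts(values, q1, q3):
    """Whisker ends, outliers and extreme outliers for given quartiles."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    # Whiskers reach the most extreme observations that are not outliers.
    inside = [x for x in values if inner_low <= x <= inner_high]
    extreme = sorted(x for x in values if x < outer_low or x > outer_high)
    mild = sorted(x for x in values
                  if outer_low <= x < inner_low or inner_high < x <= outer_high)
    return {
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": mild,
        "extreme_outliers": extreme,
    }


marks = [88, 67, 64, 76, 86, 85, 82, 39, 75, 34,
         90, 63, 89, 89, 84, 81, 96, 100, 70, 96]
parts = boxplot_parts(marks, q1=68.5, q3=89)
print(parts)
# {'whisker_low': 39, 'whisker_high': 100, 'outliers': [34], 'extreme_outliers': []}
```

Here the IQR is 20.5, so values below 68.5 − 30.75 = 37.75 are outliers; only the mark of 34 qualifies, and the lower whisker therefore stops at 39.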
In the example in Figure 13.3, it can be seen that the median is around 74, Q1 is about
63, and Q3 is approximately 77. The numerous outliers provide a useful indicator that
this distribution is negatively-skewed as the long tail covers lower values of the variable.
Note also that Q3 − Q2 < Q2 − Q1 .
Figure 13.3: A boxplot showing a negatively-skewed distribution.
Activity 13.5 A group of cows was fed one of three experimental diets A, B or C.
After two weeks, the gain or loss in weight was recorded in kilograms.
Weight gain   Diet     Weight gain   Diet     Weight gain   Diet
 15           A          5           B         35           C
−10           A         25           B         55           C
  0           A         15           B         30           C
 −5           A          0           B        −15           C
 10           A        −10           B         45           C
 20           A         30           B         35           C
−55           A        −10           B         35           C
  5           A         15           B         20           C
 15           A          5           B         35           C
 25           A          0           B         25           C
(a) Produce boxplots of the data (one representing each diet).
(b) Interpret your diagram.
13.5 Summary
This unit has introduced some quantitative approaches to summarising data, known as
descriptive statistics. We have distinguished measures of location, dispersion and
skewness. Although descriptive statistics serve as a very basic form of statistical
analysis, they nevertheless are extremely useful for capturing the main characteristics of
a dataset. Therefore, any statistical analysis of data should start with visualising the
data (covered in Unit 12) and the calculation of descriptive statistics.

5-number summary (Arithmetic) mean
Boxplot Frequency table
Interquartile range Mean absolute deviation
Measures of dispersion Measures of location
Median Mode
Outliers Quartiles
Range Skewness
Standard deviation Summation operator
Trimmed mean Variance
Learning outcomes
interpret and summarise raw data on social science variables numerically
calculate basic measures of location and dispersion
describe the skewness of a distribution and interpret boxplots
Exercises
Exercise 13.1
The tables below show the scores of two groups of students in a test.
(a) Determine the value of x if the median of these marks is 2.5.
Mark 0 1 2 3 4
Frequency 3 x 8 5 10
(b) Determine the value of y if the mean of these marks is 2.5.
Mark 0 1 2 3 4
Frequency 3 y 8 5 10
Exercise 13.2
For variables measured at which measurement level(s) (nominal, ordinal, interval or
ratio) is the arithmetic mean the most appropriate?
Exercise 13.3
Asked whether they agree with a proposed increase in university tuition fees, the
following counts were obtained from a group of 75 respondents:
1. Strongly disagree 30
2. Disagree 15
3. Neither agree nor disagree 15
4. Agree 5
5. Strongly agree 10
Total 75
(a) What are the median and mode of the responses? Using the numerical scores in the
left-hand column (i.e. the scores 1 to 5), calculate the mean response. Briefly
discuss whether the mean is appropriate for this type of data.
(b) Do these data indicate that there is widespread dissatisfaction with the proposed
increase in university tuition fees? Justify your answer briefly.
Exercise 13.4
Display the data below using a boxplot and provide the 5-number summary.
3 2 4 8 7 19 2 5 3 4 10 12.
Exercise 13.5
(a) Do you expect the income distribution of the UK population to be symmetric,

positively skewed, or negatively skewed? Briefly explain your answer.
(b) Discuss the relative merits of the mean, the median and the mode for summarising
the income distribution of the UK population.
Exercise 13.6
A service station sells both unleaded petrol and diesel. It has recorded the following
frequency distribution for the number of gallons sold per car for the two fuels in a total
sample of 1,000 vehicles.
Unleaded (gallons) Frequency Diesel (gallons) Frequency
[0, 4.99] 74 [0, 4.99] 22
[5, 9.99] 192 [5, 9.99] 68
[10, 14.99] 280 [10, 14.99] 153
[15, 19.99] 105 [15, 19.99] 57
[20, 24.99] 23 [20, 24.99] 11
[25, 29.99] 6 [25, 29.99] 9
Total 680 Total 320
(a) Estimate the mean for these grouped data for unleaded and diesel separately.
(b) Do drivers of unleaded vehicles, or of diesel vehicles, fill up with more fuel, on
average? Give a possible reason for your answer.
(c) Suppose the service station expects to refuel 240 cars in a day, in the same
proportions as given in the above table. Suppose unleaded petrol costs $5.97 a
gallon and diesel costs $6.24 a gallon. Estimate the total daily income from the sale
of fuel.
Unit 14: Probability I

Introduction to probability theory
Overview
The world around us is an uncertain place. Will GDP growth next year be positive or
negative? Which political party will win a general election? What will the weather be
tomorrow? These are just a few examples. Yes, we know what could happen (for
example, positive GDP growth, negative GDP growth or no GDP growth) but we do
not know with certainty in advance what will happen. ‘Probability’ allows us to model
uncertainty and in this unit we explore probability theory.
Aims
This unit introduces the concept of probability and its role in modelling uncertainty.
to provide an insight into the concept of probability
to apply some common results from probability theory
to introduce statistical independence.
Background reading
(Palgrave, 2014) fourth edition [ISBN 9781137376558] ‘Probability’ Chapter 1.
14.1 Probability theory

Probability theory is used to determine how likely various events are. A few examples
include the likelihood of:
a machine in a factory breaking down
a person chosen at random being left-handed
when rolling a die, the upper face showing a ‘6’
when tossing a coin, the upper face showing ‘tails’.
Although probability is an interesting and important subject in its own right, our main
interest in probability arises due to its role in statistical inference. Inference is
surrounded by uncertainty (for example, to what extent is a sample representative of

the population?) and probability provides a sound theoretical basis for quantifying the
uncertainty involved.
14.1.1 Assigning probabilities

We would like to be able to carry out calculations with probabilities, and to work out
slightly harder probabilities from simpler ones. However, first, we need to know how to
work out these simpler probabilities. We describe three broad approaches. Fortunately,
these are consistent – there are no inherent contradictions between them. As we develop
these, we shall find some general laws which are true for all probabilities.
14.1.2 The classical method

This method involves an experiment, i.e. a process which produces outcomes, and an
event, which is an outcome of an experiment. The probability of an event, E, is the
ratio of the number of items in the population containing the event, say n, and the total
number of items in the population, say N . We write:
\[ P(E) = \frac{n}{N} \]
and say ‘the probability of the event E is n/N ’.
For example, a factory has 200 workers, of whom 70 are female. The experiment
consists of randomly selecting an employee – the event consists of randomly
selecting a female employee. In this case we have n = 70 and N = 200. We would
write P (female) = 70/200 = 0.35.
If we toss a fair (i.e. an unbiased) coin, the experiment is the toss of the coin, and
the event is obtaining a tail, say. Now n = 1 and N = 2, so P (tail) = 1/2.
Even at this stage we can draw some general conclusions. Clearly, n ≥ 0 and n ≤ N , i.e.
0 ≤ n ≤ N . It follows that:
\[ \frac{0}{N} \leq \frac{n}{N} \leq \frac{N}{N} \]
i.e. we have:
0 ≤ P (·) ≤ 1
where P (·) denotes the probability of some event. From this, we deduce that:
any probability lies between 0 and 1
if an event E is certain to occur, then P (E) = 1 (corresponding to n = N )
if an event E is certain not to occur, then P (E) = 0 (corresponding to n = 0).
It is also true that events which are equally likely to occur or not occur (for example,
tossing a fair coin and getting ‘tails’) have probabilities of 0.5. Sometimes probabilities
are converted to percentages, such as ‘there is a 70% chance of rain tomorrow’.
14.1.3 The relative frequency approach

This approach depends on historical data. It might be used if the classical method is
difficult, or impossible, to apply. The probability of an event is the number of times the
event has occurred divided by the number of opportunities for the event to have
occurred.
Suppose all the eggs leaving a farm are examined. If 100 are checked and 6 are found to
be cracked, we estimate the probability of a cracked egg to be 6/100 = 0.06. We cannot
use the classical method. An argument such as ‘there could be 0, 1, 2, . . . , 100 bad eggs,
so the probability of 6 bad eggs is 1/101’ would be neglecting the fact that there are
more ways to obtain 6 (or 50, say) bad eggs than to obtain 0 (or 1, say) bad eggs. A
complete listing of all outcomes contains many more than 101 outcomes.
The relative frequency approach can often be used when the classical method is
unsuitable. For example, it could be used to assess the probability of a biased coin
landing ‘tails’. If such a coin is tossed 500 times and there are 310 tails, then we
estimate the probability of getting tails as 310/500 = 0.62.
It could also be used to estimate the probability of a randomly-selected adult being
left-handed. Note that for a good estimate the denominator should be large. We would
find a better approximation if we checked a thousand eggs (or a million).
14.1.4 Subjective probabilities

This is the method which would be used when neither of the two other approaches is
viable. It could be used, for example, to estimate the probabilities that:
Party A wins an election
a student passes an examination
the defendant is guilty of a crime.
Sometimes it will be based on knowledge and experience; sometimes it will be little

more than a blind guess.
Such subjective probabilities can, and should, be updated in light of new information.
For the examples above new information might be:
Party A elects a new leader
a student’s performance in a mock examination
a witness giving evidence in court.
In order to extend our study of probability, we now develop some helpful terms and
symbols.
14.2 Terminology
As we have seen, an experiment is a process which produces outcomes – for example,
rolling a die or selecting 100 eggs from a farm and seeing how many are cracked.
An event is an outcome of an experiment – obtaining a ‘6’ with the die, obtaining an

even number with the die, finding 99 good eggs followed by one cracked egg, finding a
total of 99 good eggs etc.
A sample space, S, is a complete listing of all possible events, called elementary
events. The sample space for the roll of a single die is written S = {1, 2, 3, 4, 5, 6}.
Sometimes – when there are very many, or infinitely many possible events – it is
difficult or impossible to list all the elements of the sample space, although it can often
be described in other ways. However, if we can list all of the possible events, the sample
space can help us find probabilities.
Example 14.1 Suppose an experiment involves rolling two fair dice. What is the
probability that the sum of the scores is 7?
We can set out the sample space as follows:
(1, 1) (2, 1) (3, 1) (4, 1) (5, 1) (6, 1)

(1, 2) (2, 2) (3, 2) (4, 2) (5, 2) (6, 2)
(1, 3) (2, 3) (3, 3) (4, 3) (5, 3) (6, 3)
(1, 4) (2, 4) (3, 4) (4, 4) (5, 4) (6, 4)
(1, 5) (2, 5) (3, 5) (4, 5) (5, 5) (6, 5)
(1, 6) (2, 6) (3, 6) (4, 6) (5, 6) (6, 6)
We can see that there are six outcomes with a sum of 7, namely (6, 1), (5, 2), (4, 3),
(3, 4), (2, 5) and (1, 6), out of 36 possible events. Hence the probability that the sum
is 7 is 6/36 = 1/6.
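The sample space in Example 14.1 is small enough to list by computer. The sketch below, with variable names of our own choosing, enumerates all 36 equally likely outcomes and applies the classical method, P(E) = n/N.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))

# The event of interest: the two scores sum to 7.
favourable = [outcome for outcome in sample_space if sum(outcome) == 7]

probability = Fraction(len(favourable), len(sample_space))

print(len(sample_space))   # 36
print(favourable)          # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
print(probability)         # 1/6
```

Using `Fraction` keeps the answer exact rather than as a rounded decimal.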
14.3 Sets
Numbers or objects enclosed in braces can be thought of as sets. Sets themselves are
simply collections of objects. For example:
the collection of all outcomes when a die is rolled is the set {1, 2, 3, 4, 5, 6}
the collection of all positive integers is the set {1, 2, 3, . . .}
the colours of the rainbow form the set {red, orange, yellow, green, blue, indigo,
violet}.
Sets can be represented by Venn diagrams, like the one in Figure 14.1. The set A is
shown as an oval, and the rectangle is the sample space, or ‘universal set’. If
A = {1, 2, 3, 4, 5, 6}, the universal set might be all positive integers; if A = {red, orange,
yellow, green, blue, indigo, violet}, the universal set might be all possible colours.
Union of sets
The union of two sets A and B consists of those elements in A or B or both, and is
written:
A ∪ B.
Figure 14.1: A Venn diagram showing the set A in the universal set (sample space).
No value is listed more than once in the union. For example, if A = {1, 4, 7, 9} and
B = {2, 3, 4, 5, 6}, then A ∪ B = {1, 2, 3, 4, 5, 6, 7, 9}. If A and B are represented by two
(overlapping) shaded ovals, the union is represented by the entire shaded region, as
shown in Figure 14.2.
Figure 14.2: A Venn diagram showing the union of the sets A and B (the shaded region).
Intersection of sets
The intersection of two sets A and B consists of those elements common to both
A and B, and is written:
A ∩ B.
No value is listed more than once in the intersection. If A = {1, 4, 7, 9} and

B = {2, 3, 4, 5, 6}, then A ∩ B = {4}. If A and B are represented by two (overlapping)
ovals, the intersection is where they intersect! Figure 14.3 shows an example. Usually
the union is a larger set than the intersection (it can never be a smaller set).
We sometimes need to talk about a set with no elements. This is called the empty set
and it is denoted ∅. For example:
{1, 3, 5, 7} ∩ {2, 4, 6, 8} = ∅.
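The union, intersection and empty set examples above map directly onto Python's built-in `set` type. This is only an illustrative sketch using the same sets A and B as in the text.

```python
# The sets used in the union and intersection examples.
A = {1, 4, 7, 9}
B = {2, 3, 4, 5, 6}

union = A | B          # elements in A or B or both
intersection = A & B   # elements common to A and B

print(sorted(union))   # [1, 2, 3, 4, 5, 6, 7, 9]
print(intersection)    # {4}

# Two sets with no common elements have the empty set as their intersection.
odds_and_evens = {1, 3, 5, 7} & {2, 4, 6, 8}
print(odds_and_evens == set())  # True
```

Note that the value 4 appears once in the union even though it belongs to both sets.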
Two events are mutually exclusive if the existence of one precludes the other. The
events ‘male’ and ‘female’ are mutually exclusive when we observe gender. The
outcomes ‘cracked’ and ‘not cracked’ are mutually exclusive when we sample eggs.
Figure 14.3: A Venn diagram showing the intersection of the sets A and B (the shaded region).
Mutually exclusive events
In general, if A and B are mutually exclusive, the event A ∩ B is certain not to occur.
Hence:
P (A ∩ B) = 0
if A and B are mutually exclusive. Or, equivalently, P (A ∩ B) = 0 if A ∩ B = ∅.
14.4 Independence
Two events are independent if the occurrence or non-occurrence of one does not affect
the occurrence or non-occurrence of the other. For example:
whether Party A wins the election is independent of whether your car breaks down
today
coin tosses are independent of each other – the event of getting ‘heads’ on the first
toss is independent of getting ‘heads’ on the second toss.
However:
whether Party A wins the election is not independent of who the party leader is
whether a die shows a ‘6’ is not independent of whether it shows an even number.
We let P(A | B) denote the probability of A occurring given that B has occurred; this is
called a conditional probability. If A and B are independent, then the probability of
A occurring given that B has occurred is just the probability of A occurring. That is, if
A and B are independent, then:
P(A | B) = P(A) and P(B | A) = P(B).
Hence if A and B are not independent (i.e. are dependent), then P(A | B) ≠ P(A) and
P(B | A) ≠ P(B).
For example, a person’s handedness is presumably independent of whether they prefer
tea or coffee, so:
P (prefers tea | person is right-handed) = P (prefers tea).
14.5 Complementary events

The complement of event A is denoted A^c.¹ The complement of A contains all the
elementary events which are not in A.
If, in rolling a fair die, event A is getting an even number, A^c is the event getting
an odd number.
If event A is getting a cracked egg from a sample, A^c is finding a good
(non-cracked) egg.
If A is the event that it rains tomorrow, A^c is the event that there is no rain
tomorrow.
If the occurrence of event A corresponds to one of n elementary events out of a total of
N elementary events, then A^c corresponds to N − n elementary events. Note:
$$\frac{N - n}{N} = \frac{N}{N} - \frac{n}{N}$$
and we deduce the following important result.
Complementary events
If A is some event whose complement is A^c, then:
P(A^c) = 1 − P(A).
Example 14.2 Consider the following.
If the probability of rain tomorrow is 0.6, then the probability of no rain

tomorrow is 1 − 0.6 = 0.4.
If the probability of selecting a female employee from a workforce is 0.35, the

probability of selecting a male employee is 1 − 0.35 = 0.65.
14.6 Additive laws

We now look at several ways of combining probabilities. Suppose that event A arises
from n elementary events, event B arises from p elementary events, with A and B
mutually exclusive. Therefore, the event A or B, A ∪ B, arises from n + p elementary
events. Hence:
$$P(A \cup B) = \frac{n + p}{N} = \frac{n}{N} + \frac{p}{N} = P(A) + P(B)$$
if there are N elementary events altogether. From this we deduce the important result
that if A and B are mutually exclusive:
P (A ∪ B) = P (A) + P (B). (14.1)
¹ Other accepted forms of notation for the complement of A include Ā and A′.
Example 14.3 We saw previously that the probability of obtaining a sum of 7

with two fair dice is 1/6. The probability of obtaining a sum of 12 is 1/36, since only
one event, (6, 6), out of 36 will achieve this. Obviously, totals of 7 and 12 are
mutually exclusive, so the probability of 7 or 12 is 1/6 + 1/36 = 7/36.
Example 14.4 Suppose movies are classified at a store. Let C and H be the events
that the movie rented by the next customer is comedy and horror, respectively.
Suppose P (C) = 0.26 and P (H) = 0.18. Using (14.1), we have:
P (C ∪ H) = 0.26 + 0.18 = 0.44.
Hence the probability that it is neither comedy nor horror is 1 − 0.44 = 0.56.
(Of course, we are assuming that there are no horror comedies!)
Let us look again at the Venn diagram for the union of two sets in Figure 14.2. We shall
use the notation n(A ∪ B) for the number of elements in A ∪ B, and similarly for n(A),
n(B) and n(A ∩ B). Now count n(A ∪ B). It is not n(A) + n(B) because the part in
A ∩ B will have been counted twice. If we subtract n(A ∩ B) we will be right. So:
n(A ∪ B) = n(A) + n(B) − n(A ∩ B).
Hence:
$$\frac{n(A \cup B)}{N} = \frac{n(A)}{N} + \frac{n(B)}{N} - \frac{n(A \cap B)}{N}$$
where there are N possible events altogether. However, n(A ∪ B)/N = P (A ∪ B),
n(A)/N = P (A), and so on. We deduce the general law of addition.
General law of addition
For any events A and B, then:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B). (14.2)
(14.2) holds for any events A and B, regardless of whether or not they are mutually
exclusive. (14.1) is a special case of (14.2) where A ∩ B = ∅, i.e. when P (A ∩ B) = 0.
Example 14.5 What is the probability of drawing a queen or a heart from a

well-shuffled deck of 52 playing cards?
If the two events are denoted Q and H, respectively, then P (Q) = 4/52 and
P (H) = 13/52. The answer is not 4/52 + 13/52 because they are not mutually
exclusive – consider the Queen of Hearts! However, we can use (14.2), noting that
P (Q ∩ H) = 1/52, to obtain:
$$P(Q \cup H) = P(Q) + P(H) - P(Q \cap H) = \frac{4}{52} + \frac{13}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}.$$
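The general law of addition in Example 14.5 can be verified by building the deck and counting cards. This is a sketch under the assumption of a standard 52-card deck with each card equally likely; the names `deck`, `Q`, `H` and `prob` are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))  # 52 (rank, suit) pairs

Q = {card for card in deck if card[0] == "Q"}       # the four queens
H = {card for card in deck if card[1] == "hearts"}  # the thirteen hearts


def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(len(event), len(deck))


direct = prob(Q | H)                           # count the union directly
by_law = prob(Q) + prob(H) - prob(Q & H)       # general law of addition (14.2)
print(direct, by_law)  # 4/13 4/13
```

Subtracting `prob(Q & H)` removes the Queen of Hearts, which would otherwise be counted twice.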
Example 14.6 If 16% of the population are left-handed, 30% are overweight, but
only 25% of left-handers are overweight, what is the probability of a
randomly-selected person being left-handed or overweight?
Let L and O be the events of the person being left-handed and overweight,
respectively. We have P(L) = 0.16 and P(O) = 0.3. Since 25% of left-handers are
overweight, P(L ∩ O) = 0.25 × 0.16 = 0.04. Applying
(14.2) gives:
P (L ∪ O) = P (L) + P (O) − P (L ∩ O) = 0.16 + 0.3 − 0.04 = 0.42.
14.7 Multiplicative laws

Consider the following.
The probability of rolling a ‘6’ with a fair die is 1/6.
The probability of tossing a head with a fair coin is 1/2.
The probability of getting a ‘6’ and a head is 1/12, because (6, H) is just one
elementary outcome out of 12, written in full as:
(1, H), (2, H), (3, H), (4, H), (5, H), (6, H),
(1, T ), (2, T ), (3, T ), (4, T ), (5, T ), (6, T ).
Notice that 1/6 × 1/2 = 1/12. We have been able to multiply the probabilities because
the two events are independent.
Independent events
If A and B are independent events, then:
P (A ∩ B) = P (A) P (B). (14.3)
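The multiplication rule for independent events can be checked on the die-and-coin sample space listed above, and contrasted with a pair of dependent events from Section 14.4. This is a minimal sketch with illustrative names, assuming all 12 outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# The 12 equally likely outcomes (die score, coin face).
outcomes = set(product(range(1, 7), "HT"))


def prob(event):
    return Fraction(len(event), len(outcomes))


six = {o for o in outcomes if o[0] == 6}
head = {o for o in outcomes if o[1] == "H"}
even = {o for o in outcomes if o[0] % 2 == 0}

# Independent events: P(A and B) equals P(A) P(B).
print(prob(six & head), prob(six) * prob(head))  # 1/12 1/12

# Dependent events: a '6' is always even, so the product rule fails.
print(prob(six & even), prob(six) * prob(even))  # 1/6 1/12
```

The second comparison shows why (14.3) must not be applied without first checking independence.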
Example 14.7 Suppose 90% of UK adults drive a car and 60% have a broadband
connection. What is the probability that a UK adult drives and has a broadband
connection?
Well, it is tempting to use (14.3) to say that the probability is 0.9 × 0.6 = 0.54.
However, this depends on the two events being independent. If they are, the answer
is correct; if not, we do not have enough information to solve the problem.
Activity 14.1 A chain is formed from n links. The strengths of the links are
mutually independent, and the probability that any one link fails under a specified
load is q. What is the probability that the chain fails under the load?
Recall that P (A | B) is the probability of event A given that event B has occurred. The
event A ∩ B (A and B) will occur if A occurs and if B occurs, so once we know B has
occurred with probability P (B) we can use P (A | B) to find the probability that A ∩ B
occurs, i.e. we have:
P (A ∩ B) = P (B) P (A | B).
We could also argue that P (A ∩ B) = P (A) P (B | A).
General law of multiplication
For any events A and B, we have:
P (A ∩ B) = P (A) P (B | A) = P (B) P (A | B). (14.4)
(14.3) is a special case of (14.4). If A and B are independent, then P (B | A) = P (B)

and P (A | B) = P (A), hence the general law reduces to (14.3).
Example 14.8 Suppose a company has 140 employees, of whom 30 are supervisors.
80 of the employees are married, and 20% of the married employees are supervisors.
What is the probability that a random employee is a married supervisor?
We let M denote married, S denote supervisor and we require P (M ∩ S). We know
that P (M ) = 80/140 = 4/7 and P (S | M ) = 20/100 = 1/5. So, applying (14.4), we
obtain:
$$P(M \cap S) = P(M)\, P(S \mid M) = \frac{4}{7} \times \frac{1}{5} = \frac{4}{35} \approx 0.1143.$$
Do not confuse ‘mutually exclusive’ and ‘independent’! ‘Mutually exclusive’ means two
events cannot occur simultaneously. ‘Independent’ means that the occurrence or
non-occurrence of one event does not affect the occurrence or non-occurrence of the
other. These are not the same thing at all!
Activity 14.2 Suppose A and B are events with P (A) = 0.2, P (B) = p and
P (A ∪ B) = 0.6.
(a) Evaluate p and P (A | B) if A and B are mutually exclusive events.
(b) Evaluate p and P (A | B) if A and B are independent events.
14.8 Bayes’ theorem

Bayes’ theorem has far-reaching consequences for statistical inference. We can
consider it in two ways.
As a tool for handling conditional probabilities, specifically the connection between
P (A | B) and P (B | A).
As a means for modifying probabilities in light of new information.
We state the theorem in three forms, of increasing generality, illustrating both ideas.
The first two will be justified, but not the third (the proof is beyond the scope of the
course).
14.8.1 Version 1
Bayes’ theorem (version 1)
For any events A and B, where P(A) ≠ 0, then:
$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}. \tag{14.5}$$
This is easily justified. We know that P (A ∩ B) = P (A) P (B | A) and also that

P (A ∩ B) = P (B) P (A | B). Hence:
P (A) P (B | A) = P (B) P (A | B).
Dividing both sides by P(A) gives the desired result, noting that P(A) ≠ 0.
Example 14.9 Suppose we know that 5% of companies in a certain sector go

bankrupt, while 10% of companies in the sector are unable to meet demand. Of
those which go bankrupt, 20% have been unable to meet demand. If a company is
unable to meet demand, what is the probability that it goes bankrupt?
We can think of this as an exercise in conditional probability, or as modifying the
probability of bankruptcy (5%) in light of the new information about unmet demand
for their products.
Let the event A denote when a company is unable to meet demand and let the event
B denote when a company goes bankrupt.
We know P (A) = 0.1, P (B) = 0.05 and P (A | B) = 0.2. We require P (B | A) and,
using (14.5), this is:
$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)} = \frac{0.2 \times 0.05}{0.1} = \frac{0.01}{0.1} = 0.1.$$
So the unconditional probability of 5% has been changed to 10% in light of the new
information about unmet demand.
14.8.2 Version 2
Let B be an event and B^c be its complement. If another event A occurs, then:
$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A \mid B)\, P(B) + P(A \mid B^c)\, P(B^c)}. \tag{14.6}$$
This follows from the first version, which has denominator P(A). Since either B or B^c
must occur, we can deduce that P(A) = P(A ∩ B) + P(A ∩ B^c) (a Venn diagram may
make this clear). Also, P(A ∩ B) = P(A | B) P(B) and P(A ∩ B^c) = P(A | B^c) P(B^c) by
two applications of (14.4), so:
P(A) = P(A | B) P(B) + P(A | B^c) P(B^c)
hence the result.
Example 14.10 Suppose a department store is considering new arrangements for

credit. The credit manager has suggested that credit should be discontinued to any
customer who has twice or more been late with repayments. She supports her claim
by noting that past records show that 90% of those defaulting were late with their
repayments at least twice. Separate investigations show that 2% of customers
actually default on their repayments, and 45% of those not defaulting have had at
least two late payments. What is the probability that a customer with two or more
late payments actually defaults? Comment on the manager’s proposals.
Let L denote a credit customer who is late with repayments at least twice and let D
denote a customer who defaults on payments.
We require P (D | L), so using (14.6) we express this as:
$$P(D \mid L) = \frac{P(L \mid D)\, P(D)}{P(L \mid D)\, P(D) + P(L \mid D^c)\, P(D^c)}.$$
We know that P(L | D) = 0.90, P(D) = 0.02, P(L | D^c) = 0.45 and:
P(D^c) = 1 − P(D) = 1 − 0.02 = 0.98.
(14.6) now gives us:
$$P(D \mid L) = \frac{0.90 \times 0.02}{0.90 \times 0.02 + 0.45 \times 0.98} = 0.0392 \approx 0.04.$$
This is a surprising result. If the credit manager’s plan is adopted, only about 1
customer in 25 who loses credit would actually have defaulted. So we would lose 24
good credit customers to detect one defaulter. This is bad business! So we should
reject the proposal.
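The calculation in Example 14.10 follows a fixed recipe: form the two products in the denominator of (14.6), add them to get P(L), then divide. The sketch below wraps that recipe in a small helper; the function name `bayes_two_way` and its parameter names are hypothetical, chosen only for illustration.

```python
def bayes_two_way(p_a_given_b, p_b, p_a_given_not_b):
    """Return P(B | A) using version 2 of Bayes' theorem, equation (14.6)."""
    p_not_b = 1 - p_b                                          # complement rule
    p_a = p_a_given_b * p_b + p_a_given_not_b * p_not_b        # total probability of A
    if p_a == 0:
        raise ValueError("P(A) is zero, so P(B | A) is undefined")
    return p_a_given_b * p_b / p_a


# Example 14.10: B = customer defaults (D), A = at least two late payments (L).
p_default_given_late = bayes_two_way(p_a_given_b=0.90, p_b=0.02, p_a_given_not_b=0.45)
print(round(p_default_given_late, 4))  # 0.0392
```

Changing `p_b` shows how strongly the answer depends on the low overall default rate of 2%.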
14.8.3 Version 3
Suppose the events B_1, B_2, ..., B_n are mutually exclusive and collectively exhaustive,
that is, some B_i must occur but no two distinct B_i can occur together. For any
i = 1, 2, ..., n, then:
$$P(B_i \mid A) = \frac{P(A \mid B_i)\, P(B_i)}{P(A \mid B_1)\, P(B_1) + P(A \mid B_2)\, P(B_2) + \cdots + P(A \mid B_n)\, P(B_n)}. \tag{14.7}$$
As previously mentioned, we omit the proof of this version. However, note that:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i).$$
Example 14.11 Machines 1, 2 and 3 all produce the same two parts A and Z. Of
all the parts produced, machine 1 produces 60%, machine 2 produces 30% and
machine 3 produces 10%. In addition, 40% of parts made by machine 1 are part A,
50% of parts made by machine 2 are part A, and 70% of parts made by machine 3
are part A. A part is randomly selected and is found to be an A part. With this
knowledge, what are the revised probabilities that it came from machines 1, 2 and 3,
respectively?
Let A be the event that we have randomly selected an A part. We can usefully put
calculations in a table:
Event               P(B_i)   P(A | B_i)   P(A | B_i) P(B_i)   P(B_i | A)
B_1 (came from 1)   0.6      0.4          0.24                0.24/0.46 = 0.52
B_2 (came from 2)   0.3      0.5          0.15                0.15/0.46 = 0.33
B_3 (came from 3)   0.1      0.7          0.07                0.07/0.46 = 0.15
                                          P(A) = 0.46
We have used (14.7) in the form:
$$P(B_i \mid A) = \frac{P(A \mid B_i)\, P(B_i)}{P(A \mid B_1)\, P(B_1) + P(A \mid B_2)\, P(B_2) + P(A \mid B_3)\, P(B_3)}$$
and, using the knowledge that the part was an A part, we have been able to find revised
probabilities of 0.52, 0.33 and 0.15 of the part having come from machines 1, 2 and 3,
respectively, in place of the original probabilities of 0.6, 0.3 and 0.1. The unmodified
and modified probabilities are sometimes called prior and posterior probabilities,
respectively.
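The table in Example 14.11 can be reproduced column by column. The following is a sketch only: the dictionary names `priors`, `likelihoods`, `joint` and `posteriors` are illustrative labels for the table columns, not terms used in the guide apart from 'prior' and 'posterior'.

```python
# Prior probabilities P(B_i): share of all parts made by each machine.
priors = {1: 0.6, 2: 0.3, 3: 0.1}
# Conditional probabilities P(A | B_i): share of each machine's output that is part A.
likelihoods = {1: 0.4, 2: 0.5, 3: 0.7}

# Fourth column of the table: P(A | B_i) P(B_i).
joint = {m: likelihoods[m] * priors[m] for m in priors}

# Denominator of (14.7): P(A) is the sum of that column.
p_a = sum(joint.values())

# Final column: posterior probabilities P(B_i | A).
posteriors = {m: joint[m] / p_a for m in joint}

for m in priors:
    print(m, round(joint[m], 2), round(posteriors[m], 2))
print("P(A) =", round(p_a, 2))
```

The posterior probabilities necessarily sum to 1, which is a useful check on a hand calculation.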
14.9 Summary – a listing of probability results

We conclude with a summary of the key probability results presented in this unit.
1. 0 ≤ P (A) ≤ 1, for any event A.
2. When combining events, ∪ means ‘or’ and ∩ means ‘and’.
3. P (A ∩ B) = 0 if A and B are mutually exclusive events.
4. P (Ac ) = 1 − P (A), where Ac is the complement of event A.
5. P (A ∪ B) = P (A) + P (B) for mutually exclusive events A and B.
6. P (A ∪ B) = P (A) + P (B) − P (A ∩ B) for any events A and B.
7. If A and B are independent, then P(A | B) = P(A) and P(B | A) = P(B).
8. P (A ∩ B) = P (A) P (B) if A and B are independent.
9. P (A ∩ B) = P (A) P (B | A) = P (B) P (A | B) for any events A and B.
10. Bayes’ theorem – three versions:
Version 1: For any events A and B, then:
$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}.$$
Version 2: For events A and B, then:
$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A \mid B)\, P(B) + P(A \mid B^c)\, P(B^c)}.$$
Version 3: For mutually exclusive and collectively exhaustive events
B_1, B_2, ..., B_n, for any i = 1, 2, ..., n, then:
$$P(B_i \mid A) = \frac{P(A \mid B_i)\, P(B_i)}{P(A \mid B_1)\, P(B_1) + P(A \mid B_2)\, P(B_2) + \cdots + P(A \mid B_n)\, P(B_n)}.$$
Example 14.12 Five bonds are rated A+, A, B+, B or C, depending on the
stability of the issuing firm. An inexperienced bond buyer selects two different bonds
at random from these five bonds (i.e. without replacement).
(a) What is the probability that she does not select the C-rated bond?
(b) What is the probability that she selects only the A+ and A-rated bonds?
(a) Let C_1^c be the event that the first selected bond is not the C-rated bond, and C_2^c
be the event that the second selected bond is not the C-rated bond. We require
P(C_1^c ∩ C_2^c) = P(C_1^c) P(C_2^c | C_1^c), and we know that P(C_1^c) = 4/5. To find
P(C_2^c | C_1^c), we note that if the first bond is not the C-rated one, there are four
remaining: one is the C-rated bond, the others are not. So P(C_2^c | C_1^c) = 3/4,
hence:
$$P(C_1^c \cap C_2^c) = P(C_1^c)\, P(C_2^c \mid C_1^c) = \frac{4}{5} \times \frac{3}{4} = \frac{3}{5}.$$
(b) The investor only selects the A+ and A-rated bonds if the first one is A-rated
and the second A+-rated, or the first is A+-rated and the second is A-rated.
The first of these probabilities is 1/5 × 1/4, by a similar argument to the first
part. The second is also 1/5 × 1/4. So the probability of the investor selecting
these two bonds is:
$$\frac{1}{20} + \frac{1}{20} = \frac{1}{10}.$$
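Both answers in Example 14.12 can be cross-checked by a different route from the conditional-probability argument used above: list every unordered pair of bonds and count. This sketch assumes each of the 10 possible pairs is equally likely, which is what selecting two bonds at random without replacement implies.

```python
from fractions import Fraction
from itertools import combinations

bonds = ["A+", "A", "B+", "B", "C"]

# All unordered pairs of two different bonds (order of selection ignored).
pairs = list(combinations(bonds, 2))

# (a) Pairs which do not contain the C-rated bond.
no_c = [pair for pair in pairs if "C" not in pair]

# (b) The single pair made up of the A+ and A-rated bonds.
only_a = [pair for pair in pairs if set(pair) == {"A+", "A"}]

p_no_c = Fraction(len(no_c), len(pairs))
p_only_a = Fraction(len(only_a), len(pairs))
print(len(pairs), p_no_c, p_only_a)  # 10 3/5 1/10
```

Agreement between the counting approach and the multiplication law is a good sign that the conditional probabilities were set up correctly.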
Example 14.13 A recent survey of 1,700 companies showed that 49% performed
studies of marketing effectiveness, 61% conducted short-term sales forecasts, and
38% undertook both activities. Let A denote the firm studies marketing effectiveness
and let B denote the firm produces short-term sales forecasts. Find P (A ∪ B),
P (A | B) and determine how many of the firms undertook both A and B.
Note that P (A) = 0.49, P (B) = 0.61 and P (A ∩ B) = 0.38 directly. So:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.49 + 0.61 − 0.38 = 0.72
and:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{0.38}{0.61} \approx 0.62.$$
We would say 0.38 × 1700 = 646 firms undertook A and B.
Example 14.14 The table below gives the marital status of adults in a country by
sex in terms of proportions of the total population.
          Single   Married   Widowed   Divorced   Total
Male      0.116    0.319     0.012     0.028      0.475
Female    0.093    0.325     0.066     0.041      0.525
Total     0.209    0.644     0.078     0.069      1.000
We can make the following observations, comments and deductions.
There are many more widowed women than widowed men.
There are more women than men overall.
There are more women than men in the ratio 0.525 : 0.475 = 21 : 19.
Most people of both sexes are married.
P (male) = 0.475, i.e. 47.5% of the population are male.
P (male ∩ divorced) = 0.028, i.e. 2.8% of adults are divorced men.
P (divorced | male) = P (male ∩ divorced)/P (male) = 0.028/0.475 = 0.059, i.e.

5.9% of adult males are divorced.
P (male | divorced) = P (male ∩ divorced)/P (divorced) = 0.028/0.069 = 0.406,

i.e. 40.6% of divorced adults are male.
If we also know that the total adult population is 13.6 million, we can convert the
proportions to absolute numbers. The resulting table is known as a contingency table.
          Single      Married     Widowed     Divorced   Total
Male      1,577,600   4,338,400   163,200     380,800    6,460,000
Female    1,264,800   4,420,000   897,600     557,600    7,140,000
Total     2,842,400   8,758,400   1,060,800   938,400    13,600,000
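The marginal and conditional probabilities quoted in Example 14.14 can be recomputed from the table of proportions. This is a sketch with an illustrative nested-dictionary layout; the population figure of 13.6 million is the one assumed in the example.

```python
# Joint proportions P(sex and marital status) from the first table.
joint = {
    "male":   {"single": 0.116, "married": 0.319, "widowed": 0.012, "divorced": 0.028},
    "female": {"single": 0.093, "married": 0.325, "widowed": 0.066, "divorced": 0.041},
}

# Marginal probabilities are row or column totals.
p_male = sum(joint["male"].values())
p_divorced = sum(row["divorced"] for row in joint.values())

# Conditional probability = joint probability / probability of the given event.
p_divorced_given_male = joint["male"]["divorced"] / p_male
p_male_given_divorced = joint["male"]["divorced"] / p_divorced
print(round(p_divorced_given_male, 3), round(p_male_given_divorced, 3))  # 0.059 0.406

# Converting one proportion to an absolute number for the contingency table.
population = 13_600_000
divorced_men = round(joint["male"]["divorced"] * population)
print(divorced_men)  # 380800
```

The two conditional probabilities share a numerator but have different denominators, which is why they answer different questions.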
Example 14.15 Two events A and B are independent with P (A) = 0.3 and
P (B) = 0.1.
(a) Are A and B mutually exclusive? Give a reason.
(b) Determine P (A | B) and P (B | A).
(c) Determine P(A ∪ B).
(d) Determine P (Ac ∩ B c ).
(a) P (A ∩ B) = P (A) P (B) given the independence of A and B, so:
P(A ∩ B) = 0.3 × 0.1 = 0.03 ≠ 0.
Therefore, the event A ∩ B can occur, i.e. A and B are not mutually exclusive.
(b) Using independence, P (A | B) = P (A) and P (B | A) = P (B). So P (A | B) = 0.3

and P (B | A) = 0.1.
(c) We have:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.3 + 0.1 − (0.3 × 0.1) = 0.3 + 0.1 − 0.03 = 0.37.
(d) Look at the Venn diagram in Figure 14.2. The white area represents both
Ac ∩ B c and (A ∪ B)c . These are the same set, i.e. Ac ∩ B c = (A ∪ B)c . Hence:
P (Ac ∩ B c ) = P ((A ∪ B)c ) = 1 − P (A ∪ B) = 0.63.
14.10 Summary
This unit has introduced the fundamentals of probability theory and we have seen some
important probability results, including conditional probability and independence. A
good grounding in probability theory is necessary before moving on to probability
distributions in the next unit.
Bayes’ theorem Complement

Conditional probability Event
Experiment Independent
Intersection Law of addition
Law of multiplication Mutually exclusive
Outcomes Probability theory
Sample space Sets
Union Venn diagrams
Learning outcomes
apply the ideas and notation used for sets in simple examples
recall some common probability results
explain the ideas of conditional probability and independence
Exercises
Exercise 14.1
Let K be the event of drawing a ‘king’ from a well-shuffled deck of playing cards. Let D
be the event of drawing a ‘diamond’ from the pack. Determine:
(a) P (K)
(b) P (D)
(c) P (K c )
(d) P (K ∩ D)
(e) P (K ∪ D)
(f) P (K | D)
(g) P (D | K)
(h) P (D ∪ K c )
(i) P (Dc ∩ K)
(j) P ((Dc ∩ K) | (D ∪ K)).
Are the events D and K independent, mutually exclusive, neither or both?
Exercise 14.2
If A and B are independent events such that P (A) = 0.2 and P (B) = 0.6, what is
P (Ac ∩ B c )?
Exercise 14.3
A and B are two mutually exclusive events. State what this means:
(a) in words
(b) using set notation.
Exercise 14.4
A student has an important job interview in the morning. To ensure he wakes up in
time, he sets two alarm clocks which ring with probabilities 0.97 and 0.99, respectively.
What is the probability that at least one of the alarm clocks will wake him up?
Exercise 14.5
20% of men show early signs of losing their hair. 2% of men carry a gene that is related
to hair loss. 80% of men who carry the gene experience early hair loss.
(a) What is the probability that a man carries the gene and experiences early hair loss?
(b) What is the probability that a man carries the gene, given that he experiences
early hair loss?
Exercise 14.6
Tower Construction Company (‘Tower’) is determining whether it should submit a bid
for a new shopping centre. In the past, Tower’s main competitor, Skyrise Construction
Company (‘Skyrise’), has submitted bids 80% of the time. If Skyrise does not bid on a
job, the probability that Tower will get the job is 0.6. If Skyrise does submit a bid, the
probability that Tower gets the job is 0.35.
(a) What is the probability that Tower will get the job?
(b) If Tower gets the job, what is the probability that Skyrise made a bid?
(c) If Tower did not get the job, what is the probability that Skyrise did not make a
bid?
Exercise 14.7
In a large lecture, 60% of students are female and 40% are male. Records show that
15% of female students and 20% of male students are registered as part-time students.
(a) If a student is chosen at random from the lecture, what is the probability that the
student studies part-time?
(b) If a randomly chosen student studies part-time, what is the probability that the
student is male?
Exercise 14.8
James is a salesman for a company and sells two products, A and B. He visits three
different customers each day. For each customer, the probability that James sells
product A is 1/3 and the probability is 1/4 that he sells product B. The sale of product
A is independent of the sale of product B during any visit, and the results of the three
visits are mutually independent. Calculate the probability that James will:
(a) sell both products, A and B, on the first visit
(b) sell only one product during the first visit
(c) make no sales of product A during the day
(d) make at least one sale of product B during the day.
Exercise 14.9
Given two events, A and B, state why each of the following is not possible. Use
formulae or equations to illustrate your answer.
(a) P (A) = −0.46.
(b) P (A) = 0.26 and P (Ac ) = 0.62.
(c) P (A ∩ B) = 0.92 and P (A ∪ B) = 0.42.
(d) P (A ∩ B) = P (A) P (B) and P (B) > P (B | A).
Exercise 14.10
At a local school, 90% of the students took test A, and 15% of the students took both
test A and test B. Based on the information provided, which of the following
calculations are not possible, and why? What can you say based on the data?
(a) P (B | A).
(b) P (A | B).
(c) P (A ∪ B).
If you knew that everyone who took test B also took test A, how would that change
your answers?
Exercise 14.11
A company is concerned about interruptions to email. It was noticed that problems
occurred on 15% of workdays. To see how bad the situation is, calculate the
probabilities of an interruption during a five-day working week:
(a) on Monday and again on Tuesday
(b) for the first time on Thursday
(c) every day
(d) at least once during the week.
Exercise 14.12
A restaurant manager classifies customers as well-dressed, casually-dressed or
poorly-dressed and finds that 50%, 40% and 10%, respectively, fall into these categories.
The manager found that wine was ordered by 70% of the well-dressed, by 50% of the
casually-dressed and by 30% of the poorly-dressed.
(a) What is the probability that a randomly chosen customer orders wine?
(b) If wine is ordered, what is the probability that the person ordering is well-dressed?
(c) If wine is not ordered, what is the probability that the person ordering is
poorly-dressed?
Unit 15: Probability II

Probability distributions
Overview
The previous unit introduced probability as a means for modelling uncertainty. We now
consider the probabilities attached to all possible outcomes of a chance experiment, that
is, how probability is distributed across the sample space. Just as we used descriptive
statistics to summarise important features of sample datasets, here we learn how to
calculate equivalent features of population probability distributions.
Aims
This unit explores probability distributions and how to calculate the expected value and
variance for discrete random variables. Particular aims are:
to introduce some common discrete probability distributions
to explore properties of such distributions such as the expected value and variance.
Background reading
15.1 Random variables

A random variable is a variable which contains the outcomes of a chance experiment.
Alternatively, a random variable is a description of the possible outcomes of an
experiment together with the probabilities of each outcome occurring. These, and other
possible definitions, are somewhat abstract, so we illustrate with some examples.
Example 15.1 Examine the outcomes when two fair dice are rolled. We consider
the random variable X which is the sum of the shown scores. We can read off the
various possibilities from the sample space:
(1, 1) (2, 1) (3, 1) (4, 1) (5, 1) (6, 1)
(1, 2) (2, 2) (3, 2) (4, 2) (5, 2) (6, 2)
(1, 3) (2, 3) (3, 3) (4, 3) (5, 3) (6, 3)
(1, 4) (2, 4) (3, 4) (4, 4) (5, 4) (6, 4)
(1, 5) (2, 5) (3, 5) (4, 5) (5, 5) (6, 5)
(1, 6) (2, 6) (3, 6) (4, 6) (5, 6) (6, 6)
We could write the possible outcomes by P (X = 2) = 1/36, P (X = 3) = 2/36, and

so on, or, more succinctly, in a table depicting the probability distribution of the
random variable X as follows.
X=x 2 3 4 5 6 7 8 9 10 11 12
Probability 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
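The probability distribution in Example 15.1 can be built by counting how many of the 36 ordered pairs give each total. This is a minimal sketch with illustrative names; note that `Fraction` reduces automatically, so 2/36 is displayed as 1/18, 3/36 as 1/12, and so on.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count the number of ordered pairs of scores giving each possible sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Divide each count by 36 to obtain P(X = x) for x = 2, 3, ..., 12.
distribution = {x: Fraction(n, 36) for x, n in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)

# The probabilities of a distribution must sum to 1.
total = sum(distribution.values())
print(total)  # 1
```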
Example 15.2 We describe the sample space when two fair coins are tossed, and
the associated random variable, X, which counts the number of tails. The sample
space is:
S = {HH, HT, T H, T T }.
X takes the form:
Number of tails 0 1 2
Probability 1/4 1/2 1/4
So, X is a random variable taking values 0, 1 and 2 with probabilities 1/4, 1/2 and
1/4, respectively.
15.2 Discrete random variables

Since different types of random variables will be analysed differently, it is necessary to
distinguish between discrete and continuous random variables. A random variable is
discrete if its set of possible values consists of isolated points on the number line.
(Often, but not necessarily, these will be non-negative integers.) The number of such
points may be finite or infinite.
Examples 15.1 and 15.2 (the dice and the coins) are both discrete (and finite). Note
that the sum of all the probabilities must be 1, since one of the possible outcomes must
occur. This is true for the dice, true for the coins, and true in general!
Other examples of discrete random variables include:
the number of defective items in a batch of 100 items
the number of people arriving at a store in a five-minute period
the number of crimes recorded daily at a police station.
The first of these is clearly finite. The others are finite in practice – but there is no
theoretical upper limit, so it may be convenient to specify an infinite number of
outcomes.
We can describe a discrete distribution (i.e. the random variable and its associated
probabilities) with a histogram or bar chart (rarely), a table as for the dice and coins
above (sometimes) or with a rule or formula (most common).
A useful discrete distribution which we shall discuss later is the binomial
distribution.
15.3 Continuous random variables

A random variable is continuous if it takes on values at every point over a (continuous)
interval. Loosely, things are measured rather than counted. Examples include:
the volume of petrol in a storage tank
the time between customer arrivals at a counter
the heights of a group of individuals
the noise in decibels at a nightclub.
A continuous distribution would usually be described by a formula (statisticians use the

term ‘probability density function’), although we might sometimes be able to use a
graph. A complete study of continuous distributions requires a knowledge of calculus.
We will not consider continuous probability distributions in this course except for the
normal distribution, which will be covered in depth later on.
15.4 Expectation of a discrete random variable

We now concentrate on discrete random variables. Recall, from Example 15.2, the
distribution of the number of tails, X, when two fair coins are tossed:
Number of tails 0 1 2
Probability 1/4 1/2 1/4
Suppose the experiment is repeated a large number of times, say n = 4,000,000. We

would expect approximately 1 million zeros, 2 million ones and 1 million twos. If we
observed this, the mean value of X would be:
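$$\begin{aligned}
\frac{\text{sum of measurements}}{n} &= \frac{(1{,}000{,}000 \times 0) + (2{,}000{,}000 \times 1) + (1{,}000{,}000 \times 2)}{4{,}000{,}000} \\
&= \frac{0 \times 1{,}000{,}000}{4{,}000{,}000} + \frac{1 \times 2{,}000{,}000}{4{,}000{,}000} + \frac{2 \times 1{,}000{,}000}{4{,}000{,}000} \\
&= 0 \times \frac{1}{4} + 1 \times \frac{1}{2} + 2 \times \frac{1}{4} \\
&= 1.
\end{aligned}$$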
In the penultimate line, the first term is 0 × P (X = 0), the second term is
1 × P (X = 1) and the third term is 2 × P (X = 2).
Therefore, the mean value of X is:
$$\sum_{x=0}^{2} x\, P(X = x).$$
Of course, this is no accident.
Expectation of a discrete random variable
We define the mean, or expected value, or ‘expectation’ of a discrete random
variable X to be:
$$E(X) = \sum_x x\, P(X = x) \tag{15.1}$$
where we sum over all the x values which are taken by the random variable X.
We often write:
E(X) = µ
the same symbol used for the (population) arithmetic mean. We can think of E(X) as
the long-run average when the experiment is carried out a large number of times.
Example 15.3 Suppose I buy a lottery ticket for £1. I can win £500 with a
probability of 0.001 or £100 with a probability of 0.003. What is my expected profit?
We begin by defining the random variable X to be my profit (in pounds). Its
distribution is:
X=x −1 99 499
P (X = x) 0.996 0.003 0.001
Using (15.1), we get:
E(X) = (−1 × 0.996) + (99 × 0.003) + (499 × 0.001) = −0.2.
So I expect to make a loss of £0.20 (which will go to funding the prize money or,
possibly, charity).
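The calculation in (15.1) is easy to check numerically. The following is a minimal Python sketch, not part of the course material; the function name `expectation` and the dictionary layout (value mapped to probability) are illustrative choices.

```python
def expectation(dist):
    """Return E(X) = sum of x * P(X = x) for a dict mapping values to probabilities."""
    return sum(x * p for x, p in dist.items())


# Profit distribution (in pounds) from Example 15.3.
lottery_profit = {-1: 0.996, 99: 0.003, 499: 0.001}

# The probabilities of a valid distribution must sum to 1.
assert abs(sum(lottery_profit.values()) - 1) < 1e-12

expected_profit = expectation(lottery_profit)
print(round(expected_profit, 2))  # -0.2, i.e. an expected loss of 20p
```

The same function applied to the two-coin distribution, `{0: 0.25, 1: 0.5, 2: 0.25}`, returns 1, in agreement with the long-run average computed earlier.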
15.5 Functions of a random variable

We have seen that a random variable may be specified by the set of values it takes
together with the associated probabilities of each value.
For example, suppose a fair (unbiased) die is rolled. Let X denote the score obtained.
We can represent the outcomes using the table:
X=x 1 2 3 4 5 6
P (X = x) 1/6 1/6 1/6 1/6 1/6 1/6
Suppose we were interested in the random variables $X_1 = 1/X$, $X_2 = X^2$ or the random
variable $X_3$ where:
\[
X_3 =
\begin{cases}
0 & \text{for } x = 1, 2 \text{ or } 3 \\
1 & \text{for } x = 4 \text{ or } 5 \\
2 & \text{for } x = 6.
\end{cases}
\]

These take the values derived from the function given, and the associated probabilities
are derived from those of X. Therefore, from the distribution of X we can derive, for
example, the distribution of X1 = 1/X.
X1 = x1        1    1/2  1/3  1/4  1/5  1/6
P(X1 = x1)     1/6  1/6  1/6  1/6  1/6  1/6
Similarly, for X2 = X² we obtain:
X2 = x2        1    4    9    16   25   36
P(X2 = x2)     1/6  1/6  1/6  1/6  1/6  1/6
And, finally, for X3 (as previously defined):
X3 = x3        0    1    2
P(X3 = x3)     1/2  1/3  1/6
Just as we defined:
\[
E(X) = \sum_{x} x \, P(X = x)
\]
we can define the expectation of a function of a discrete random variable.
Expectation of a function of a discrete random variable
For a discrete random variable X, the expectation of a function g(X) of X is:
\[
E(g(X)) = \sum_{x} g(x) \, P(X = x)
\]
where we sum over all the x values which are taken by the random variable X.
Example 15.4 For the random variables X1 , X2 and X3 defined above, we have
the following.
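For X1, its expectation is:
\[
E(X_1) = E\left(\frac{1}{X}\right) = \sum_{x} \frac{1}{x} \, P(X = x)
= \frac{1}{1} \times \frac{1}{6} + \frac{1}{2} \times \frac{1}{6} + \frac{1}{3} \times \frac{1}{6}
+ \frac{1}{4} \times \frac{1}{6} + \frac{1}{5} \times \frac{1}{6} + \frac{1}{6} \times \frac{1}{6}
= \frac{49}{120}.
\]
For X2, its expectation is:
\[
E(X_2) = E(X^2) = \sum_{x} x^2 \, P(X = x)
= 1^2 \times \frac{1}{6} + 2^2 \times \frac{1}{6} + 3^2 \times \frac{1}{6}
+ 4^2 \times \frac{1}{6} + 5^2 \times \frac{1}{6} + 6^2 \times \frac{1}{6}
= \frac{91}{6}.
\]
For X3, its expectation is:
\[
E(X_3) = 0 \times \frac{1}{2} + 1 \times \frac{1}{3} + 2 \times \frac{1}{6} = \frac{2}{3}.
\]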
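The three expectations in Example 15.4 can be reproduced exactly with rational arithmetic. This is an illustrative Python sketch only; the helper name `expectation_of` and the use of `fractions.Fraction` are choices made here, not part of the guide.

```python
from fractions import Fraction


def expectation_of(g, dist):
    """Return E(g(X)) = sum of g(x) * P(X = x) over the values x of X."""
    return sum(g(x) * p for x, p in dist.items())


# Score on a fair die: each value 1, ..., 6 has probability 1/6.
die = {x: Fraction(1, 6) for x in range(1, 7)}

e_x1 = expectation_of(lambda x: Fraction(1, x), die)  # X1 = 1/X
e_x2 = expectation_of(lambda x: x ** 2, die)          # X2 = X squared
# X3 is 0 for x = 1, 2, 3; 1 for x = 4, 5; and 2 for x = 6.
e_x3 = expectation_of(lambda x: 0 if x <= 3 else (1 if x <= 5 else 2), die)

print(e_x1, e_x2, e_x3)  # 49/120 91/6 2/3
```

Working with fractions avoids rounding, so the results can be compared directly with the hand calculations.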
Activity 15.1 Consider the discrete probability distribution:
X=x 1 2 3 4 5
P (X = x) 0.1 0.2 0.3 0.3 0.1
Determine the following expectations:

(a) E(X)
(b) E(2X + 1)
(c) E(X³)
(d) E(1/X).
Is E(2X + 1) = 2 E(X) + 1? Is E(X³) = (E(X))³? Is E(1/X) = 1/E(X)?
An immediate application of expectations of functions of discrete random variables is

the calculation of the variance of a discrete random variable.
15.6 Variance of a discrete random variable

Just as we needed the idea of dispersion (or spread) to describe a set of data, so we
need to define the variance of a random variable. The definition is similar. If X is a
random variable, we define the variance by:
\[
\text{Var}(X) = \sum_{x} (x - \mu)^2 \, P(X = x).
\]
Recall that E(X) = µ, and the summation is taken over all the x values which are taken
by the random variable X. We often write:
\[
\text{Var}(X) = \sigma^2
\]
and just as we could find a sample variance in two ways, we can rewrite this as:
\[
\text{Var}(X) = \sum_{x} x^2 \, P(X = x) - \mu^2. \tag{15.2}
\]
So there are two equivalent versions we can use. The latter is often easier in practice.
The square root of the variance is the standard deviation. We could write (15.2) more
succinctly as follows.
Variance of a discrete random variable
The variance of a discrete random variable X is:
\[
\text{Var}(X) = E(X^2) - (E(X))^2.
\]
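To see why the two versions of the variance are equivalent, it may help to expand the square in the definition. The following is a short sketch of that step, using only E(X) = µ and the fact that the probabilities P(X = x) sum to 1:

\[
\begin{aligned}
\sum_{x} (x - \mu)^2 \, P(X = x)
&= \sum_{x} x^2 \, P(X = x) - 2\mu \sum_{x} x \, P(X = x) + \mu^2 \sum_{x} P(X = x) \\
&= E(X^2) - 2\mu \times \mu + \mu^2 \times 1 \\
&= E(X^2) - \mu^2 = E(X^2) - (E(X))^2.
\end{aligned}
\]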
Example 15.5 For the two coins in Example 15.2, we saw that µ = 1. Therefore,
the variance is:
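\[
\begin{aligned}
\sigma^2 = \sum_{x=0}^{2} (x - \mu)^2 \, P(X = x)
&= (0 - 1)^2 \times \frac{1}{4} + (1 - 1)^2 \times \frac{1}{2} + (2 - 1)^2 \times \frac{1}{4} \\
&= \frac{1}{4} + 0 + \frac{1}{4} \\
&= \frac{1}{2} \quad \text{(first method)}
\end{aligned}
\]
or:
\[
\begin{aligned}
\sigma^2 = \sum_{x=0}^{2} x^2 \, P(X = x) - \mu^2
&= 0^2 \times \frac{1}{4} + 1^2 \times \frac{1}{2} + 2^2 \times \frac{1}{4} - 1^2 \\
&= \left(0 + \frac{1}{2} + 1\right) - 1 \\
&= \frac{1}{2} \quad \text{(second method)}
\end{aligned}
\]
giving a standard deviation of $1/\sqrt{2}$.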
Example 15.6 Suppose, in an attempt to schedule fire crews efficiently, the

supervisor of a fire station has recorded the probability distribution of the number of
emergency calls, X, in a given day, based on historical data.
X=x 0 1 2 3 4
P (X = x) 0.25 0.30 0.25 0.15 0.05
(a) What are the expectation and variance of X?
(b) What is the probability that, on any given day, the number of calls exceeds:
i. µ + 2σ
ii. µ + 3σ?
(a) We have:
E(X) = (0 × 0.25) + (1 × 0.30) + (2 × 0.25) + (3 × 0.15) + (4 × 0.05) = 1.45.
and:
\[
\begin{aligned}
\text{Var}(X) &= (0 - 1.45)^2 \times 0.25 + (1 - 1.45)^2 \times 0.30 + (2 - 1.45)^2 \times 0.25 \\
&\quad + (3 - 1.45)^2 \times 0.15 + (4 - 1.45)^2 \times 0.05 \\
&= 1.3475.
\end{aligned}
\]
(b) We have that $\sigma = \sqrt{\text{Var}(X)} = \sqrt{1.3475} = 1.16$.
i. Hence P (X > µ + 2σ) is:
P (X > 1.45 + 2 × 1.16) = P (X > 3.77) = P (X = 4) = 0.05.
ii. Hence P (X > µ + 3σ) is:
P (X > 1.45 + 3 × 1.16) = P (X > 4.93) = 0.
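The steps of Example 15.6 can be sketched in Python as follows. This is an illustration under the assumption that the distribution is stored as a dictionary; the names `calls` and `prob_exceeds` are hypothetical and chosen only for this sketch.

```python
from math import sqrt

# Distribution of the number of emergency calls per day (Example 15.6).
calls = {0: 0.25, 1: 0.30, 2: 0.25, 3: 0.15, 4: 0.05}

# Mean, and the variance computed by both equivalent methods.
mu = sum(x * p for x, p in calls.items())
var_definition = sum((x - mu) ** 2 * p for x, p in calls.items())
var_shortcut = sum(x ** 2 * p for x, p in calls.items()) - mu ** 2
sigma = sqrt(var_definition)


def prob_exceeds(threshold):
    """Return P(X > threshold) by adding the probabilities of larger values."""
    return sum(p for x, p in calls.items() if x > threshold)


p_two_sigma = prob_exceeds(mu + 2 * sigma)    # only x = 4 exceeds about 3.77
p_three_sigma = prob_exceeds(mu + 3 * sigma)  # no value exceeds about 4.93

print(round(mu, 2), round(var_definition, 4), round(sigma, 2))
print(p_two_sigma, p_three_sigma)
```

Both variance formulae give the same value up to floating-point rounding, which is a useful check on hand calculations.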
It is important to distinguish between a frequency distribution and a probability

distribution. The former uses data – it counts the number of observations satisfying
some criterion, or taking some value, in the dataset. The latter is based on theory, or
some assumed property – it gives the probability of an observation satisfying a criterion
or taking some value, based on theory or assumptions. The two are related, but are not
the same.
Activity 15.2 Find the variance and standard deviation of the discrete probability
distribution in Activity 15.1.
We now explore some special common discrete probability distributions.
15.7 Discrete uniform distribution

One of the simplest distributions is the discrete uniform distribution, where a
discrete random variable X has this distribution if X takes the values 1, 2, 3, . . . , k, each
with a probability of 1/k. For example, for k = 7, we have:
X=x 1 2 3 4 5 6 7
P (X = x) 1/7 1/7 1/7 1/7 1/7 1/7 1/7
What are the mean and variance in the general case? Well, the mean is:
\[
E(X) = \sum_{x=1}^{k} x \, P(X = x)
= 1 \times \frac{1}{k} + 2 \times \frac{1}{k} + \cdots + k \times \frac{1}{k}
= \frac{1}{k} (1 + 2 + \cdots + k).
\]
A useful result from mathematics is that:
\[
1 + 2 + \cdots + k = \frac{k (k + 1)}{2}.
\]
Refer to Activity 10.1 where this is derived. So:
\[
E(X) = \frac{1}{k} \times \frac{k (k + 1)}{2} = \frac{k + 1}{2}.
\]
Therefore, the expectation is the arithmetic mean of the minimum and maximum values.
A similar, slightly more involved calculation shows that:
\[
\text{Var}(X) = \frac{k^2 - 1}{12}.
\]

Activity 15.3 Show that:
\[
\text{Var}(X) = \frac{k^2 - 1}{12}
\]
where X follows a discrete uniform distribution. You may use the fact that:
\[
1^2 + 2^2 + \cdots + k^2 = \frac{k (k + 1) (2k + 1)}{6}.
\]
This distribution is of limited applicability, although it could, for example, be used in a

study of lottery outcomes because each set of numbers is equally likely to be drawn.
That said, it illustrates the principles used when looking at distributions, i.e. we:
define the distribution
find its mean and variance (and any other relevant properties)
consider how it may be applied.
So although this distribution is not very common in applications, it serves as a useful

reference point for more complex distributions.
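The mean and variance formulae for the discrete uniform distribution can be checked numerically for particular values of k. The sketch below is illustrative only (it is a check, not a proof); the function name `uniform_mean_var` is an assumption made here.

```python
from fractions import Fraction


def uniform_mean_var(k):
    """Return (E(X), Var(X)) for X uniform on 1, 2, ..., k, computed from the definitions."""
    p = Fraction(1, k)  # each value has probability 1/k
    mean = sum(x * p for x in range(1, k + 1))
    second_moment = sum(x ** 2 * p for x in range(1, k + 1))
    return mean, second_moment - mean ** 2


# Compare with the closed forms (k + 1)/2 and (k^2 - 1)/12.
for k in (2, 7, 10):
    mean, var = uniform_mean_var(k)
    assert mean == Fraction(k + 1, 2)
    assert var == Fraction(k * k - 1, 12)

print(uniform_mean_var(7))  # mean 4 and variance 4 when k = 7
```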
15.8 Bernoulli distribution

A Bernoulli trial is an experiment with only two possible outcomes. We will number
these outcomes 1 and 0, and refer to them as ‘success’ and ‘failure’, respectively. There
are very many such cases, for example:
agree / disagree
male / female
employed / not employed
owns a car / does not own a car
business goes bankrupt / continues trading
and so on...
The Bernoulli distribution is the probability distribution of the outcome of a single

Bernoulli trial. This is the distribution of a random variable X with the probability
function:
\[
P(X = x) =
\begin{cases}
\pi & \text{for } x = 1 \\
1 - \pi & \text{for } x = 0 \\
0 & \text{otherwise.}
\end{cases}
\]

That is, P (X = 1) = π and P (X = 0) = 1 − P (X = 1) = 1 − π, and no other values are

possible. Such a random variable X has a Bernoulli distribution with (probability)
parameter π. This is often written as:
X ∼ Bernoulli(π).
A parameter, or set of parameters, is a measure which completely describes a

probability distribution.
We note the following results.
Mean and variance of the Bernoulli distribution
If X ∼ Bernoulli(π), then:
E(X) = π
and:
Var(X) = π (1 − π).
Activity 15.4 Derive the mean and variance of the Bernoulli distribution.
15.9 Binomial distribution

Before giving a detailed description of this important distribution, we need a bit of
mathematics. We define the number n! (called ‘n factorial’) to be:
n! = n × (n − 1) × (n − 2) × · · · × 3 × 2 × 1
where n is a positive integer. For example, 5! = 5 × 4 × 3 × 2 × 1 = 120. Similarly,
4! = 4 × 3 × 2 × 1 = 24 and 1! = 1. We define 0! = 1. Next, we define the number of
combinations to be the number of ways of selecting x objects from a set of n objects
when order is unimportant and no objects may be repeated. In general, the number of
combinations is given by:
\[
\binom{n}{x} = \frac{n!}{x! \, (n - x)!}.
\]
By way of illustration, this says that there are:

\[
\binom{5}{2} = \frac{5!}{2! \, 3!} = \frac{120}{2 \times 6} = 10
\]
ways to select two objects from five objects without regard to order. We can easily
check this. If we have a set {1, 2, 3, 4, 5} with 5 objects, the 10 ways to select 2 objects
from these 5 (without replacement) are:
(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5).
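The count of 10 can also be confirmed by computer. The following Python sketch uses the standard library functions `math.factorial`, `math.comb` and `itertools.combinations`; the helper name `n_choose_x` is introduced here purely for illustration.

```python
from itertools import combinations
from math import comb, factorial


def n_choose_x(n, x):
    """Number of combinations, n! / (x! (n - x)!), computed from the factorial formula."""
    return factorial(n) // (factorial(x) * factorial(n - x))


# Enumerate every way of selecting 2 objects from {1, 2, 3, 4, 5} without regard to order.
pairs = list(combinations(range(1, 6), 2))

print(n_choose_x(5, 2), comb(5, 2), len(pairs))  # 10 10 10
print(pairs)
```

The enumeration lists the same ten pairs as above, and the factorial formula agrees with the built-in `comb`.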
Suppose we carry out n Bernoulli trials such that:
in each trial the probability of success is π, i.e. a constant value of π
different trials are statistically independent events.
If X denotes the total number of successes in these n trials then X follows a binomial
distribution with parameters n and π, where n ≥ 1 is a known integer, and 0 ≤ π ≤ 1.
This is often written as:
X ∼ Bin(n, π).
Examples of applications of the binomial distribution include the following.
A coin (biased or unbiased) is tossed n times. We can find the probability of

obtaining x heads, where x = 0, 1, . . . , n.
An operation with a constant probability π of success is carried out n times. We

can find the probability of x successful operations.
A certain type of car battery has a known market share. If we examine n cars, we
can find the probability of finding x batteries of this type.
A known proportion of companies have an ethics code. If we contact n companies,

we can find the probability that x have an ethics code.
Example 15.7 A multiple choice test has four questions, each with four possible
answers. James is taking the test, but has no idea at all about the answers. So he
guesses every answer and, therefore, has a probability of 1/4 of getting any
individual question correct by chance.
Let X denote the number of correct answers in James’ test. This follows a binomial
distribution with n = 4 and π = 0.25, hence:
X ∼ Bin(4, 0.25).
What is the probability that James gets 3 of the 4 answers correct?

Here it is assumed that the guesses are independent, and each has a probability
π = 0.25 of being correct. The probability of any particular sequence of 3 correct
and 1 incorrect answers, such as 1110, is π³(1 − π)¹. However, we do not care about
the order of the correct and incorrect answers, only about the number of correct
(and hence incorrect) answers. So 1101 and 1011, for example, also count as 3
correct answers. Each of these also occurs with probability π³(1 − π)¹.
The total number of sequences with three ones (and hence one zero) is the number
of ways of choosing the locations of the three ones in the sequence of four answers.
This is 4!/(3! 1!) = 4. Therefore, the probability of obtaining three correct answers is:
\[
\binom{4}{3} \pi^3 (1 - \pi)^1 = 4 \times (0.25)^3 \times (0.75)^1 \approx 0.047.
\]
Probability function of the binomial distribution
If X ∼ Bin(n, π), then the probability function of X is:
\[
P(X = x) =
\begin{cases}
\binom{n}{x} \pi^x (1 - \pi)^{n - x} & \text{for } x = 0, 1, \ldots, n \\
0 & \text{otherwise.}
\end{cases}
\]
Example 15.8 Continuing Example 15.7, since X ∼ Bin(4, 0.25), we have:

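\[
\begin{aligned}
P(X = 0) &= \binom{4}{0} \times (0.25)^0 \times (0.75)^4 = 0.316, \\
P(X = 1) &= \binom{4}{1} \times (0.25)^1 \times (0.75)^3 = 0.422, \\
P(X = 2) &= \binom{4}{2} \times (0.25)^2 \times (0.75)^2 = 0.211, \\
P(X = 3) &= \binom{4}{3} \times (0.25)^3 \times (0.75)^1 = 0.047
\end{aligned}
\]
and:
\[
P(X = 4) = \binom{4}{4} \times (0.25)^4 \times (0.75)^0 = 0.004.
\]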
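The five probabilities in Example 15.8 follow directly from the binomial probability function. The following is a small Python sketch of that function, assuming integer x; the name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """Return P(X = x) for X ~ Bin(n, pi), where x is an integer."""
    if not 0 <= x <= n:
        return 0.0  # values outside 0, 1, ..., n are impossible
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)


# James guessing on four questions: X ~ Bin(4, 0.25).
probs = [binomial_pmf(x, 4, 0.25) for x in range(5)]

print([round(p, 3) for p in probs])  # [0.316, 0.422, 0.211, 0.047, 0.004]
print(sum(probs))                    # the probabilities sum to 1
```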
Mean and variance of the binomial distribution
If X ∼ Bin(n, π), then:

E(X) = n π
and:
Var(X) = n π (1 − π).
Example 15.9 Suppose we now have a test with 20 questions where each question
has 4 possible answers and consider again a student who guesses every one of the
answers. Let X denote the number of correct answers by such a student, so that
X ∼ Bin(20, 0.25). The expected number of correct answers is E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a
student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P (X ≥ x) < 0.05, i.e. such that
P (X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, . . . , 20 we get (rounded to 3 decimal places):
X=x 0 1 2 3 4 5 6 7 8 9 10
P (X = x) 0.003 0.021 0.067 0.134 0.190 0.202 0.169 0.112 0.061 0.027 0.010
X=x 11 12 13 14 15 16 17 18 19 20
P (X = x) 0.003 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
We find that P (X < 8) = 0.898 and P (X < 9) = 0.959. Therefore, using the result
$P(A^c) = 1 - P(A)$, we have P (X ≥ 8) = 0.102 > 0.05 and P (X ≥ 9) = 0.041 < 0.05.
So the pass mark should be set at 9.
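The search for the pass mark can be sketched as a simple loop over candidate marks. This Python sketch is illustrative only; the function names `binomial_pmf` and `smallest_pass_mark` are assumptions made here, not part of the guide.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """Return P(X = x) for X ~ Bin(n, pi), where x is an integer from 0 to n."""
    return comb(n, x) * pi ** x * (1 - pi) ** (n - x)


def smallest_pass_mark(n, pi, max_pass_prob):
    """Return the smallest mark x such that P(X >= x) < max_pass_prob."""
    for mark in range(n + 2):
        tail = sum(binomial_pmf(x, n, pi) for x in range(mark, n + 1))
        if tail < max_pass_prob:
            return mark


# A student guessing on 20 questions with 4 options each: X ~ Bin(20, 0.25).
pass_mark = smallest_pass_mark(20, 0.25, 0.05)
print(pass_mark)  # 9
```

The loop always terminates, because for a mark of n + 1 the tail probability is zero.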
More generally, consider a student who has the same probability π of getting the
correct answer for each question, so that X ∼ Bin(20, π). In Figure 15.1, plots of the
probabilities for π = 0.25, 0.5, 0.7 and 0.9 are provided. Note how the shape of the
distribution changes as the parameter π changes. In particular for the ‘extreme’ π,

i.e. π = 0.9, the distribution is heavily skewed. When π = 0.5, i.e. success and failure
are equally likely, the distribution of the number of successes is symmetric.
[Four panels, each plotting probability (vertical axis, 0.00 to 0.30) against the number of
correct answers (horizontal axis, 0 to 20): π = 0.25 with E(X) = 5; π = 0.5 with E(X) = 10;
π = 0.7 with E(X) = 14; π = 0.9 with E(X) = 18.]
Figure 15.1: Binomial distribution probabilities where X ∼ Bin(20, π), for π = 0.25, 0.5,
0.7 and 0.9.
Figure 15.1 illustrates how different probability distributions may differ from each other
in a broader or narrower sense. In the broader sense, we have different families of
distributions which may have quite different characteristics, for example:
• symmetric vs. skewed
• continuous vs. discrete
• among discrete: finite vs. infinite number of possible values
• among continuous: different sets of possible values (for example, all real numbers x,
or x > 0).
The distributions discussed in this unit are really families of distributions in this sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it etc. In statistical analysis of a

random variable X we typically:
• select a family of distributions based on the basic characteristics of X
• use observed data to estimate values of the parameters of that distribution, and
perform statistical inference.
Activity 15.5 Of a large number of mass-produced articles, 5% are defective. Find

the probability that a random sample of 25 will contain:
(a) no defectives
(b) exactly one defective
(c) at least two defectives.
15.10 Poisson distribution

The possible values of the Poisson distribution are the non-negative integers
0, 1, 2, . . . .
Probability function of the Poisson distribution
The probability function of the Poisson distribution is:
\[
P(X = x) =
\begin{cases}
e^{-\lambda} \lambda^x / x! & \text{for } x = 0, 1, 2, \ldots \\
0 & \text{otherwise}
\end{cases}
\]
where λ > 0 is a parameter.
If a random variable X has a Poisson distribution with parameter λ, this is often

denoted by:
X ∼ Poisson(λ).
Mean and variance of the Poisson distribution
If X ∼ Poisson(λ), then:
E(X) = λ (15.3)
and:
Var(X) = λ.
Poisson distributions are used for counts of occurrences of various kinds. To give a
formal motivation, suppose that we consider the number of occurrences of some
phenomenon in time, and that the process which generates the occurrences satisfies the
following conditions.
1. The numbers of occurrences in any two mutually exclusive intervals of time are
independent of each other.
2. The probability of two or more occurrences at the same time is negligibly small.
3. The probability of one occurrence in any short time interval of length t is λt for
some constant λ > 0.
In essence, these state that individual occurrences should be independent, sufficiently

rare, and happen at a constant rate λ per unit of time. A process like this is a Poisson
process.
If occurrences are generated by a Poisson process, then the number X of occurrences in
a randomly-selected time interval of length t = 1 follows a Poisson distribution with
mean λ, i.e. X ∼ Poisson(λ).
The single parameter λ of the Poisson distribution is the rate of occurrences per unit of
time. Examples of variables for which we might use a Poisson distribution are the:
number of telephone calls received at a call centre per minute

number of accidents on a stretch of motorway per week
number of customers arriving at a checkout per minute
number of misprints per page of newsprint.
Because λ is the rate per unit of time, its value also depends on the unit of time (length
of interval) we consider. For example, if X is the number of arrivals in an hour and
X ∼ Poisson(1.5), then if Y is the number of arrivals in two hours we must have that
Y ∼ Poisson(2 × 1.5) = Poisson(3).
λ is also the mean, E(X), of the distribution as we saw in (15.3).
Both motivations suggest that distributions with higher values of λ have higher
probabilities of large values of X. For example, Figure 15.2 plots the probabilities
P (X = x) for x = 0, 1, 2, . . . , 10 for Poisson(2) and Poisson(4).
Example 15.10 Customers arrive at a bank on weekday afternoons randomly at

an average rate of 1.6 customers per minute. Let X denote the number of arrivals in
a minute and let Y denote the number of arrivals in five minutes.
We assume a Poisson distribution for both, such that:
X ∼ Poisson(1.6) and Y ∼ Poisson(5 × 1.6) = Poisson(8).
(a) What is the probability that no customer arrives in a minute?

(b) What is the probability that more than two customers arrive in a minute?
(c) What is the probability that no more than one customer arrives in five minutes?
(a) For X ∼ Poisson(1.6), the probability is:

\[
P(X = 0) = \frac{e^{-\lambda} \lambda^0}{0!} = \frac{e^{-1.6} (1.6)^0}{0!} = e^{-1.6} = 0.2019.
\]
[Plot of the probabilities p(x) (vertical axis, 0.00 to 0.25) against x = 0, 1, . . . , 10,
with one series for λ = 2 and one for λ = 4.]
Figure 15.2: Poisson distribution probabilities where X ∼ Poisson(λ) for λ = 2 and 4.
(b) We have:
\[
\begin{aligned}
P(X > 2) &= 1 - P(X \le 2) = 1 - (P(X = 0) + P(X = 1) + P(X = 2)) \\
&= 1 - P(X = 0) - P(X = 1) - P(X = 2) \\
&= 1 - \frac{e^{-1.6} (1.6)^0}{0!} - \frac{e^{-1.6} (1.6)^1}{1!} - \frac{e^{-1.6} (1.6)^2}{2!} \\
&= 1 - 0.2019 - 0.3230 - 0.2584 \\
&= 0.2167.
\end{aligned}
\]
(c) For Y ∼ Poisson(8), the probability is:
\[
P(Y \le 1) = P(Y = 0) + P(Y = 1) = \frac{e^{-8} \, 8^0}{0!} + \frac{e^{-8} \, 8^1}{1!}
= 0.000335 + 0.002684 = 0.003019.
\]
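The three parts of Example 15.10 can be reproduced with a few lines of Python. This is a sketch for checking purposes only; the function name `poisson_pmf` is an illustrative choice. Because it does not round the intermediate probabilities, the answer to (b) may differ from the hand calculation in the fourth decimal place.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Return P(X = x) = e^(-lam) * lam^x / x! for X ~ Poisson(lam), x = 0, 1, 2, ..."""
    return exp(-lam) * lam ** x / factorial(x)


# (a) No arrivals in one minute, X ~ Poisson(1.6).
p_a = poisson_pmf(0, 1.6)

# (b) More than two arrivals in one minute, using the complement.
p_b = 1 - sum(poisson_pmf(x, 1.6) for x in range(3))

# (c) No more than one arrival in five minutes, Y ~ Poisson(5 * 1.6) = Poisson(8).
p_c = poisson_pmf(0, 8) + poisson_pmf(1, 8)

print(round(p_a, 4), round(p_b, 4), round(p_c, 6))
```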
Activity 15.6 Hits on a website arrive at the rate of 12 per hour. Briefly discuss
whether or not you believe the assumptions underlying the Poisson distribution
hold. Assuming the assumptions are valid, calculate the probabilities that:
(a) there are exactly three hits between 10:00 and 10:30
(b) there is exactly one hit between 14:00 and 14:05
(c) there are more than two hits between 16:40 and 16:45.
15.11 A word on calculators

In the examination you will be allowed a basic calculator only. To calculate binomial
and Poisson probabilities directly requires access to a ‘factorial’ key (for the binomial)
and ‘e’ key (for the Poisson), which will not appear on a basic calculator. Note that any
probability calculations which are required in the examination will be possible on a
basic calculator. For example, if a Poisson probability required the numerical value of
$e^{-3}$, then this would be provided in the examination question.
15.12 Summary
This unit has introduced the concept of a random variable and explained how there are
two types of random variable – discrete and continuous. Focusing on discrete random
variables, probability distributions were constructed to represent how likely the different
possible outcomes of a chance experiment were to occur. Important theoretical
properties of these probability distributions were also discussed, specifically the
expected value and variance.

Bernoulli distribution Bernoulli trial
Binomial distribution Combination
Continuous Discrete
Expected value Factorial
Normal distribution Parameter
Poisson distribution Poisson process
Probability distribution Random variable
Sample space Uniform distribution
Learning outcomes
appreciate the concepts of a random variable and a probability distribution
calculate the expected value and variance for discrete random variables
recognise some common discrete probability distributions
Exercises
Exercise 15.1
The probability function P (X = x) = 0.02x is defined for x = 8, 9, 10, 11 and 12. What
are the mean and variance of this probability distribution?
Exercise 15.2
Of all the candles produced by a company, 0.01% do not have wicks (the core piece of
string). A retailer buys 10,000 candles from the company.
(a) What is the probability that all the candles have wicks?
(b) What is the probability that at least one candle will not have a wick?
Exercise 15.3
If a large grass lawn contains an average of 1 weed per 600 cm², what will be the
distribution of X, the number of weeds in an area of 400 cm²? Hence find P (X ≤ 1).
Exercise 15.4
A graduate applies for 10 jobs. She believes she has a constant probability 0.1 of
receiving a job offer in each case. Assume independence of job offers.
(a) Write down the distribution of the total number of job offers received. What are
the mean and variance of the distribution?
(b) What is the probability of at least one job offer?
(c) The graduate is considering using the Poisson distribution to simplify the
calculation in (b). What advice would you give her?
(d) Discuss briefly whether you think the assumption of independence is realistic in
this context.
Exercise 15.5
The random variable X has a binomial distribution such that X ∼ Bin(4, 0.3). It has
the following probability distribution.
X=x 0 1 2 3 4
P (X = x) 0.2401 0.4116 0.2646 a b
(a) Find a and b.
(b) Suppose Y = (X − 3)². Find E(Y ).
Exercise 15.6
In a prize draw, the probabilities of winning various amounts of money are:
Prize (in £) 0 1 50 100 500

Probability of win 0.35 0.50 0.11 0.03 0.01
What is the expected value and standard deviation of the prize?
Exercise 15.7
For a random variable X, the formula:

\[
\binom{14}{6} \times (0.3)^6 \times (0.7)^8
\]
was used to compute P (X = 6). What is the standard deviation of this probability
distribution?
Exercise 15.8
Explain briefly when it would be appropriate to use a:
(a) uniform distribution
(b) binomial distribution
(c) Poisson distribution.
Unit 16: Probability III – The normal distribution and sampling distributions
Overview
The normal distribution is introduced and probabilities calculated for this distribution
(which requires a transformation to the standard normal distribution). We then proceed
to consider the estimation of a population mean through the use of sampling. This gives
rise to a sampling distribution and its properties are discussed. We conclude the
probability section of the course with the powerful result known as the ‘central limit
theorem’.
Aims
This unit explores the normal distribution and how it relates to sampling distributions
and the central limit theorem. Particular aims are:
to work with the normal distribution
to understand the concept of a sampling distribution
to appreciate the usefulness of the central limit theorem.
Background reading
16.1 The normal distribution

The normal distribution is by far the most important probability distribution in
statistics. This is for three broad reasons.
Many variables have distributions which are approximately normal – for example,
weights of humans, animals and various products.
The normal distribution has extremely convenient mathematical properties which

make it a useful default choice of distribution in many contexts.
Even when a variable is not itself even approximately normally distributed,

functions of several observations of the variable (‘sampling distributions’) are often
approximately normal due to the central limit theorem (CLT). Because of this,
the normal distribution has a crucial role in statistical inference. This will be
discussed later.
The probability density function (i.e. the formula for the distribution curve) of the
normal distribution (which you do not need to remember!) is:

\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} (x - \mu)^2\right)
\quad \text{for } -\infty < x < \infty \tag{16.1}
\]
where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ² are
parameters, with −∞ < µ < ∞ and σ² > 0.
A random variable X with this probability density function is said to have a normal
distribution with mean µ and variance σ², denoted X ∼ N (µ, σ²).
Mean and variance of the normal distribution
If X ∼ N (µ, σ²), then:
E(X) = µ
and:
Var(X) = σ²
hence the standard deviation is σ.
The normal distribution is the so-called ‘bell curve’. The two parameters affect it as
follows.
The mean, µ, determines the location of the curve.
The variance, σ², determines the dispersion (spread) of the curve.
For example, in Figure 16.1, N (0, 1) and N (5, 1) have the same dispersion but different
locations – the N (5, 1) curve is identical to the N (0, 1) curve, but shifted 5 units to the
right, while N (0, 1) and N (0, 9) have the same location but different dispersions – the
N (0, 9) curve is centred at the same value as the N (0, 1) curve, but spread out more
widely.
The mean can also be inferred from the observation that the normal distribution is
symmetric about µ. This also implies that the median of the normal distribution is also
µ, and we also note that since the distribution reaches a maximum at µ, then the mean
and median are also equal to the mode.
Probabilities are given by areas under the curve, which involves integrating equation
(16.1). Unfortunately, such integrals cannot be evaluated in closed-form, so instead we
make use of statistical tables.1 Specifically, we note the special transformation:
\[
Z = \frac{X - \mu}{\sigma}.
\]
The transformed variable Z is known as a standardised variable, or z-score. It can be
shown (but is beyond the scope of this course), that the distribution of the z-score is
N (0, 1), i.e. the normal distribution with mean µ = 0 and variance σ² = 1 (and,
therefore, a standard deviation of σ = 1).
1 In practice we could also use a computer, but not in the examination!
[Plot of three normal density curves, N (0, 1), N (5, 1) and N (0, 9), over the range −5 to 10.]
Figure 16.1: Three examples of normal distributions.
Standard normal distribution
If X ∼ N (µ, σ²), then:
\[
Z = \frac{X - \mu}{\sigma} \sim N(0, 1).
\]
The cumulative probability, P (Z ≤ z), is often denoted by Φ(z) and values for various
‘z’ are given in Appendix C.
In the examination, you will have a copy of the table in Appendix C. The table shows
values of Φ(z) = P (Z ≤ z) for z ≥ 0. This can be used to calculate probabilities of any
intervals for any normal distribution – but how? The table seems to be incomplete.
1. It is only for N (0, 1), not for N (µ, σ²) for any other µ and σ².
2. Even for N (0, 1), it only shows probabilities for z ≥ 0.
We now show how these are not really limitations, starting with ‘2.’, i.e. how to work
out cumulative standard normal probabilities for negative z-values.
The key to using the table is that the standard normal distribution is symmetric about
zero. This means that for an interval in one tail, its ‘mirror image’ in the other tail has
the same probability.
Suppose that z ≥ 0, so that −z ≤ 0. The table in Appendix C shows:
P (Z ≤ z) = Φ(z).
From it, we also get the following probabilities:

P (Z > z) = 1 − Φ(z)
P (Z ≤ −z) = Φ(−z) = 1 − Φ(z) = P (Z > z)
P (Z > −z) = 1 − Φ(−z) = Φ(z) = P (Z < z).
In the continuous world, the probability of a single point value is zero. Therefore, since
P (Z = z) = 0 for all z, we are indifferent between using ≤ and <, similarly we are
indifferent between using ≥ and >. So, P (Z ≤ z) = P (Z < z) and also we have that
P (Z ≥ z) = P (Z > z). This is because:
P (Z ≤ z) = P (Z < z) + P (Z = z) = P (Z < z) + 0 = P (Z < z)
P (Z ≥ z) = P (Z > z) + P (Z = z) = P (Z > z) + 0 = P (Z > z).
Figure 16.2 shows equal tail probabilities for the standard normal distribution, i.e. it
shows that P (Z ≤ −z) = P (Z ≥ z).
[Plot of the standard normal density with the points −z, 0 and +z marked on the horizontal axis.]
Figure 16.2: Equal tail probabilities for the standard normal distribution showing that
P (Z ≤ −z) = P (Z ≥ z).
Calculating probabilities for the standard normal distribution
If Z ∼ N (0, 1), for any two numbers z1 < z2 then:
P (z1 < Z ≤ z2 ) = Φ(z2 ) − Φ(z1 )
where Φ(z2 ) and Φ(z1 ) are obtained using the tabulated values in Appendix C.
Reality check : Remember that the standard normal distribution is symmetric about 0,
hence:
Φ(0) = P (Z ≤ 0) = 0.5.
So if you ever end up with results like P (Z ≤ −1) = 0.7 or P (Z ≤ 1) = 0.2 or
P (Z > 2) = 0.95, these must be wrong! Why? Well, P (Z ≤ −1) < P (Z ≤ 0) = 0.5, so
P (Z ≤ −1) cannot be 0.7. Similarly, 0.5 = P (Z ≤ 0) < P (Z ≤ 1), so P (Z ≤ 1) cannot
be 0.2. Finally, P (Z ≥ |z|) ≤ 0.5 for any z, so P (Z > 2) cannot be 0.95.
Example 16.1 If Z ∼ N (0, 1), what is P (Z > 1.20)?

It is useful to draw a quick sketch to visualise the area of probability. Such a sketch
is shown in Figure 16.3. Turning to Appendix C, we look up the z-value of 1.20 by
using the ‘1.2’ row and ‘0.00’ column which shows that:
Φ(1.20) = P (Z ≤ 1.20) = 0.8849.
Therefore, P (Z > 1.20) = 1 − Φ(1.20) = 0.1151.

[Plot titled ‘Standard Normal Density Function’: the density f_Z(z) against z from −3 to 3.]
Figure 16.3: The standard normal density function where P (Z > 1.20) is the area of the
shaded region.
Example 16.2 Turn to Appendix C. Look up the probability in the ‘0.8’ row and
‘0.04’ column of the table, which shows that:
Φ(0.84) = P (Z ≤ 0.84) = 0.7995.
We then also have the following.

P (Z > 0.84) = 1 − Φ(0.84) = 0.2005.
P (Z < −0.84) = 1 − P (Z ≤ 0.84) = 1 − Φ(0.84) = 0.2005. Alternatively,

P (Z < −0.84) = P (Z > 0.84) by symmetry of Z ∼ N (0, 1) about zero.
P (Z ≥ −0.84) = P (Z ≤ 0.84) = Φ(0.84) = 0.7995.
P (−0.84 ≤ Z ≤ 0.84) = P (Z ≤ 0.84) − P (Z < −0.84) = 0.5990.
Example 16.3 If Z ∼ N (0, 1), what is P (−1.24 < Z < 1.86)?

We require the sum of the red and blue areas in Figure 16.4. The red area is given by:
P (0 ≤ Z ≤ 1.86) = P (Z ≤ 1.86) − P (Z ≤ 0) = Φ(1.86) − Φ(0)

= 0.9686 − 0.5
= 0.4686.
The blue area is given by:
P (−1.24 ≤ Z ≤ 0) = P (Z ≤ 0) − P (Z ≤ −1.24) = Φ(0) − Φ(−1.24)

= Φ(0) − (1 − Φ(1.24))
= 0.5 − (1 − 0.8925)
= 0.3925.
Hence P (−1.24 < Z < 1.86) = 0.4686 + 0.3925 = 0.8611.
[Plot titled ‘Standard Normal Density Function’: the density f_Z(z) against z from −3 to 3.]
Figure 16.4: The standard normal density function where the red area is P (0 ≤ Z ≤ 1.86)
and the blue area is P (−1.24 ≤ Z ≤ 0).
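Although only the table in Appendix C is available in the examination, the table look-ups in Examples 16.1 to 16.3 can be checked by computer. The sketch below relies on the standard identity Φ(z) = (1 + erf(z/√2))/2, which is not derived in this guide, and on `math.erf` from the Python standard library; the function name `phi` is an illustrative choice.

```python
from math import erf, sqrt


def phi(z):
    """Cumulative probability P(Z <= z) for Z ~ N(0, 1), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


# Example 16.1: P(Z > 1.20) = 1 - Phi(1.20).
p_upper_tail = 1 - phi(1.20)

# Example 16.2: symmetry about zero means Phi(-z) = 1 - Phi(z).
p_lower_tail = phi(-0.84)

# Example 16.3: P(-1.24 < Z < 1.86) = Phi(1.86) - Phi(-1.24).
p_interval = phi(1.86) - phi(-1.24)

print(round(p_upper_tail, 4), round(p_lower_tail, 4), round(p_interval, 4))
```

Unlike the table, `phi` accepts negative arguments directly, so the symmetry step is not needed in code; it is included above only to mirror the hand calculation.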
Activity 16.1 Let Z ∼ N (0, 1).

(a) Calculate P (0 < Z < 1.2)
(b) Calculate P (−0.68 < Z < 0)
(c) Calculate P (−0.46 < Z < 2.21)
(d) Calculate P (0.81 < Z < 1.94).
(e) Find a value for z such that P (−z < Z < z) = 0.80.
16.1.1 Probabilities for any normal distribution

How about a normal distribution X ∼ N (µ, σ²), for any other µ and σ²? What if we
want to calculate, for any a < b, P (a ≤ X ≤ b)?
Remember that (X − µ)/σ = Z ∼ N (0, 1). If we apply this transformation to all parts
of the inequalities we get:

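\[
\begin{aligned}
P(a \le X \le b)
&= P\left(\frac{a - \mu}{\sigma} \le \frac{X - \mu}{\sigma} \le \frac{b - \mu}{\sigma}\right) \\
&= P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right) \\
&= \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right)
\end{aligned}
\]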
which can be calculated using the table in Appendix C.
Note that this also covers the cases of the one-sided inequalities P (X ≤ b), with
a = −∞, and P (X ≥ a), with b = ∞. For a = −∞, then:
\[
P(X \le b) = \Phi\left(\frac{b - \mu}{\sigma}\right)
\]
because Φ(−∞) = 0. For b = ∞, then:
\[
P(X \ge a) = 1 - \Phi\left(\frac{a - \mu}{\sigma}\right)
\]
because Φ(∞) = 1.
Example 16.4 Let X denote the diastolic blood pressure of a randomly selected
person in England. This is approximately distributed as X ∼ N (74.2, 127.87). Note
that diastolic blood pressure can only be approximately normal, rather than exactly
normal, because normal random variables can take negative values and, clearly,
diastolic blood pressure cannot be negative. However, for practical purposes, we can
use the normal distribution to model diastolic blood pressure.
Suppose we want to know the probabilities of the following intervals:
X > 90 (high blood pressure)
X < 60 (low blood pressure)
60 ≤ X ≤ 90 (ordinary (mid) blood pressure).
These are calculated using the previous results, with µ = 74.2 and σ² = 127.87, and
hence σ = 11.31. So here:
\[ \frac{X - 74.2}{11.31} = Z \sim N(0, 1) \]
and we can refer values of this standardised variable to the table in Appendix C. We
have:
\[ P(X > 90) = P\left(\frac{X - 74.2}{11.31} > \frac{90 - 74.2}{11.31}\right) = P(Z > 1.40) = 1 - \Phi(1.40) \]
This we have in the table in Appendix C, from which we obtain:
1 − Φ(1.40) = 1 − 0.9192 = 0.0808.
Also:
\[ P(X < 60) = P\left(\frac{X - 74.2}{11.31} < \frac{60 - 74.2}{11.31}\right) = P(Z < -1.26) = P(Z > 1.26) = 1 - \Phi(1.26) \]
which is 1 − 0.8962 = 0.1038, according to the table. Finally:
P (60 ≤ X ≤ 90) = P (X ≤ 90) − P (X < 60) = (1 − P (X > 90)) − P (X < 60)

= (1 − 0.0808) − 0.1038
= 0.8154.
These (rounded) probabilities are shown in Figure 16.5.

[Figure 16.5 plot: normal density of diastolic blood pressure (horizontal axis from 40 to 120), with the three areas labelled Low: 0.10, Mid: 0.82 and High: 0.08.]
Figure 16.5: Probabilities for Example 16.4 regarding diastolic blood pressure.
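The three probabilities in Example 16.4 can also be computed without standardising by hand. The sketch below is illustrative only and assumes Python's standard-library `statistics.NormalDist`; the variable names are our own. Because it does not round the z-values to two decimal places first, its answers may differ slightly from the table-based values in the third decimal place.

```python
from statistics import NormalDist

mu = 74.2
sigma = 127.87 ** 0.5                    # standard deviation, about 11.31
blood_pressure = NormalDist(mu, sigma)   # X ~ N(74.2, 127.87)

p_high = 1 - blood_pressure.cdf(90)                      # P(X > 90)
p_low = blood_pressure.cdf(60)                           # P(X < 60)
p_mid = blood_pressure.cdf(90) - blood_pressure.cdf(60)  # P(60 <= X <= 90)

print(f"high = {p_high:.4f}, low = {p_low:.4f}, mid = {p_mid:.4f}")
```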
Activity 16.2 The scores on a verbal reasoning test are modelled by a normal
distribution with a mean of µ = 100 and a standard deviation of σ = 10.
(a) What proportion of the scores will be greater than 95?
(b) What proportion of the scores will be less than 110?
(c) What is the probability of an individual selected at random having a score less
than 70?
(d) What are the quartiles of the distribution?
(e) What is the range of scores such that 5% of the scores are below the range and
5% of the scores are above the range?
16.1.2 Some probabilities around the mean

The following results hold for all normal distributions.
P(µ − σ < X < µ + σ) = 0.683. In words, 68.3% of the total probability is within 1
standard deviation of the mean.
P(µ − 1.96 × σ < X < µ + 1.96 × σ) = 0.950. In words, 95% of the total probability
is within 1.96 standard deviations of the mean.
P(µ − 2 × σ < X < µ + 2 × σ) = 0.954. In words, 95.4% of the total probability is
within 2 standard deviations of the mean.
P(µ − 2.58 × σ < X < µ + 2.58 × σ) = 0.99. In words, 99% of the total probability
is within 2.58 standard deviations of the mean.
P(µ − 3 × σ < X < µ + 3 × σ) = 0.997. In words, 99.7% of the total probability is
within 3 standard deviations of the mean.
The first two of these are illustrated graphically in Figure 16.6.
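[Figure 16.6 plot: normal density with µ − 1.96σ, µ − σ, µ, µ + σ and µ + 1.96σ marked on the horizontal axis; the area between µ − σ and µ + σ is labelled 0.683 and the area between µ − 1.96σ and µ + 1.96σ is labelled 0.95.]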
Figure 16.6: Some probabilities around the mean. The shaded area shows that 68.3% of
the total probability is within 1 standard deviation of the mean. The shaded and hatched
areas combined show that 95% of the total probability is within 1.96 standard deviations
of the mean.
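The probabilities around the mean quoted above can be reproduced with a short calculation. This is a sketch under our own assumptions: the helper `prob_within` is hypothetical, and the values µ = 100 and σ = 10 are arbitrary, chosen only to show that the result does not depend on µ or σ.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma < X < mu + k*sigma) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 1.96, 2, 2.58, 3):
    standard = prob_within(k)                    # standard normal
    shifted = prob_within(k, mu=100, sigma=10)   # arbitrary illustrative values
    print(k, round(standard, 3), round(shifted, 3))
```

Both columns should agree for every k, since standardising removes any dependence on µ and σ.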
16.2 Sampling distributions

A simple random sample is a sample selected by a process where every possible
sample (of the same size, n) has the same probability of selection.² The selection
process is left to chance which eliminates the effect of selection bias.³ Due to the
random selection mechanism, we do not know (in advance) which sample will occur.
Every population element has a known, non-zero probability of selection in the sample,
but no element is certain to appear.
² Sampling techniques are discussed in greater depth later in the course.
³ Forms of bias are also discussed later in the course.
Consider a population of size N = 6 elements: A, B, C, D, E and F. We consider all
possible simple random samples of size n = 2 (without replacement, i.e. once an object
has been chosen it cannot be selected again). There are 15 different, but equally likely,
such samples, so each sample has the same probability of selection, i.e. 1/15. The
possible samples are:
AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF.
16.2.1 Estimation
A population has particular characteristics of interest such as the mean, µ, and
variance, σ². Collectively we refer to these characteristics as parameters. If we do not
have population data, the parameter values will be unknown.
‘Statistical inference’ is the process of estimating the (unknown) parameter values using
the (known) sample data. We use a statistic (called an ‘estimator’) calculated from
sample observations to provide a ‘point estimate’ of a parameter.
Returning to our example, recall there are 15 different samples of size 2 from a
population of size 6. Suppose the variable of interest is monthly income, such that:
Individual A B C D E F
Income (in £000s) 3 6 4 9 7 7
Estimator of the population mean
We use the sample mean, X̄, as our estimator of the population mean, µ, where for
a random sample of size n we define:
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}. \]
For example, if the observed sample was AB, the sample mean is (3000 + 6000)/2 =
£4,500.
Clearly, different observed random samples will lead to different sample means.
Consider the values of X̄, i.e. x̄, for all possible samples (in £000s):
Sample Values x̄ Sample Values x̄

AB 3 6 4.5 BF 6 7 6.5
AC 3 4 3.5 CD 4 9 6.5
AD 3 9 6 CE 4 7 5.5
AE 3 7 5 CF 4 7 5.5
AF 3 7 5 DE 9 7 8
BC 6 4 5 DF 9 7 8
BD 6 9 7.5 EF 7 7 7
BE 6 7 6.5
So the values of X̄ vary from 3.5 to 8, depending on the sample values. Since we have
the population data here we can actually compute the population mean, µ (in £000s),
which is:
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{3 + 6 + 4 + 9 + 7 + 7}{6} = 6. \]
So, even with simple random sampling, we obtain some x̄ values far from µ. Here, in
fact, only one sample (AD) results in x̄ = µ.
Let us now consider the maximum possible absolute deviations of the sample mean
from the population mean, i.e. the distance |x̄ − µ|.
max |x̄ − µ| Range of x̄ Number of samples Probability

0 x̄ = 6 1 0.067
0.5 5.5 ≤ x̄ ≤ 6.5 6 0.400
1 5 ≤ x̄ ≤ 7 10 0.667
1.5 4.5 ≤ x̄ ≤ 7.5 12 0.800
2 4 ≤ x̄ ≤ 8 14 0.933
2.5 3.5 ≤ x̄ ≤ 8.5 15 1.000
So, for example, there is an 80% probability of being within 1.5 units of µ (in £000s).
We now represent this as a frequency distribution. That is, we record the frequency
of each possible value of x̄.
x̄ Frequency Relative frequency

3.5 1 1/15 = 0.067
4.5 1 1/15 = 0.067
5.0 3 3/15 = 0.200
5.5 2 2/15 = 0.133
6.0 1 1/15 = 0.067
6.5 3 3/15 = 0.200
7.0 1 1/15 = 0.067
7.5 1 1/15 = 0.067
8.0 2 2/15 = 0.133
This is known as the sampling distribution of X̄. The sampling distribution is a

central and vital concept in statistics. It can be used to evaluate how ‘good’ an
estimator is. Specifically, we care about how ‘close’ the estimator is to the population
parameter of interest.
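The sampling distribution above can be built by brute force, since the population is small enough to list every sample. The following is a minimal sketch using the income data from the example; the variable names are our own, and exact fractions are used so that no rounding enters the frequencies.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Monthly incomes (in £000s) for individuals A to F, as in the example.
incomes = {"A": 3, "B": 6, "C": 4, "D": 9, "E": 7, "F": 7}

# All simple random samples of size 2, drawn without replacement.
samples = list(combinations(incomes, 2))

# Sample mean of each sample, then the frequency of each distinct mean.
means = [Fraction(incomes[a] + incomes[b], 2) for a, b in samples]
freq = Counter(means)

for xbar in sorted(freq):
    print(float(xbar), freq[xbar], Fraction(freq[xbar], len(samples)))
```

Each printed row gives a possible value of x̄, its frequency and its relative frequency, which can be compared with the table above.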
As we have seen, different samples yield different sample mean values, as a consequence
of the random sampling procedure. Hence estimators (of which X̄ is an example) are
random variables. So X̄ is our estimator of µ, and the observed value of X̄, denoted x̄,
is a point estimate.
As with any distribution, we care about a sampling distribution’s mean and variance.
Together, these allow us to assess how ‘good’ an estimator is. First, consider the mean – we seek
an estimator which does not mislead us systematically. So the ‘average’ (mean) value of
an estimator, over all possible samples, should be equal to the population parameter
itself.
Returning to our example:
x̄ Frequency Product x̄ Frequency Product

3.5 1 3.5 6.5 3 19.5
4.5 1 4.5 7.0 1 7.0
5.0 3 15.0 7.5 1 7.5
5.5 2 11.0 8.0 2 16.0
6.0 1 6.0
Hence the mean of this sampling distribution is:

\[ \frac{3.5 + 4.5 + 15.0 + 11.0 + 6.0 + 19.5 + 7.0 + 7.5 + 16.0}{15} = 6 = \mu. \]
An important difference between a sampling distribution and other distributions is that
the values in a sampling distribution are summary measures of whole samples (i.e.
statistics or estimators) rather than individual observations. Formally, the mean of a
sampling distribution is called the expected value of the estimator, denoted by E(·).
Hence the expected value of the sample mean is E(X̄).
An unbiased estimator has its expected value equal to the parameter being
estimated. For our example, E(X̄) = 6 = µ.
Fortunately the sample mean X̄ is always an unbiased estimator of µ in simple random
sampling, regardless of:
the sample size, n
the distribution of the (parent) population.
This is a good illustration of a population parameter (µ) being estimated by its sample
counterpart (X̄).
The unbiasedness of an estimator is clearly desirable. However, we also need to take into
account the dispersion of the estimator’s sampling distribution. Ideally, the possible
values of the estimator should not vary much around the true parameter value. So we
seek an estimator with a small variance. Recall the variance is defined to be the mean
of the squared deviations about the mean of the distribution. In the case of sampling
distributions, it is referred to as the sampling variance.
Returning to our example:
x̄ x̄ − µ (x̄ − µ)2 Frequency Product

3.5 −2.5 6.25 1 6.25
4.5 −1.5 2.25 1 2.25
5.0 −1.0 1.00 3 3.00
5.5 −0.5 0.25 2 0.50
6.0 0.0 0.00 1 0.00
6.5 0.5 0.25 3 0.75
7.0 1.0 1.00 1 1.00
7.5 1.5 2.25 1 2.25
8.0 2.0 4.00 2 8.00
Total 15 24.00
Hence the sampling variance is 24/15 = 1.6.

The population itself has a variance – the population variance, σ².
x x−µ (x − µ)2 Frequency Product

3 −3 9 1 9
6 0 0 1 0
4 −2 4 1 4
9 3 9 1 9
7 1 1 2 2
Total 6 24
Hence the population variance is σ² = 24/6 = 4.

We now consider the relationship between σ² and the sampling variance. Intuitively, a
larger σ² should lead to a larger sampling variance. For population size N and sample
size n, we note the following result when sampling without replacement:
\[ \mathrm{Var}(\bar{X}) = \frac{N - n}{N - 1} \times \frac{\sigma^2}{n}. \]
So, for our example, we get:
\[ \mathrm{Var}(\bar{X}) = \frac{6 - 2}{6 - 1} \times \frac{4}{2} = 1.6 \]
as we saw above.
We use the term standard error to refer to the standard deviation of the sampling
distribution, so:
\[ \mathrm{S.E.}(\bar{X}) = \sqrt{\mathrm{Var}(\bar{X})} = \sqrt{\frac{N - n}{N - 1} \times \frac{\sigma^2}{n}} = \sigma_{\bar{X}}. \]
Some implications are the following.
As the sample size, n, increases, the sampling variance decreases, i.e. the precision
increases.⁴
Provided the sampling fraction, n/N, is small, the term:
\[ \frac{N - n}{N - 1} \approx 1 \]
so it can be ignored – the precision depends effectively on n only.
Returning to our example, the larger the sample, the less variability there will be
between samples.
⁴ Although greater precision is desirable, data collection costs will rise with n – remember why we
sample in the first place!
x̄       Frequency when n = 2    Frequency when n = 4
3.50    1                       -
4.50    1                       -
5.00    3                       2
5.25    -                       1
5.50    2                       1
5.75    -                       3
6.00    1                       1
6.25    -                       2
6.50    3                       3
6.75    -                       1
7.00    1                       -
7.25    -                       1
7.50    1                       -
8.00    2                       -
Activity 16.3 List all samples of size n = 4 from A, B, C, D, E and F, when

sampling without replacement. Hence verify that the sampling distribution of X̄ is
as indicated in the table above.
We see that there is a striking improvement in the precision of the estimator, because
the variability has decreased considerably. The range of possible x̄ values goes from 3.5
to 8.0 down to 5.0 to 7.25. The sampling variance is reduced from 1.6 to 0.4.
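The mean and variance of the sampling distribution, and their agreement with the formula Var(X̄) = (N − n)/(N − 1) × σ²/n, can be verified by enumeration for both n = 2 and n = 4. This is a sketch only; the function name `sampling_mean_and_variance` is our own.

```python
from fractions import Fraction
from itertools import combinations

incomes = [3, 6, 4, 9, 7, 7]          # individuals A to F, in £000s
N = len(incomes)
mu = Fraction(sum(incomes), N)        # population mean
sigma2 = sum((x - mu) ** 2 for x in incomes) / N   # population variance


def sampling_mean_and_variance(population, n):
    """Exact mean and variance of the sample mean over all samples of
    size n drawn without replacement (every sample equally likely)."""
    means = [Fraction(sum(s), n) for s in combinations(population, n)]
    expected = sum(means) / len(means)
    variance = sum((m - expected) ** 2 for m in means) / len(means)
    return expected, variance


for n in (2, 4):
    expected, variance = sampling_mean_and_variance(incomes, n)
    formula = Fraction(N - n, N - 1) * sigma2 / n
    print(n, expected, variance, formula)
```

For each n the enumerated variance and the formula value should coincide, and the expected value should equal µ, illustrating unbiasedness.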
The factor (N − n)/(N − 1) decreases steadily as n → N. When n = 1 the factor equals
1, and when n = N it equals 0. When sampling without replacement, increasing n must
increase precision since less of the population is left out. In most practical sampling N
is very large (for example, several million), while n is comparatively small (at most
1,000, say). Therefore, in such cases the factor (N − n)/(N − 1) is close to 1, hence:
\[ \mathrm{Var}(\bar{X}) = \frac{N - n}{N - 1} \times \frac{\sigma^2}{n} \approx \frac{\sigma^2}{n} = \frac{\mathrm{Var}(X)}{n} \]
for small n/N . When N is large, it is the sample size n which is important in
determining precision, not the sampling fraction. Consider two populations: N1 = 3
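million and N2 = 200 million, both with the same variance σ². If we sample
n1 = n2 = 1000 from each population then:
\[ \sigma^2_{\bar{X}_1} = \frac{N_1 - n_1}{N_1 - 1} \times \frac{\sigma^2}{n_1} = 0.999667 \times \frac{\sigma^2}{1000} \]
and:
\[ \sigma^2_{\bar{X}_2} = \frac{N_2 - n_2}{N_2 - 1} \times \frac{\sigma^2}{n_2} = 0.999995 \times \frac{\sigma^2}{1000}. \]
So \(\sigma^2_{\bar{X}_1} \approx \sigma^2_{\bar{X}_2}\), despite N1 being much less than N2.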
16.2.2 Sampling from a normal population

The mean and variance of X̄ are equal to E(X) and Var(X)/n, respectively, for a
random sample of size n from any population distribution of X. What about the form
of the sampling distribution of X̄?
This depends on the distribution of X, and is not generally known. However, when the
distribution of X is normal, the sampling distribution of X̄ is also normal.
Sampling distribution of X̄ with a normal population
If {X1, . . . , Xn} is a random sample from a normal distribution with mean µ and
variance σ², then:
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right). \]
So we note E(X̄) = E(X) = µ. In an individual sample, x̄ is not usually equal to µ, the
expected value of the population. However, over repeated samples the values of X̄ are
centred at µ.
We also note Var(X̄) = Var(X)/n = σ²/n, and so the standard error is σ/√n. Variation
of values of X̄ in different samples (the sampling variance) is large when the population
variance of X is large. More interestingly, the sampling variance gets smaller when the
sample size n increases. In other words, when n is large the distribution of X̄ is more
tightly concentrated around µ than when n is small.
[Figure 16.7 plot: density curves of X̄ for n = 5, n = 20 and n = 100, plotted over the range 4.0 to 6.0.]
Figure 16.7: Sampling distributions of X̄ from a N (5, 1) population, for different n.
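The narrowing of the sampling distribution as n grows can be illustrated by simulation. The sketch below is not how Figure 16.7 was produced; it assumes 2,000 simulated samples per sample size and a fixed seed, both of which are our own choices, and compares the standard deviation of the simulated sample means with the standard error σ/√n.

```python
import random
from statistics import fmean, pstdev


def simulate_sample_means(n, reps=2000, mu=5.0, sigma=1.0, seed=0):
    """Draw `reps` random samples of size n from N(mu, sigma^2) and
    return the sample mean of each one."""
    rng = random.Random(seed)
    return [fmean([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(reps)]


for n in (5, 20, 100):
    means = simulate_sample_means(n)
    # Columns: n, average of the sample means, their standard deviation,
    # and the theoretical standard error sigma / sqrt(n) with sigma = 1.
    print(n, round(fmean(means), 3), round(pstdev(means), 3), round(1 / n ** 0.5, 3))
```

The simulated standard deviations should be close to, but not exactly equal to, the theoretical standard errors, since only a finite number of samples is drawn.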
16.3 Central limit theorem

We now have the very convenient result that if a random sample comes from a normal
population, the sampling distribution of X̄ is also normal. However, what about
sampling distributions of X̄ from other populations?
For this, we can use a remarkable mathematical result, the central limit theorem
(CLT). In essence, the CLT states that the normal sampling distribution of X̄ which
holds exactly for random samples from a normal distribution, also holds approximately
for random samples from (nearly) any distribution.
Suppose that {X1, X2, . . . , Xn} is a random sample from a population distribution
which has mean E(Xi) = µ and variance Var(Xi) = σ², for i = 1, . . . , n. If X̄n denotes
the sample mean calculated from a random sample of size n, then:
\[ \lim_{n \to \infty} P\left(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \le z\right) = \Phi(z) \]
for any z, where Φ(z) denotes P(Z ≤ z) where Z ∼ N(0, 1).
The ‘lim’ as n → ∞ indicates that this is an asymptotic result, i.e. one which holds increasingly
well as n increases, and exactly when the sample size is infinite.
Sampling distribution of X̄ with a non-normal population
In less formal language, the CLT says that for a random sample from (nearly) any
non-normal distribution with mean µ and variance σ², then:
\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
approximately, when n is sufficiently large. We can then say that X̄ is asymptotically
normally distributed with mean µ and variance σ²/n.
‘Nearly’ because the CLT requires that the variance of the population distribution is
finite. If it is not, the CLT does not hold, but such distributions are not common.
It may appear that the CLT is still somewhat limited, in that it applies only to sample
means calculated from simple random samples. However, this is not really true, for two
main reasons.
There are more general versions of the CLT which do not require the observations
to be from such samples.
Even the basic version applies very widely when we realise that the ‘X’ can also be
a function of the original variables in the data. For example, if X and Y are
variables in the sample, we can also apply the CLT to:
\[ \frac{1}{n}\sum_{i=1}^{n} \log(X_i) \quad \text{or} \quad \frac{1}{n}\sum_{i=1}^{n} X_i Y_i. \]
Therefore, the CLT can be used to derive sampling distributions for many statistics
which do not initially look at all like X̄ for a single variable in a random sample.
How large is ‘large n’? The larger the sample size n, the better is the normal
approximation provided by the CLT. In practice, we have various rules-of-thumb for
what is ‘large enough’ for the approximation to be ‘accurate enough’. This also depends
on the population distribution – for example:
for symmetric distributions, even small n is enough
for very skewed distributions, larger n is required.
For many distributions, n > 30 is sufficient for the approximation to be reasonably
accurate.
Example 16.5 In the first example, random samples (not shown here) of sizes:
n = 1, 5, 10, 30, 100 and 1000
were simulated from an ‘exponential’ distribution (for which µ = 4 and σ² = 16).

This is a positively-skewed distribution, as shown by the histogram for n = 1 in
Figure 16.8. Although we will not cover the exponential distribution formally in this
course, it is an interesting distribution since it describes the waiting times between
events in a Poisson process.
Ten thousand samples of each size were generated. Histograms of the values of X̄ in
these samples are shown in Figure 16.8. Each plot also shows the approximating
normal distribution N (4, 16/n). The normal approximation is reasonably good
already for n = 30, very good for n = 100 and practically perfect for n = 1000.
[Figure 16.8 panels: histograms of X̄ for n = 1, n = 5, n = 10, n = 30, n = 100 and n = 1000.]
Figure 16.8: Distributions of X̄ from an exponential distribution for which µ = 4, for

different n.
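A simulation in the spirit of Example 16.5 can be sketched as follows. This is not the code used for Figure 16.8: the 2,000 replications, the seed and the use of sample skewness as a rough measure of non-normality are all our own assumptions. For a normal distribution the skewness is zero, so values closer to zero suggest a better normal approximation.

```python
import random
from statistics import fmean, pvariance


def exponential_sample_means(n, reps=2000, mean=4.0, seed=1):
    """Sample means of `reps` random samples of size n from an
    exponential distribution with the given mean."""
    rng = random.Random(seed)
    rate = 1.0 / mean                      # expovariate takes the rate, 1/mean
    return [fmean([rng.expovariate(rate) for _ in range(n)]) for _ in range(reps)]


def skewness(values):
    """Sample skewness: the average cubed standardised deviation."""
    m = fmean(values)
    sd = pvariance(values, m) ** 0.5
    return fmean([((v - m) / sd) ** 3 for v in values])


for n in (1, 5, 30, 100):
    means = exponential_sample_means(n)
    # Columns: n, mean of X-bar, variance of X-bar, CLT variance 16/n, skewness.
    print(n, round(fmean(means), 2), round(pvariance(means), 3),
          round(16 / n, 3), round(skewness(means), 2))
```

The mean should stay near 4 and the variance near 16/n for every n, while the skewness should shrink towards zero as n increases.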
Example 16.6 In the second example, 10,000 random samples (again, not shown
here) of sizes:
n = 1, 10, 30, 50, 100 and 1000
were simulated from the Bernoulli(0.2) distribution (for which µ = 0.2 and also
σ² = 0.2 × (1 − 0.2) = 0.16).
Here the distribution itself is not even continuous, and has only two possible values,
0 and 1. Nevertheless, the sampling distribution of X̄ can be well-approximated by
the normal distribution, when n is large enough, as shown in Figure 16.9.
Note that since here Xi = 1 or Xi = 0 for all i = 1, . . . , n:
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{m}{n} \]
where m is the number of observations for which Xi = 1. In other words, X̄ is the
sample proportion of the value X = 1.
The normal approximation is clearly very bad for small n, but reasonably good
already for n = 50.
[Figure 16.9 panels: histograms of X̄ for n = 1, n = 10, n = 30, n = 50, n = 100 and n = 1000.]
Figure 16.9: Distributions of X̄ from Bernoulli(0.2), for different n.
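For the Bernoulli case the quality of the normal approximation can be measured exactly rather than by simulation, because m = nX̄ follows a binomial distribution. The sketch below swaps in that exact calculation; the function `max_cdf_gap` and the choice of ‘largest gap between the two cumulative distribution functions’ as the measure are our own assumptions, not the method used in Example 16.6.

```python
from math import comb
from statistics import NormalDist


def max_cdf_gap(n, p=0.2):
    """Largest gap between the exact cdf of the sample proportion and the
    CLT normal cdf, N(p, p(1 - p)/n), over the attainable values m/n."""
    approx = NormalDist(p, (p * (1 - p) / n) ** 0.5)
    exact_cdf, gap = 0.0, 0.0
    for m in range(n + 1):
        # Binomial probability of exactly m ones among n observations.
        exact_cdf += comb(n, m) * p ** m * (1 - p) ** (n - m)
        gap = max(gap, abs(exact_cdf - approx.cdf(m / n)))
    return gap


for n in (1, 10, 30, 50, 100, 1000):
    print(n, round(max_cdf_gap(n), 4))
```

The gap should be large for n = 1 and should fall as n increases, in line with the visual impression from the histograms.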
Activity 16.4 Consider the population below with N = 4 elements.
A B C D
3 6 9 12
(a) Calculate the population mean and variance.
(b) Write down the sampling distribution of the sample mean for samples of size
n = 2.
(c) Using the result in (b), calculate the mean of the sampling distribution.
(d) Using the result in (b), calculate the variance of the sampling distribution.
(e) Use the formula for the variance of the sample mean to verify the relationship
between the value from (d) and the population variance.
16.4 Summary
The final probability unit covered the key points relating to the normal distribution. We
saw how to calculate probabilities for this distribution by considering areas under its
curve. We then proceeded to explain the concept of a sampling distribution and its
importance when estimating an unknown parameter such as a population mean when
sampling from a normal population and, by way of the central limit theorem, when
sampling from non-normal populations.

Central limit theorem Expected value
Frequency distribution Normal distribution
Parameters Point estimate
Precision Sample proportion
Sampling distribution Sampling fraction
Sampling variance Selection bias
Simple random sample Standard error
Standardised variable Unbiased estimator
Learning outcomes
compute areas under the curve for a normal distribution
explain what a sampling distribution represents
state and apply the central limit theorem
Exercises
Exercise 16.1
The random variable X has a normal distribution with mean µ and variance σ², i.e.
X ∼ N (µ, σ 2 ). It is known that:
P (X ≤ 66) = 0.0359 and P (X ≥ 81) = 0.1151.
(a) Produce a clearly-labelled sketch to represent these probabilities on a normal curve.
(b) Show that the value of σ = 5.
(c) Calculate P (69 ≤ X ≤ 83).
Exercise 16.2
A random variable takes the values 1, 2 and 3, each with equal probability. List all
possible samples of size two which may be chosen, without replacement, from this
population and hence construct the sampling distribution of the sample mean, X̄.
Exercise 16.3
The weights of a large group of animals have a mean of 8.2 kg and a standard deviation
of 2.2 kg. What is the probability that a random selection of 80 animals from the group
will have mean weight between 8.3 kg and 8.4 kg? State any assumptions you make.
Exercise 16.4
A perfectly-machined regular tetrahedral (pyramid-shaped) die has four faces labelled 1
to 4. It is tossed twice onto a level surface and after each toss the number on the face
which is downward is recorded. If the recorded values are x1 and x2 and the mean is
x̄ = (x1 + x2 )/2, describe the distribution of x̄ as a random quantity over repeated
double tosses.
Exercise 16.5
A normal distribution has a mean of 40. If 10% of the distribution falls between the
values of 50 and 60, what is the standard deviation of the distribution?
Exercise 16.6
Consider the following set of data. Does it appear to approximately follow a normal
distribution? Justify your answer.
45 31 37 55 54 56
48 54 52 55 52 51
49 46 62 38 45 48
47 46 40 61 50 58
46 35 36 59 50 48
39 48 51 52 43 45
Exercise 16.7
Discuss the differences or similarities between a sampling distribution of size 5 and a
single (simple) random sample of size 5.
Exercise 16.8
The distribution of salaries of lecturers in a university is positively skewed, with most
lecturers earning near the minimum of the pay scale. What would a sampling
distribution of size 2 look like? How about size 5? How about size 50?
Exercise 16.9
In no more than 200 words, explain the term ‘central limit theorem’.
Unit 17: Sampling and experimentation I
Sampling techniques and contact methods
Overview
Statistics concerns data analysis, but to do any analysis first we need data! This unit
explores various methods which social scientists can use to gather data. Central to this
is the concept of sampling – the (possibly random) selection of a sample of members
from an underlying population. From our sample we can then make inferences about
the population. We begin by describing a range of sampling techniques, outlining their
relative advantages and disadvantages, and then consider the possible contact methods
which might be used.
Aims
This unit presents random and non-random sampling techniques and survey contact
methods. Particular aims are:
to provide an overview of sampling in the social sciences
to outline the advantages and disadvantages of various sampling techniques and

survey contact methods.
Background reading
17.1 Sampling
Sampling is a key component of any research design. The key to the use of statistics in
research is being able to take data from a sample and make inferences about a large
population. This idea is depicted in Figure 17.1.
Figure 17.1: A depiction of inferring population characteristics from a sample drawn from
the population of interest.
Sampling design involves several basic questions.
Should a sample be taken?
If so, what process should be followed?
What kind of sample should be taken?
How large should it be?
What can be done to control and adjust for non-response errors?
We now consider how to answer these questions.
Sample or census?
We introduce some important terminology.

Population – the aggregate of all the elements, sharing some common set of
characteristics, which comprise the universe for the purpose of the social science
problem.
Census – a complete enumeration of the elements of a population of study objects.
Sample – a subgroup of the elements of the population selected for participation

in the study.
To determine whether a sample or a census should be conducted, various factors need to
be considered such as the following.
A census is very costly so a large budget would be required, whereas a small budget
favours a sample because fewer population elements are observed.
The length of time available for the study is important – a sample is far quicker to
perform.
How large is the population? If it is ‘small’, then it is feasible to conduct a census

(it would not be too costly nor too time-consuming). However, it might not be
practical to enumerate a ‘large’ population.
We will be interested in some particular characteristic, such as the heights of a
group of adults. If there is a small variance in the characteristic of interest, then
population elements are ‘similar’, so we only need to observe a few elements to
have a clear idea about the characteristic. If the variance is large, then a sample
may fail to capture the large dispersion in the population, hence a census would be
more appropriate.
Sampling errors occur when the sample fails to adequately represent the population.
If the consequences of making sampling errors are extreme (i.e. the ‘cost’ is high),
then a census would appeal more since it eliminates sampling errors completely.
If non-sampling errors are costly (for example, an interviewer incorrectly
questioning respondents) then a sample is better because fewer resources would
have been spent on collecting the data.
Measuring sampled elements may result in the destruction of the object, such as
testing the lifespan of a tyre. Clearly, in such cases a census is not feasible as there
would be no tyres left to sell!
Sometimes we may wish to perform an in-depth interview to study elements in
great detail. If we want to focus on detail, then time and budget constraints would
favour a sample.

Conditions favouring the use of a:
Factors                           Sample         Census
Budget                            Small          Large
Time available                    Short          Long
Population size                   Large          Small
Variance in the characteristic    Small          Large
Cost of sampling errors           Low            High
Cost of non-sampling errors       High           Low
Nature of measurement             Destructive    Non-destructive
Attention to individual cases     Yes            No
Table 17.1: Sample versus census.

The conditions which favour the use of a sample or census are summarised in Table
17.1. Of course, in practice, some of our factors may favour a sample while others favour
a census, in which case a balanced judgement is required.
Activity 17.1 Under what conditions would (a) a sample be preferable to a census,
and (b) a census be preferable to a sample?
Classification of sampling techniques
We draw a sample from the target population, which is the collection of elements or
objects which possess the information sought by the researcher and about which
inferences are to be made. We now consider the different types of sampling techniques
which can be used in practice, which can be decomposed into ‘non-probability sampling
techniques’ and ‘probability sampling techniques’.
Non-probability sampling techniques are characterised by the fact that some units in
the population do not have a chance of selection in the sample. Individual units in the
population have an unknown probability of being selected. There is also an inability to
measure sampling error. Examples of such techniques are:
convenience sampling
judgemental sampling
quota sampling
snowball sampling.
Probability sampling techniques mean every population element has a known, non-zero
probability of being selected in the sample. Probability sampling makes it possible to
estimate the margins of sampling error, and hence all statistical techniques (such as
confidence intervals and hypothesis tests – not considered in this course) can be applied.
In order to perform probability sampling, we need a sampling frame which is a list of
all population elements. However, we need to consider whether the sampling frame is (i)
adequate (does it represent the target population?), (ii) complete (are there any missing
units, or duplications?), (iii) accurate (are we researching dynamic populations?), and
(iv) convenient (is the sampling frame readily accessible?). Examples of probability
sampling techniques are:
simple random sampling
systematic sampling
stratified sampling
cluster sampling
multistage sampling.
We now consider each of the listed techniques, explaining their strengths and
weaknesses. To illustrate each, we will use the example of 25 students (labelled ‘1’ to
‘25’), each of whom belongs to one of five classes (labelled ‘A’ to ‘E’), as follows:
A B C D E
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
17.1.1 Non-probability sampling techniques

Convenience sampling
Convenience sampling attempts to obtain a sample of convenient elements (hence
the name!). Often, respondents are selected because they happen to be in the right
place at the right time. Examples include using students and members of social
organisations, as well as ‘people-in-the-street’ interviews.
Suppose class D happens to assemble at a convenient time and place, so all elements
(students) in this class are selected. The resulting sample consists of students 16, 17, 18,
19 and 20. Note that no students are selected from classes A, B, C and E.
A B C D E
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
Strengths of convenience sampling include being the cheapest, quickest and most
convenient form of sampling. Weaknesses include selection bias (discussed later) and the
lack of a representative sample.
Judgemental sampling
Judgemental sampling is a form of convenience sampling in which the population

elements are selected based on the judgement of the researcher. Examples include
purchase engineers being selected in industrial market research, as well as expert
witnesses used in court.
Suppose a researcher believes classes B, C and E to be ‘typical’ and ‘convenient’.
Within each of these classes one or two students are selected based on typicality and
convenience. The resulting sample here consists of students 8, 10, 11, 13 and 24. Note
that no students are selected from classes A and D.
A B C D E
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
Judgemental sampling is achieved at low cost, is convenient, not particularly

time-consuming and good for ‘exploratory’ research designs. However, it does not allow
generalisations and is subjective due to the judgement of the researcher.
Quota sampling
Quota sampling may be viewed as two-stage restricted judgemental sampling. The

first stage consists of developing control categories, or quota controls, of population
elements. In the second stage, sample elements are selected based on convenience or
judgement. An example might be using gender as our control characteristic. Suppose
that 48% of the population is male (and hence 52% is female); we would then want the
sample composition to reflect this. Table 17.2 assumes a required sample size of
1,000, which means 48% of the sample (480) should be male and 52% of the sample
(520) should be female.

Control characteristic:    Population     Sample         Sample
Gender                     composition    composition    size
Male                       48%            48%            480
Female                     52%            52%            520
Total                      100%           100%           1,000
Table 17.2: Example of using gender as a quota control.

Suppose a quota of one student from each class is imposed. Within each class, one
student is selected based on convenience or judgement. The resulting sample consists of
students 3, 6, 13, 20 and 22.
A B C D E
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
Quota sampling is advantageous in that a sample can be controlled for certain

characteristics. However, it suffers from selection bias and there is no guarantee of a
representative sample.
Snowball sampling
In snowball sampling an initial group of respondents is selected, usually at random.

After being interviewed, these respondents are asked to identify others who belong to
the target population of interest. Subsequent respondents are selected based on these
referrals.
Suppose students 2 and 9 are selected randomly from classes A and B, respectively.
Student 2 refers students 12 and 13, while student 9 refers student 18. The resulting
sample consists of students 2, 9, 12, 13 and 18. Note there are no students from class E
included in the sample.
A B C D E
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
Snowball sampling has the major advantage of being able to increase the chance of
locating the desired characteristic in the population and is also fairly cheap. However, it
can be time-consuming.
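The referral process in the student example can be sketched in Python as below. The
initial respondents and their referrals are those given in the example; the function
name ‘snowball_sample’ and the ‘waves’ parameter are assumptions made for
illustration.

```python
def snowball_sample(seeds, referrals, waves=1):
    """Build a sample from initial respondents plus the people they refer.

    referrals maps a respondent to the list of people that respondent identifies.
    """
    sample = list(seeds)
    current = list(seeds)
    for _ in range(waves):
        next_wave = []
        for respondent in current:
            for referred in referrals.get(respondent, []):
                if referred not in sample:      # do not interview anyone twice
                    sample.append(referred)
                    next_wave.append(referred)
        current = next_wave
    return sorted(sample)


# Students 2 and 9 are the initial respondents; 2 refers 12 and 13, and 9 refers 18.
sample = snowball_sample([2, 9], {2: [12, 13], 9: [18]})
print(sample)
```

Only the initial respondents are chosen at random. Everyone else enters the sample
through a referral, which is why whole groups (such as class E here) can be missed.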
Activity 17.2 What is the difference between judgemental and convenience
sampling? Give examples of where these techniques may be applied successfully.
17.1.2 Probability sampling techniques
Simple random sampling (SRS)
In a simple random sample each element in the population has a known and equal
probability of selection. Each possible sample of a given size, n, has a known and equal
probability of being the sample which is actually selected. This implies that every
element is selected independently of every other element.
Suppose we select five random numbers (using a ‘random number generator’) from 1 to
25. Suppose the random number generator returns 3, 7, 9, 16 and 24. Therefore, the
resulting sample consists of students 3, 7, 9, 16 and 24. Note there is no student from
class C.
A B C D E
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
SRS is simple to understand and results are readily projectable. However, there may be
difficulty constructing the sampling frame, lower precision (relative to other probability
sampling methods) and there is no guarantee of a representative sample.
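As a hedged illustration, the ‘random number generator’ step can be sketched with
Python's standard library. The seed is an arbitrary choice made only so that the run is
repeatable, so the students drawn will generally differ from those in the example
above.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(population, n))


students = list(range(1, 26))   # students 1 to 25, as in the class grid
sample = simple_random_sample(students, 5, seed=1)
print(sample)
```

Because nothing forces the draw to cover every class, a particular sample may contain
no student from one or more classes.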
Systematic sampling
In systematic sampling the sample is chosen by selecting a random starting point
and then picking every ith element in succession from the sampling frame. The
sampling interval, i, is determined by dividing the population size, N, by the sample
size, n, and rounding to the nearest integer. When the ordering of the elements is
related to the characteristic of interest, systematic sampling increases the
representativeness of the sample. When the ordering of the elements produces a cyclical
pattern, systematic sampling may decrease the representativeness of the sample.
For example, suppose there are 100,000 elements in the population and a sample of
1,000 is required. In this case the sampling interval is i = N/n = 100000/1000 = 100. A
random number between 1 and 100 is selected. If, for example, this number is 23, then
the sample consists of elements 23, 123, 223, 323, 423, 523 and so on.
Suppose we select a random number between 1 and 5, say 2. The resulting sample of
students consists of students 2, 2 + 5 = 7, 2 + 5 × 2 = 12, 2 + 5 × 3 = 17 and
2 + 5 × 4 = 22. Note all the students are selected from a single row.
A B C D E
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
Systematic sampling may or may not increase representativeness – it depends on
whether there is any ‘ordering’ in the sampling frame. It is easier to implement relative
to SRS.
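The procedure can be sketched in Python as follows. The function name and the
optional ‘start’ argument are illustrative assumptions; passing ‘start’ simply lets us
reproduce the two worked examples above.

```python
import random


def systematic_sample(N, n, start=None, seed=None):
    """Select n elements from a frame numbered 1..N using a fixed sampling interval."""
    # Sampling interval i = N/n, rounded to an integer.
    # Note that Python's round() sends exact halves to the nearest even integer.
    i = round(N / n)
    if start is None:
        start = random.Random(seed).randint(1, i)   # random start between 1 and i inclusive
    return list(range(start, N + 1, i))[:n]


print(systematic_sample(25, 5, start=2))               # the student example
print(systematic_sample(100000, 1000, start=23)[:6])   # first six elements of the large example
```

Only the starting point is random. Once it is chosen, the rest of the sample is fully
determined, which is why any cyclical pattern in the sampling frame matters.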
Stratified sampling
Stratified sampling is a two-step process in which the population is partitioned
(divided up) into subpopulations known as strata¹. The strata should be mutually
exclusive and collectively exhaustive in that every population element should be
assigned to one, and only one, stratum and no population elements should be omitted.
Next, elements are selected from each stratum by a random procedure, usually SRS. A
major objective of stratified sampling is to increase the precision of statistical inference
without increasing cost.
The elements within a stratum should be as homogeneous as possible (i.e. as similar as
possible), but the elements between strata should be as heterogeneous as possible (i.e. as
different as possible). The stratification factors should also be closely related to the
characteristic of interest. Finally, the factors (variables) should decrease the cost of the
stratification process by being easy to measure and apply.
In ‘proportionate stratified sampling’, the size of the sample drawn from each stratum is
proportional to the relative size of that stratum in the total population. In
‘disproportionate (optimal) stratified sampling’, the size of the sample from each
stratum is proportional to the relative size of that stratum and to the standard
deviation of the distribution of the characteristic of interest among all the elements in
that stratum.
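The two allocation rules can be compared with a short Python sketch. The stratum
sizes and standard deviations below are invented purely for illustration, and the
function names are not from this guide.

```python
def proportionate_allocation(sizes, n):
    """Sample size per stratum proportional to the stratum size N_h."""
    total = sum(sizes.values())
    return {h: n * N_h / total for h, N_h in sizes.items()}


def disproportionate_allocation(sizes, sds, n):
    """Sample size per stratum proportional to N_h times the standard deviation in stratum h."""
    products = {h: sizes[h] * sds[h] for h in sizes}
    total = sum(products.values())
    return {h: n * p / total for h, p in products.items()}


# Hypothetical strata: sizes and standard deviations are made-up numbers.
sizes = {"A": 500, "B": 300, "C": 200}
sds = {"A": 10, "B": 20, "C": 30}

prop = proportionate_allocation(sizes, 100)
opt = disproportionate_allocation(sizes, sds, 100)
print(prop)
print({h: round(v, 1) for h, v in opt.items()})
```

In this made-up case the largest stratum is also the least variable, so the
disproportionate rule moves part of the sample away from it and towards the more
variable strata. In practice the allocations would be rounded to whole numbers.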
Suppose we randomly select a number from 1 to 5 for each class (stratum) A to E. This
might result, say, in the stratified sample consisting of students 4, 7, 13, 19 and 21. Note
that one student is selected from each class.
A B C D E
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
Stratified sampling includes all the important subpopulations and ensures a high level
of precision. However, sometimes it might be difficult to select relevant stratification
factors and the stratification process itself might not be feasible in practice if it was not
known to which stratum each population element belonged.
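A minimal Python sketch of the student example, treating each class as a stratum
and drawing one student at random from each. The seed is arbitrary, so the students
selected will generally differ from those listed above.

```python
import random

# Classes A to E are the strata; each contains five students.
classes = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
    "D": [16, 17, 18, 19, 20],
    "E": [21, 22, 23, 24, 25],
}


def stratified_sample(strata, per_stratum=1, seed=None):
    """Draw a simple random sample of the given size within every stratum."""
    rng = random.Random(seed)
    return {name: sorted(rng.sample(members, per_stratum)) for name, members in strata.items()}


sample = stratified_sample(classes, per_stratum=1, seed=2)
print(sample)
```

Every class is guaranteed to appear in the sample, which simple random sampling
cannot promise.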
¹ ‘Strata’ is the plural of ‘stratum’.
Cluster sampling
In cluster sampling, the target population is first divided into mutually exclusive and
collectively exhaustive subpopulations known as clusters. A random sample of clusters is
then selected, based on a probability sampling technique such as SRS. For each selected
cluster, either all the elements are included in the sample (one-stage cluster sampling),
or a sample of elements is drawn probabilistically (two-stage cluster sampling).
Elements within a cluster should be as heterogeneous as possible, but clusters
themselves should be as homogeneous as possible. Ideally, each cluster should be a
small-scale representation of the population. In ‘probability proportionate to size
sampling’ the clusters are sampled with probability proportional to size. In the second
stage, the probability of selecting a sampling unit in a selected cluster varies inversely
with the size of the cluster.
Suppose we randomly select three clusters: B, D and E. Within each cluster, randomly
select one or two elements. The resulting sample here consists of students 7, 18, 20, 21
and 23. Note that no students are selected from clusters A and C.
A B C D E
1 6 11 16 21
2 7 12 17 22
3 8 13 18 23
4 9 14 19 24
5 10 15 20 25
Cluster sampling is easy to implement and cost effective. However, the technique suffers
from a lack of precision and it can be difficult to compute and interpret results.
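Two-stage cluster sampling for the student example can be sketched in Python as
below. For simplicity the sketch draws the same number of students from every
selected cluster, whereas the example above draws one or two; the function name and
seed are illustrative assumptions.

```python
import random

# Classes A to E are the clusters.
classes = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
    "D": [16, 17, 18, 19, 20],
    "E": [21, 22, 23, 24, 25],
}


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: select clusters by SRS. Stage 2: select elements by SRS within each one."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(sorted(clusters), n_clusters))
    return {name: sorted(rng.sample(clusters[name], n_per_cluster)) for name in chosen}


sample = two_stage_cluster_sample(classes, n_clusters=3, n_per_cluster=2, seed=0)
print(sample)
```

Setting ‘n_per_cluster’ equal to the cluster size would give one-stage cluster
sampling, in which every element of each selected cluster is included.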
Multistage sampling
In multistage sampling, sample selection is performed at two or more successive
stages. This technique is often adopted in large surveys. At the first stage, large
‘compound’ units are sampled (primary units), and several sampling stages of this type
may be performed until we at last sample the basic units.
The technique is commonly used in cluster sampling so that we are at first sampling
main clusters, and then clusters within clusters etc. We can also use multistage
sampling with mixed techniques, e.g. cluster sampling at the first stage and stratified
sampling at the second stage.
An example might be a national survey of salespeople in a company. Sales areas could
be identified and a random selection is taken from these. Instead of interviewing every
person in the chosen clusters (which would be a one-stage cluster sample), only
randomly-selected salespeople within the chosen clusters will be interviewed.
Activity 17.3 How do probability sampling techniques differ from non-probability
sampling techniques? Which factors should be considered when choosing between
probability and non-probability sampling?
17.1.3 Contact methods

We conclude this section with a short discussion of contact methods. Once the
sampling procedure has been chosen, researchers have a choice of contact method for
the survey/interview.
The most common contact methods are face-to-face interviews, telephone interviews
and online/postal/mail (so-called ‘self-completion’) interviews. In most countries you
can assume:
• an interviewer-administered face-to-face questionnaire will be the most expensive to
carry out
• telephone surveys depend very much on whether your target population is on the
telephone (and how good the telephone system is)
• self-completion questionnaires have low response rates.
We now explore some² of the advantages and disadvantages of various contact methods.
Face-to-face interviews
• Advantages: good for personal questions; allow for probing issues in greater
depth; permit difficult concepts to be explained; can show samples (such as
new product designs).
• Disadvantages: (very) expensive; not always easy to obtain detailed
information on the spot.
Telephone interviews
• Advantages: easy to achieve a large number of interviews; easy to check on the
quality of interviewers (through a central switchboard perhaps).
• Disadvantages: not everyone has a telephone so the sample can be biased;
cannot usually show samples; although telephone directories exist for landline
numbers, what about mobile numbers? Also, young people are more likely to
use mobiles rather than landlines, so they are more likely to be excluded.
Self-completion interviews
• Advantages: most people can be contacted this way (there will be little
non-response due to not-at-home reasons); allow time for people to look up
details such as income, tax returns etc.
• Disadvantages: high non-response rate – it requires effort to complete the
questionnaire; answers to some questions may influence answers to earlier
questions since the whole questionnaire is revealed to the respondent – this is
important where the order of a questionnaire matters; you have no control over
who answers the questionnaire.
² This is not necessarily an exhaustive list. Can you add any more?
Example 17.1 The following are examples of occasions when you might use a
particular contact method.
Face-to-face interview – a survey of shopping patterns

Here you need to be able to contact a sample of the whole population. You can
assume that a large proportion would not bother to complete a postal
questionnaire – after all, the subject matter is not very important and it takes
time to fill in a form! Using a telephone would exclude those (for example, the
poor and the elderly) who either do not have access to a telephone or are
unwilling to talk to strangers by telephone.
Telephone interview – a survey of business people and their attitudes
to a new item of office equipment
All of them will have a telephone, and also the questions should be simple to
ask. Here, booking time for a telephone interview at work (once it has been
agreed with the administration) should be much more effective than waiting for
a form to be filled in or sending interviewers to disrupt office routine.
Postal/mail questionnaire – a survey of teachers about their pay and
conditions
Here, on the spot interviews will not elicit the level of detail needed. Most
people do not remember their exact pay and taxation, particularly if they are
needed for earlier years. We would expect a high response rate and good-quality
data – the recipients are motivated to reply since they may be hoping for a pay
rise! Also, the recipients come from a group of people who find it relatively easy
to fill in forms without needing the help, or prompting, of an interviewer.
It is always possible to combine contact methods. The Family Expenditure Survey in
the UK, for example, combines the approach of using an interview three times over a
fortnight (to raise response and explain details) while the respondent household is
required to fill in a self-completion diary (showing expenditure, which could not be
obtained by an interview alone).
Email surveys are already widespread and becoming increasingly popular, although
they are only appropriate when the population to be studied regularly uses email and is
likely to reply to your questions, such as employees in your office. An obvious advantage
is that this contact method is very cheap to administer, although response rates tend to
be very low.
17.2 Summary
This unit has described the different sampling techniques which exist when sampling
from a population. It is important to know the merits and limitations of each so that a
recommendation for the most suitable choice of method can be made dependent on the
circumstances of the research problem. Attention has also been given to the choice of
contact method, again with a focus on the strengths and weaknesses of each type.
Census                   Cluster sampling
Contact methods          Convenience sampling
Judgemental sampling     Multistage sampling
Population               Quota controls
Quota sampling           Sample
Sampling design          Sampling frame
Sampling interval        Simple random sample
Snowball sampling        Stratified sampling
Systematic sampling      Target population
Learning outcomes
• design and conduct surveys in a social science context
• discuss the relative merits and limitations of different sampling techniques
• recommend an appropriate survey contact method
Exercises
Exercise 17.1
What are the main potential disadvantages of quota sampling with respect to
probability sampling?
Exercise 17.2
Why might disproportionate stratified sampling be preferable to proportionate stratified
sampling?
Exercise 17.3
In no more than 200 words, discuss the relative advantages and disadvantages of
telephone interviewing compared to face-to-face interviewing.
Exercise 17.4
The simplest probability-based sampling method is simple random sampling. Give two
reasons why it may be desirable to use a sampling design which is more sophisticated
than simple random sampling.
Exercise 17.5
What is the difference between one-stage cluster sampling and two-stage cluster
sampling?
Exercise 17.6
A corporation wants to estimate the total number of worker-hours lost for a given
month because of accidents among its employees. Each employee is classified into one of
three categories – labourer, technician and administrator. Which sampling method do
you think would be preferable here – simple random sampling, stratified sampling, or
cluster sampling? Give arguments to explain your choice.
Exercise 17.7
What criteria would you use in deciding which contact method to use in a survey of
individuals?
Exercise 17.8
Discuss the feasibility of each of the types of survey contact methods (personal
interview, postal survey, email and telephone survey) for a random sample of university
students about their undergraduate experiences and attitudes at the end of the
academic year.
Exercise 17.9
Retirement and Investment Services would like to conduct a survey on online users’
demands for additional internet retirement services. Outline your suggested sampling
and contact method and explain how the results might be affected by your methodology.
Unit 18: Sampling and experimentation II
Bias and the design of experiments
Overview
This unit explores potential sources of bias which may occur as a result of sampling.
Bias comes in various forms and potential remedies are presented. We conclude with a
look at the design of experiments in the social sciences. Unlike observational studies,
experiments are excellent for establishing causality through use of a control group.
Aims
This unit presents sources of bias and the design of experiments. Particular aims are:
• to be aware of different sources of error and bias
• to provide an overview of experimentation in the social sciences
• to introduce the notion of causality and how properly designed experiments can
test for this.
Background reading
18.1 Introduction
We have previously seen that the term target population represents the collection of
units (people, objects etc.) in which we are interested. In the absence of time and
budget constraints we conduct a census, that is a total enumeration of the population.
Its advantage is that there is no sampling error because all population units are
observed and so there is no estimation of population parameters. Due to the large size,
N, of most populations, an obvious disadvantage with a census is cost, so it is often not
feasible in practice. However, even with a census non-sampling errors may occur, for
example if we have to resort to using cheaper (hence less reliable) interviewers who may
erroneously record data, misunderstand a respondent etc.
So we select a sample, that is a certain number of population members are selected and
studied. The selected members are known as elementary sampling units. Sample surveys
(hereafter ‘surveys’) are how new data are collected on a population and tend to be
based on samples rather than a census. Selected respondents may be contacted by a
variety of methods such as face-to-face interviews, telephone, mail or email
questionnaires.
Sampling error will occur (since not all population units are observed). However,
non-sampling errors should be fewer since resources can be used to ensure high quality
interviewers or to check completed questionnaires.
18.2 Types of error

Several potential sources of error, which we do our utmost to control, can affect a
research design. The ‘total error’ represents the variation between the true value of a parameter
in the population of the variable of interest (such as a population mean) and the
observed value obtained from the sample. The total error is composed of two distinct
types of error in sampling design.
Sampling error occurs as a result of selecting a sample rather than performing a
census (where a total enumeration of the population is undertaken). It is
attributable to random variation due to the sampling scheme used. For probability
sampling, we can estimate the statistical properties of the sampling error, i.e. we
can compute (estimated) standard errors which facilitate the use of hypothesis
testing and the construction of confidence intervals.¹
Non-sampling error is a result of (the inevitable) failures of the sampling scheme.
In practice it is very difficult to quantify this sort of error, which typically requires
separate investigation. We distinguish between two types of non-sampling error.
• Selection bias – this may be due to (i.) the sampling frame not being equal
to the target population, (ii.) cases where the sampling scheme is not strictly
adhered to, or (iii.) non-response bias.
• Response bias – the actual measurements might be wrong, for example
ambiguous question wording, misunderstanding of a word in a questionnaire,
or sensitivity of the information which is sought. Interviewer bias is another
aspect of this, where the interaction between the interviewer and interviewee
influences the response given in some way, either intentionally or
unintentionally, such as through leading questions, the dislike of a particular
social group by the interviewer, the interviewer’s manner or lack of training.
These could all occur in an unplanned way and bias your survey badly.
Both kinds of error can be controlled or allowed for more effectively by a pilot survey.
A pilot survey is used:
to find the standard error which can be attached to different kinds of questions and
hence to underpin the sampling design chosen
to sort out non-sampling questions, such as:
• do people understand the questionnaires?
• are our interviewers working well?
• are there particular organisational problems associated with this enquiry?
¹ Note hypothesis testing and confidence intervals are not explicitly covered in this course.
Activity 18.1 Give the main problem with the wording of the survey question: ‘Do
you want to be rich and famous?’.
18.3 Bias
Bias caused by non-response and response is worth a special mention. It can cause
problems at every stage of a survey, both random and non-random, and however
administered. The first problem can be in the sampling frame. Is an obvious group
missing? For example:
• if the list is of householders, those who have just moved in will be missing
• if the list is of those aged 18 or over on the electoral register, and the under-20s are
careless about registration, then younger people will be missing from the sample.
In the field, non-response (data not provided by a unit we wish to sample) is one of
the major problems of sample surveys as the non-respondents, in general, cannot be
treated like the rest of the population. As such, it is most important to try to get a
picture of any shared characteristics in those refusing to answer or people who are not
available at the time of the interview. We can classify non-response into:
• item non-response, which occurs when a sampled member fails to respond to a
question in the questionnaire.
• unit non-response, which occurs when no information is collected from a sample
member.
Non-response may be due to:

• not-at-home due to work commitments or on holiday
• refusals due to subject matter or sponsorship of the survey
• incapacity to respond due to illness or language difficulties
• not found due to vacant houses, incorrect addresses or moved on
• lost schedules due to information being lost or destroyed after it had been
collected.
How should we deal with non-response? Well, note that increasing the sample size will
not solve the problem – the only outcome would be that we have more data on the
types of individuals who are willing to respond! Instead, we might look at improving
our survey procedures such as data collection and interviewer training. Non-respondents
could be followed up using callbacks or an alternative contact method to the original
survey in an attempt to subsample the non-respondents. A proxy interview (where a
unit from your sample is substituted with an available unit) may be another possibility.
(Note that non-response also occurs in quota sampling but is not generally recorded –
see the earlier discussion.) However, an obvious remedy is to provide an incentive (for
example cash or entry into a prize draw) to complete the survey – this exploits the
notion that human behaviour can be influenced in response to the right incentives!
Response error is very problematic because it is not so easy to detect. A seemingly
clear reply may be based on a misunderstanding of the question asked or a wish to
deceive. A good example from the UK is the reply to the question about the
consumption of alcohol in the Family Expenditure Survey. Over the years there is up to
a 50% understatement of alcohol use compared with the overall known figures for sales
from HM Revenue & Customs!
Sources of response error include the:
• role of the interviewer due to the characteristics and/or opinions of the
interviewer, asking leading questions and the incorrect recording of responses
• role of the respondent who may lack knowledge, forget information or be
reluctant to give the correct answer due to the sensitivity of the subject matter.
Control of response errors typically involves improving the recruitment, training and
supervision of interviewers, reinterviewing, consistency checks and increasing the
number of interviewers.
In relation to all of these problems pilot work is very important. It may also be possible
to carry out a check on the interviewers and contact methods used after the survey
(post-enumeration surveys).
18.4 Adjusting for non-response

Low response rates increase the probability that non-response bias will be problematic.
Response rates should always be reported and, whenever possible, the effects of
non-response should be estimated. This is possible by linking the response rate to
estimated differences between respondents and non-respondents. Information on the
differences between both groups may be obtained from the sample itself, for example
differences identified through callbacks could be extrapolated or perhaps a concentrated
follow-up could be performed on a subsample of non-respondents.
However, it may be that it is not possible to estimate the effects of non-response. In
such instances we can make adjustments during data analysis and interpretation. We
now briefly consider some possible adjustments for non-response.
• Subsampling of non-respondents – the researcher contacts a subsample of the
non-respondents, usually by means of telephone or personal interviews.
• Replacement – the non-respondents in the current survey are replaced with
non-respondents from an earlier, similar survey. The researcher attempts to contact
these non-respondents from the earlier survey and administer the current survey
questionnaire to them, possibly by offering a suitable incentive.
• Substitution – the researcher substitutes for non-respondents other elements from
the sampling frame which are expected to respond. The sampling frame is divided
into subgroups which are internally homogeneous in terms of respondent
characteristics but heterogeneous in terms of response rates. These subgroups are
then used to identify substitutes who are similar to particular non-respondents but
dissimilar to respondents already in the sample.
• Subjective estimates – when it is no longer feasible to increase the response rate by
subsampling, replacement or substitution, it may be possible to arrive at subjective
estimates of the nature and effect of non-response bias. This involves evaluating the
likely effects of non-response based on experience and available information.
• Trend analysis – this is an attempt to discern a trend between early and late
respondents. This trend is projected to non-respondents to estimate where they
stand on the characteristic of interest.
• Weighting – this attempts to account for non-response by assigning differential
weights to the data depending on the response rates. For example, suppose in a
survey the response rates were 85%, 70% and 40%, respectively, for the high-,
medium- and low-income groups. In analysing the data, these subgroups are
assigned weights inversely proportional to their response rates. That is, the weights
assigned would be 100/85, 100/70 and 100/40, respectively, for the high-, medium-
and low-income groups.
• Imputation – this involves imputing, or assigning, the characteristic of interest to
the non-respondents based on the similarity of the variables available for both
non-respondents and respondents. For example, a respondent who does not report
brand usage may be imputed the usage of a respondent with similar demographic
characteristics.
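The weighting adjustment in the list above can be illustrated with a short Python
sketch. The response rates are those quoted in the example. The numbers of
respondents and the group means are invented purely for illustration, and the function
names are not from this guide.

```python
def nonresponse_weights(response_rates_percent):
    """Weights inversely proportional to each group's response rate (100/85, 100/70, ...)."""
    return {group: 100 / rate for group, rate in response_rates_percent.items()}


def weighted_mean(group_means, respondents, weights):
    """Mean over respondents, where each respondent carries the weight of their group."""
    total = sum(weights[g] * respondents[g] * group_means[g] for g in group_means)
    weight_sum = sum(weights[g] * respondents[g] for g in group_means)
    return total / weight_sum


response_rates = {"high": 85, "medium": 70, "low": 40}   # percentages from the example
weights = nonresponse_weights(response_rates)

# Hypothetical: 100 people sampled in each income group, so 85, 70 and 40 respond.
# The group means of the characteristic of interest are made-up numbers.
respondents = {"high": 85, "medium": 70, "low": 40}
group_means = {"high": 50.0, "medium": 30.0, "low": 10.0}

unweighted = weighted_mean(group_means, respondents, {g: 1.0 for g in respondents})
adjusted = weighted_mean(group_means, respondents, weights)
print(round(unweighted, 1), round(adjusted, 1))
```

With these made-up figures the unweighted mean is pulled towards the high-income
group, which responds most often. The weighted mean restores each group to its share
of the original sample. This corrects the bias only if respondents and non-respondents
within an income group are similar, which the weights themselves cannot verify.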
Activity 18.2 Non-response in surveys is considered to be problematic.
(a) Give two possible reasons why non-response may occur.
(b) Why is non-response problematic for the person or organisation conducting the
research? Give two reasons.
(c) How can non-response be reduced in telephone surveys and mail surveys,
respectively?
18.5 Experimental design in the social and medical sciences
We conclude the ‘Sampling and experimentation’ part of the course with a look at
experimental design in the social and medical sciences. Research design is a set of
advance decisions which make up the master plan specifying the methods and
procedures for collecting and analysing the required information. There is a large array
of alternative research designs which can satisfy research objectives. The key is to create
a design which enhances the value of the information obtained while reducing the cost
of obtaining it.
A research design is a framework or blueprint for conducting the research project. It
details the procedures necessary for obtaining the information needed to structure or
solve research problems. There are basic research designs which can be successfully
matched to given problems and research objectives and they serve a researcher much
like a blueprint serves a builder.
18.5.1 Experimental versus observational studies

In an experiment, an intervention or treatment is applied to some or all of the
experimental units (often people). The experimenter decides (using randomisation
and blocking) which person gets which treatment, or treatment combination. The
outcomes are recorded after the intervention (and often various measurements are made
before the intervention). The subsequent analysis involves comparing the outcomes for
the different treatments. Typically, the experimental units are not chosen to be
particularly representative of a population of interest.
In an observational study, data are collected about the units (people) without any
intervention. In fact the researcher always tries not to influence the observations. A
social sample survey is an example of an observational study where the main data
are responses to a questionnaire. The questionnaire and interviewing technique are
designed to obtain, as far as possible, responses which are not biased by the manner of
asking. Typically, the sample is designed to be representative of the population of
interest using either quota sampling or probability sampling.
18.5.2 Randomised controlled clinical trials

Randomised controlled clinical trials (RCCTs) are routinely used in the
development and testing of medical procedures. The methodology can also be
appropriate for some research in the social sciences, such as in criminology or education.
In the simplest completely randomised design, the participants are divided at random
(i.e. using a randomisation device) into a treatment group and a control group.
The treatment group receives the experimental treatment (for example, a new drug)
and the control group receives the control treatment (for example, the drug currently
used or a placebo).
The randomisation ensures that there will, on average, be no bias due to the allocation
of the treatments. The use of a control group is essential to estimate the extent to
which the outcomes in the treatment group are due to the treatment and are not ‘what
would have happened anyway’.
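The ‘randomisation device’ in a completely randomised design can be as simple as a
random shuffle. The Python sketch below assumes 20 hypothetical participants and an
equal split between the two groups; neither assumption comes from this guide.

```python
import random


def completely_randomised_design(participants, seed=None):
    """Shuffle the participants and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": sorted(shuffled[:half]), "control": sorted(shuffled[half:])}


participants = range(1, 21)   # 20 hypothetical participants, labelled 1 to 20
groups = completely_randomised_design(participants, seed=42)
print(groups)
```

Because chance alone decides who is treated, neither the experimenter nor the
participants can steer particular kinds of people into one group.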
Typically, participating in an experiment produces a change, even if no treatment is
given. This is called the placebo effect in medical trials and the Hawthorne effect in
social science experiments. Therefore, it is usual to include a dummy treatment given to
the control group.
To avoid bias in assessing the effects of the treatments, double or triple blinding is
recommended. Double blinding means that both the subjects and administrators are
unaware of who has received the treatment. Triple blinding means that neither the
person receiving the treatment, nor those involved in their care, nor those involved in
measuring the outcomes know which treatment was given to which person.
The sample size in each treatment group must be large enough to ensure that medically
(or socially) important differences can be detected. Sample size calculations to ensure
adequate power are a routine part of experimental design.
18.5.3 Randomised blocks
To increase the accuracy of the comparisons, the units may be grouped into blocks (for
example by age and gender, or by severity of disease). Within each block one or more
units receive each treatment. Treatments are allocated using randomisation within each
block. Sometimes there are strata or subgroups of interest (for example, we might want
to know whether the drug is as effective for men as it is for women) in which case the
blocks should be chosen to correspond to strata (or to subsets of strata).
18.5.4 Multifactorial experimental designs
In multifactorial experimental designs, rather than just one factor or treatment (a
drug, say) being tested, several factors are tested simultaneously (such as a drug, diet
and exercise). For example, if each of these factors had two levels (experimental = 1,
control = 0), then there would be eight combinations. Ideally the units (people) would
be assigned to blocks of size eight (where people in the same block would have similar
characteristics) and the eight treatment combinations would be allocated at random
(using a randomisation device) to the eight people within each block.
Not only does this allow for an efficient use of resources (three different factors can be
compared for only a little more than the price of one), but also interactions between the
factors can be estimated (for example, the effect of aspirin might be different for those
on a low-fat diet than for those on a normal diet).
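The eight treatment combinations and their random allocation within one block can
be sketched in Python as follows. The participant labels P1 to P8 and the function
name are hypothetical.

```python
import itertools
import random

FACTORS = ("drug", "diet", "exercise")
# Each factor has two levels (experimental = 1, control = 0), giving 2 x 2 x 2 = 8 combinations.
COMBINATIONS = list(itertools.product((0, 1), repeat=len(FACTORS)))


def allocate_within_block(block, seed=None):
    """Randomly assign each treatment combination to exactly one person in the block."""
    if len(block) != len(COMBINATIONS):
        raise ValueError("a block must contain one person per treatment combination")
    rng = random.Random(seed)
    shuffled = list(COMBINATIONS)
    rng.shuffle(shuffled)
    return {person: dict(zip(FACTORS, combo)) for person, combo in zip(block, shuffled)}


block = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]   # eight similar people
allocation = allocate_within_block(block, seed=3)
for person, treatment in allocation.items():
    print(person, treatment)
```

Every combination appears once in each block, so each factor is compared on four
people at each level within the block, and interactions between factors can be
estimated from the same units.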
18.5.5 Quasi-experiments
Experiments in which there is no control group, or in which randomisation is not used,
are sometimes called quasi-experiments. An example might be where a new teaching
method is introduced for students taking a course in the current academic year.
Students who took the course in the previous academic year can be used as the control
group, but since no randomisation was used in the allocation of teaching method any
differences in outcomes for the two years might arise from known or unknown
differences between the years rather than from differences in the teaching method.
18.5.6 Cluster randomised trials
Cluster randomised trials are used where it is not practical to apply a treatment, or
treatment combination, to individuals using randomisation, but only to groups or
clusters of individuals. In an educational experiment, schools might be clusters. Half of
the schools, chosen at random, might be given new technology and the other half not.
Results for the students could be aggregated to school level and the treatments could be
compared. Note the experimental units are the clusters (schools) and the relevant
sample size is the number of clusters (schools), not the number of subunits (students).
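A minimal Python sketch of the school example follows. The school names and student scores are invented for illustration; the point is only that allocation and comparison both happen at cluster level.

```python
import random
from statistics import mean

# Hypothetical student scores, grouped by school (the cluster).
scores_by_school = {
    "School A": [52, 61, 58],
    "School B": [47, 55, 50, 49],
    "School C": [66, 63],
    "School D": [59, 57, 62],
}

rng = random.Random(3)
schools = sorted(scores_by_school)
rng.shuffle(schools)
half = len(schools) // 2
treated = set(schools[:half])   # schools given the new technology
control = set(schools[half:])   # schools not given it

# Aggregate to school level: one mean score per cluster.
school_means = {s: mean(v) for s, v in scores_by_school.items()}
treated_means = [school_means[s] for s in treated]
control_means = [school_means[s] for s in control]

# The sample size for the comparison is the number of clusters, not students.
n_clusters = len(school_means)
n_students = sum(len(v) for v in scores_by_school.values())
print("clusters:", n_clusters, "students:", n_students)
print("difference in mean of school means:",
      mean(treated_means) - mean(control_means))
```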
18.5.7 Analysis and interpretation
Similar methods of statistical analysis may be used for experimental and for
observational data, but the interpretation differs.
An observational study (such as a survey of schools) may show that schools with
modern technology have better examination results. However, this could be due to the
fact that these schools are generally better equipped and/or have better students.
In an experiment where schools chose to participate it might be found that those
provided with ‘modern technology’ did better than those given extra supplies of paper
and pencils. This might be evidence that having modern equipment would help schools
which would choose to participate. Experiments can provide evidence of causation.
However, the results might not apply to all schools. Less adventurous or more
hard-pressed schools might benefit more from additional paper and pencils.
Activity 18.3 Write brief notes on:
(a) blind trials
(b) control and treatment groups
(c) measuring causation.
18.6 Summary
This unit has explored the different sources of error and bias which exist when drawing
a sample from a population. Non-response bias is particularly problematic, and a
variety of adjustments to account for non-response were suggested. Experimentation
concluded this topic, and the importance of a control and treatment group was outlined
in order to establish causality.
Blinding Blocking
Control group Experiment
Incentive Intervention
Interviewer bias Item non-response
Non-response Non-sampling error
Observational study Pilot survey
Placebo Randomisation
Research design Response bias
Response error Sampling error
Selection bias Treatment
Unit non-response
Learning outcomes
design and conduct experiments in a social science context
define different forms of bias, explain why they are problematic and offer potential
remedies
explain how an experiment can be used to determine causality
Exercises
Exercise 18.1
The following question appeared in a survey of university students: ‘How much time do
you spend studying per week?’. List two problems with the phrasing of this question
which may adversely affect the reliability of the answers to it.
Exercise 18.2
Give an example of response bias. Is response bias a form of sampling error or a form of
non-sampling error? Briefly explain why.
Exercise 18.3
In no more than 200 words, explain the difference between an experimental design and
a survey design, and discuss their relative advantages.
Exercise 18.4
Briefly discuss the advantages and disadvantages of paying respondents for an interview.
Exercise 18.5
A research group has designed a survey and finds the costs are greater than the
available budget. Two possible methods of saving money are a sample size reduction or
spending less on interviewers (for example, by providing less interviewer training or
taking on less-experienced interviewers). Discuss the advantages and disadvantages of
these two approaches.
Exercise 18.6
In no more than 200 words, discuss the role of the interviewer in a survey and the
importance of training an interviewer.
Exercise 18.7
Readers of the magazine Popular Science were asked to phone in (on a premium rate
number) their responses to the following question: ‘Should the United States build more
fossil fuel generating plants or the new so-called safe nuclear generators to meet the
energy crisis?’. Of the total call-ins, 86% chose the nuclear option. Discuss the way the
poll was conducted, the question wording, and whether or not you think the results are
a good estimate of the prevailing mood in the country.
Exercise 18.8
What is randomisation in the context of experimental design?
Exercise 18.9
Explain what is meant by each of the following and why they are considered desirable in
an experiment:
(a) placebo
(b) double blinding
(c) blocking
(d) multifactorial design.
19. Fundamentals of regression I – Correlation and the simple linear regression model
Unit 19: Fundamentals of regression I
Correlation and the simple linear regression model
Overview
In Section 12.5, we saw that bivariate datasets could be visualised using scatter plots.
We discussed, for example, the effect advertising appeared to have on sales, i.e. whether
there is a positive or negative relationship between the variables. In this unit we go
further by introducing correlation and then proceed to modelling a linear relationship
between variables using a common procedure known as regression.
Aims
This unit explains the concepts of correlation and the fundamentals of regression.
Particular aims are:
to highlight the importance of correlation
to provide an introduction to modelling a linear relationship between variables.
Background reading
19.1 Introduction
We now investigate the relationship between variables. When we have data on two
variables (x and y, say), we have bivariate data. We will consider how to:
measure the strength of the relationship
model the relationship
predict the value of one variable on the basis of the other.
The first thing to do with data is to provide a graphical representation. For one variable
this might be a histogram or a pie chart. For two variables we produce a scatter plot (as
previously discussed in Section 12.5).
Example 19.1 Assume that we have some data in paired form, say:
\[ (x_i, y_i), \quad \text{for } i = 1, 2, \ldots, n. \]
An example might be unemployment and crime figures for 12 areas of a city.

Unemployment, x   2614   1160   1055   1199   2157   2305
Offences, y       6200   4610   5336   5411   5808   6004

Unemployment, x   1687   1287   1869   2283   1162   1201
Offences, y       5420   5588   5719   6336   5103   5268

We plot x on the horizontal axis and y on the vertical axis. By doing so, we can
easily see whether there is any relationship between the variables. The scatter plot is
shown in Figure 19.1.
[Scatter plot of Crime against Unemployment. Horizontal axis: Unemployment (1000 to 2500); vertical axis: Number of offences (5000 to 6000).]
Figure 19.1: Scatter plot of the unemployment and reported crime data.
Looking at Figure 19.1, an approximate positive, linear relationship is apparent: x and y
increase together, roughly linearly. The implied linear relationship is not exact – the
points do not lie exactly on a straight line. Such an ‘upward shape’ is termed positive
correlation. (We will see later how to quantify correlation.)
Other examples of scatter plots are shown in Figures 19.2 and 19.3.
19.2 Correlation
Correlation measures the strength of the linear relationship between two variables,
each measured on an interval scale.
[Scatter plot. Horizontal axis: x; vertical axis: y.]
Figure 19.2: Scatter plot showing negative correlation (y decreases as x increases).
Positive correlation – the two variables tend to vary in the same direction.
Negative correlation – the two variables tend to vary in the opposite direction.
Perfect correlation – the two variables have points which all lie exactly on a
straight line.
If there exists a perfect linear relationship between x and y, we can represent them
using an equation of the form:
y = b0 + b1 x
where:
b0 represents the y-intercept of the line
b1 represents the slope or gradient of the line.
Examples of anticipated correlation include:
Variables                                   Correlation
Height and weight                           Positive
Rainfall and sunshine hours                 Negative
Ice cream sales and sun cream sales         Positive
Hours of study and examination mark         Positive
Car’s petrol consumption and goals scored   Zero
Positive correlation is characterised by large x with large y and small x with small y.
Negative correlation is characterised by large x with small y and small x with large y.
However, since x and y may have widely different numerical values we need to take this
into account. We do this by considering how far away from their means the two
variables are.
[Scatter plot. Horizontal axis: x; vertical axis: y.]
Figure 19.3: Scatter plot showing uncorrelated data (no obvious (linear) relationship
between x and y).
So, we are interested in the degree to which variations in variable values are related to
each other. Our basis for the measurement of correlation is:
\[ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}. \]
Unfortunately, this measure is extremely sensitive to the units in which the variables
are measured. We would prefer a measure of correlation to remain the same regardless
of the units of measurement (for example days, hours, minutes or seconds). For this
reason we use the following.
Sample correlation coefficient
The sample correlation coefficient, r, measures the strength of the linear
relationship between two variables. It is defined as:
\[ r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
where x̄ and ȳ are the sample means, and sx and sy are the sample standard
deviations, of x and y, respectively.
Note that r is just the sum of the products of the z-scores (see Section 16.1) of each
point’s coordinates, divided by n − 1. This statistic is completely independent of the
units used to measure the variables.
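The unit-independence of r can be checked numerically. The following Python sketch uses the unemployment and crime figures from Example 19.1; the function name `sample_correlation` and the choice to re-express unemployment in thousands are assumptions made purely for illustration.

```python
from math import sqrt

def sample_correlation(xs, ys):
    """r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

unemployment = [2614, 1160, 1055, 1199, 2157, 2305,
                1687, 1287, 1869, 2283, 1162, 1201]
offences = [6200, 4610, 5336, 5411, 5808, 6004,
            5420, 5588, 5719, 6336, 5103, 5268]

r = sample_correlation(unemployment, offences)
# Re-express unemployment in thousands: the units change, r does not.
r_rescaled = sample_correlation([x / 1000 for x in unemployment], offences)
print(round(r, 2), round(r_rescaled, 2))
```

Both calls give the same value because dividing by the standard deviation removes the units from each deviation.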
Sample correlation coefficient (alternative formula)
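We can also find r using the formula:
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]
where:
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \]
and:
\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \]
but we will not show their respective equivalences in this course.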
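Example 19.2 Returning to the unemployment/crime dataset in Example 19.1, we
have:
\[ \sum_{i=1}^{12} x_i = 19979, \quad \sum_{i=1}^{12} x_i^2 = 36695129, \quad \sum_{i=1}^{12} y_i = 66803, \]
\[ \sum_{i=1}^{12} y_i^2 = 374471231 \quad \text{and} \quad \sum_{i=1}^{12} x_i y_i = 113784494. \]
So, since n = 12, we have x̄ = 19979/12 = 1664.92 and ȳ = 66803/12 = 5566.92, so
the (sample) correlation coefficient, r, is:
\[
\begin{aligned}
r &= \frac{\sum_{i=1}^{12} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left( \sum_{i=1}^{12} x_i^2 - n\bar{x}^2 \right) \left( \sum_{i=1}^{12} y_i^2 - n\bar{y}^2 \right)}} \\
  &= \frac{113784494 - (12 \times 1664.92 \times 5566.92)}{\sqrt{(36695129 - 12 \times (1664.92)^2) \times (374471231 - 12 \times (5566.92)^2)}} \\
  &= 0.8609.
\end{aligned}
\]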
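The same calculation can be sketched in Python directly from the summary statistics. The variable names are illustrative. Because the means are not rounded to two decimal places before use, the final decimal places may differ slightly from a hand calculation; the result is approximately 0.86.

```python
from math import sqrt

# Summary statistics for the unemployment/crime data (n = 12 areas).
n = 12
sum_x, sum_x2 = 19979, 36695129
sum_y, sum_y2 = 66803, 374471231
sum_xy = 113784494

x_bar, y_bar = sum_x / n, sum_y / n

# Corrected sums of squares and cross-products.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2

r = s_xy / sqrt(s_xx * s_yy)
print(f"r = {r:.2f}")
```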
The (sample) correlation coefficient, r, takes values between −1 and 1, i.e. we have:
−1 ≤ r ≤ 1.
r > 0 indicates positive correlation, with r = 1 indicating perfect positive
correlation.
r < 0 indicates negative correlation, with r = −1 indicating perfect negative
correlation.
The closer |r| is to 1, the stronger the linear relationship is.
r ≈ 0 suggests that x and y have no linear relationship.
Beware: r ≈ 0 does not necessarily imply no relationship (as there could be a
non-linear relationship). For example, the scatter plot in Figure 19.4 arises from
data where r = 0.1481, but there is a clear quadratic relationship.
[Scatter plot. Horizontal axis: x (20 to 80); vertical axis: y (500 to 2500).]
Figure 19.4: Scatter plot of data simulated from the (approximate) quadratic equation
y = 2(x − 15)(85 − x).
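A short Python sketch makes the same point with noise-free data. It evaluates the quadratic above on an evenly spaced grid of x values chosen here for illustration (it is not the simulated dataset behind Figure 19.4, so the value of r differs from 0.1481); the helper name `correlation` is likewise an assumption.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation coefficient via S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# x values placed symmetrically about 50, where the quadratic peaks.
xs = list(range(20, 81, 5))
ys = [2 * (x - 15) * (85 - x) for x in xs]

r = correlation(xs, ys)
print(f"r = {r:.4f}")
```

Here y is completely determined by x, yet r is essentially zero, because the rising and falling halves of the curve cancel in the linear measure.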
Activity 19.1 State whether the following statements are true or false, explaining
your answers.
(a) ‘The correlation coefficient between x and y is the same as the correlation
coefficient between y and x.’
(b) ‘If the slope is negative in a regression equation y = b0 + b1 x, then the
correlation coefficient between x and y would be negative too.’
(c) ‘If two variables have a correlation coefficient of minus 1 they are not related.’
(d) ‘A large correlation coefficient means the regression line will have a large slope
b1.’
19.3 Simple linear regression

Here we introduce the simple linear regression model. This is only part of a very
large topic in statistical analysis. In the simple model, we have two variables y and x.
y is the dependent (or response) variable – the variable we are trying to
explain.
x is the independent (or explanatory) variable – the variable we think
influences y.
Numerous reasons exist for establishing a mathematical relationship between y and x.

We may wish:
to find and interpret unknown parameters in a known relationship
to understand the reason for such a relationship – is it causal?
to predict or forecast y for specific values of the explanatory variable.
Hence our objectives in regression analysis are to:
estimate any unknown parameters
estimate the variation about the proposed model
estimate the precision of the estimates
test the adequacy of the proposed model and the relevance of the explanatory
variable.
A linear relationship between variables
Assume a true (population) linear relationship between a response variable y and an
explanatory variable x of the approximate form:
y = b0 + b1 x
where:
b0 and b1 are fixed, but unknown, parameters
b0 is the y-intercept
b1 is the slope of the line.
We seek to estimate b0 and b1 using (paired) sample data (xi , yi ), for i = 1, . . . , n.

Particularly in the social sciences, we would not expect a perfect linear relationship
between the two variables. Hence we modify this basic model to get:
y = b0 + b1 x + ε
where ε is some random perturbation from the initial ‘approximate’ line. In other
words, each y observation almost lies on the hypothesised line, but ‘jumps’ off the line
according to the random variable ε. Often we refer to ε as the error term of the model.
19.4 Parameter estimation

For given sample data we could first produce a scatter plot from which any linear
relationship would be visible. When the data indicate a linear relationship, it suggests
we should perform a (simple) linear regression. So we need to estimate the population
regression line using the sample data. This estimated line is often called the line of
best fit.
How do we choose the line of best fit? Well, we require a formal criterion for
determining the line of best fit. Without going into details (which are beyond the scope
of this course), estimation of b0 and b1 will be by least squares estimation.
Specifically, we seek the values of b0 and b1 which minimise the sum of the squared error
terms, that is, the sum of (yi − b0 − b1 xi)² over the n observations.
So the error terms are used to estimate the parameters (slope and intercept) of the
model. Note this is an optimisation problem. The intercept tells us the value of the
response variable when the explanatory variable is zero. The slope tells us by how much
the response variable changes when the explanatory variable increases by one unit.
Least squares estimators and line of best fit
The least squares estimator of b1 is:
\[ \hat{b}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}. \]
The least squares estimator of b0 is:
\[ \hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}. \]
Hence the line of best fit has the equation:
\[ \hat{y} = \hat{b}_0 + \hat{b}_1 x \]
where ŷ is our estimate of y based on the line of best fit when x is the value of the
explanatory variable.
Example 19.3 Returning to the unemployment/crime dataset from Example 19.1,
we have:
\[ \sum_{i=1}^{12} x_i = 19979, \quad \sum_{i=1}^{12} x_i^2 = 36695129, \quad \sum_{i=1}^{12} y_i = 66803, \]
\[ \sum_{i=1}^{12} y_i^2 = 374471231 \quad \text{and} \quad \sum_{i=1}^{12} x_i y_i = 113784494. \]
Since n = 12, we have x̄ = 19979/12 = 1664.92 and ȳ = 66803/12 = 5566.92, hence:
\[ \hat{b}_1 = \frac{\sum_{i=1}^{12} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{12} x_i^2 - n\bar{x}^2} = \frac{113784494 - (12 \times 1664.92 \times 5566.92)}{36695129 - (12 \times (1664.92)^2)} = 0.7468. \]
We then estimate the intercept to be:
\[ \hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x} = 5566.92 - 0.7468 \times 1664.92 = 4323.6. \]
Hence the least squares regression line is:
\[ \hat{y} = 4323.6 + 0.7468x. \]
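As a check, the estimators can be evaluated in Python from the same summary statistics. This is only a sketch with illustrative variable names; it keeps full precision throughout, so the results agree with the hand calculation only up to small rounding differences.

```python
# Summary statistics for the unemployment/crime data (n = 12 areas).
n = 12
sum_x, sum_x2 = 19979, 36695129
sum_y = 66803
sum_xy = 113784494

x_bar, y_bar = sum_x / n, sum_y / n

# Least squares estimates of the slope and the intercept.
b1_hat = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
b0_hat = y_bar - b1_hat * x_bar

print(f"y_hat = {b0_hat:.1f} + {b1_hat:.3f} x")
```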
19.5 Prediction
One of the reasons for calculating the line of best fit is prediction. Specifically, for
some value of x, we can provide a prediction of y. So, returning to the example, how
many offences would you predict if there were 2,000 unemployed people in a city area?
To answer this we just substitute the desired value of x into the least squares regression
line:
\[ \hat{y} = 4323.6 + 0.7468 \times 2000 = 5817. \]
Provided we are predicting y for an x value which is within the available x data, then we
can be fairly confident in our prediction. This is what we call interpolation. However,
if we base our prediction on an x value outside the available x data, then we should view
the prediction with caution. This would be an example of extrapolation which is risky
since the relationship between x and y may change for such out-of-sample values of x.
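The distinction can be built into a small prediction helper, sketched below in Python. The function name `predict_offences` is hypothetical, the range limits are the smallest and largest unemployment figures in Example 19.1, and the value 4,000 is an invented out-of-sample input used only to show the warning.

```python
def predict_offences(unemployment, x_min=1055, x_max=2614):
    """Predict offences from the fitted line and flag extrapolation.

    x_min and x_max are the smallest and largest observed x values.
    """
    b0_hat, b1_hat = 4323.6, 0.7468  # estimates from Example 19.3
    y_hat = b0_hat + b1_hat * unemployment
    within = x_min <= unemployment <= x_max
    kind = "interpolation" if within else "extrapolation"
    return y_hat, kind

for x in (2000, 4000):
    y_hat, kind = predict_offences(x)
    print(x, round(y_hat), kind)
```

The arithmetic is identical in both cases; only the label changes, as a reminder that the second prediction relies on the line continuing to hold outside the observed data.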
Activity 19.2 The following table shows the number of computers (in 000s), x,
produced by a company each month and the corresponding monthly costs (in
£000s), y, for running its computer maintenance department.
Number of computers   Maintenance costs   Number of computers   Maintenance costs
(in 000s), x          (in £000s), y       (in 000s), x          (in £000s), y
7.2                   100                 6.8                   103
8.1                   116                 7.3                   106
6.4                   98                  7.8                   107
7.7                   112                 7.9                   112
8.2                   115                 8.1                   111
The following statistics can be calculated from the data.
\[ \sum_{i=1}^{10} x_i = 75.5, \quad \sum_{i=1}^{10} y_i = 1080, \quad \sum_{i=1}^{10} x_i y_i = 8184.9, \]
\[ \sum_{i=1}^{10} x_i^2 = 573.33 \quad \text{and} \quad \sum_{i=1}^{10} y_i^2 = 116988. \]
(a) Draw the scatter diagram.
(b) Calculate the correlation coefficient for computers and maintenance costs.
(c) Find the best-fitting straight line relating y and x.
(d) Comment on your results. How would you check on the strength of the
relationship you have found?
19.6 Summary
This unit has introduced the concept of correlation to measure the strength of a linear
relationship between two continuous variables. Having seen that a linear relationship
exists between two such variables, it is possible to model the relationship
mathematically using the simple linear regression model. Estimation of the intercept
and slope in the regression model was discussed and the subsequent use of the
estimated model for prediction.

Dependent variable Error term
Extrapolation Independent variable
Intercept Interpolation
Least squares estimation Line of best fit
Negative correlation Perfect correlation
Positive correlation Prediction
Sample correlation coefficient Simple linear regression model
Slope (gradient) Uncorrelated
Learning outcomes
discuss the strength of correlation between two continuous variables
interpret correlation coefficients
explain the purpose of regression
Exercises
Exercise 19.1
(a) Explain the meaning of the term ‘least squares estimation’.
(b) Explain, briefly, the difference between correlation and regression.
Exercise 19.2
Define the term ‘sample correlation coefficient’, r, based on data (x1 , y1 ), . . . , (xn , yn ).
Describe some properties of r in terms of how its value is different when the data have
different patterns of scatter plot.
Exercise 19.3
An area manager in a department store wants to study the relationship between the
number of workers on duty and the value of merchandise lost to shoplifters. To do so,
she assigned a different number of clerks for each of 10 weeks. The results were:
Week   Number of workers (xi)   Loss (yi)
1      9                        420
2      11                       350
3      12                       360
4      13                       300
5      15                       225
6      18                       200
7      16                       230
8      14                       280
9      12                       315
10     10                       410
Here are some useful summary statistics:
\[ \sum_{i=1}^{10} x_i = 130, \quad \sum_{i=1}^{10} y_i = 3090, \quad \sum_{i=1}^{10} x_i y_i = 38305, \]
\[ \sum_{i=1}^{10} x_i^2 = 1760, \quad \sum_{i=1}^{10} y_i^2 = 1007750. \]
(a) Which is the independent variable and which is the dependent variable?
(b) Plot the data in a scatter plot and comment on its shape.
(c) Calculate the least squares regression line.
(d) Interpret the regression coefficients of the fitted line.
(e) Predict the loss when the number of workers is 17.
(f) Compute the correlation coefficient between the number of workers and the loss.
Exercise 19.4
Write down the simple linear regression model, explaining each term in the model.
Exercise 19.5
The following data were recorded during an investigation into the effect of fertiliser in
g/m², x, on crop yields in kg/ha, y.
Crop yields (kg/ha)   160   168   176   179   183   186   189   186   184
Fertiliser (g/m²)       0     1     2     3     4     5     6     7     8
\[ \sum_{i=1}^{9} x_i = 36, \quad \sum_{i=1}^{9} y_i = 1611, \quad \sum_{i=1}^{9} x_i y_i = 6627, \]
\[ \sum_{i=1}^{9} x_i^2 = 204, \quad \sum_{i=1}^{9} y_i^2 = 289099. \]
(a) Plot the data and comment on the appropriateness of using the simple linear
regression model.
(b) Calculate a least squares regression line for the data.
(c) Predict the crop yield for 3.5 g/m² of fertiliser.
(d) Would you feel confident predicting a crop yield for 10 g/m² of fertiliser? Explain
briefly why or why not.
Exercise 19.6
In a study of household expenditure a population was divided into five income groups
with the mean income, x, and the mean expenditure, y, on essential items recorded (in
Euros per month). The results are in the following table.
x      y
1000   871
2000   1300
3000   1760
4000   2326
5000   2950
\[ \sum_{i=1}^{5} x_i = 15000, \quad \sum_{i=1}^{5} y_i = 9207, \quad \sum_{i=1}^{5} x_i y_i = 32805000, \]
\[ \sum_{i=1}^{5} x_i^2 = 55000000 \quad \text{and} \quad \sum_{i=1}^{5} y_i^2 = 19659017. \]
(a) Fit a straight line to the data.

(b) How would you use the fit to predict the percentage of income which households
spend on essential items? Comment on your answer.
20. Fundamentals of regression II – Interpretation of computer output and assessing model adequacy
Unit 20: Fundamentals of regression II
Interpretation of computer output and assessing model adequacy
Overview
In practice, regression calculations are performed by computer and so in this final unit
we will consider how to interpret computer output to assess the adequacy of a given
regression model. Central to this are whether the model provides a good ‘fit’ to the
data, in terms of explanatory power, and the statistical significance of the explanatory
variable.
Aims
This unit considers how to judge whether a regression model is ‘good’ and how to
interpret typical computer output of a regression. Particular aims are:
to assess how good a regression model is in terms of explaining variation in the
dependent variable
to familiarise you with interpreting standard computer output from regression
modelling.
Background reading
20.1 Introduction
We conclude the course with a discussion of computer output for the simple linear
regression model and how to assess the adequacy of a particular model. Remember our
aims are decision-making and prediction. In order to make the best decisions and
predictions we need to use the best models.
20.2 Analysis of variance

Our overall objective is to explain the response variable, y, which is a random variable.
Specifically, we try to explain the variation in y. Using simple linear regression, we
attempt this using a single explanatory variable. The total variation in the response
variable sample data is simply:
\[ \text{TSS} = S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2. \tag{20.1} \]
We call this the total sum of squares (TSS). As can be seen from (20.1), TSS is
simply the sum of the squared deviations of the response observations, the yi's, about
the mean, ȳ.¹ We can decompose TSS into two components which are:
the amount of variation which we are able to explain using the proposed model,
called the explained sum of squares (ESS)
the remaining (or residual) variation which we are unable to explain with the
model, called the residual sum of squares (RSS).
Hence:
TSS = ESS + RSS. (20.2)
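The decomposition in (20.2) can be verified numerically. The Python sketch below fits the least squares line to the unemployment/crime data from Example 19.1 and computes the three sums of squares. It assumes the usual definitions, which are not written out above: ESS is the sum of squared deviations of the fitted values from ȳ, and RSS is the sum of squared differences between observed and fitted values.

```python
unemployment = [2614, 1160, 1055, 1199, 2157, 2305,
                1687, 1287, 1869, 2283, 1162, 1201]
offences = [6200, 4610, 5336, 5411, 5808, 6004,
            5420, 5588, 5719, 6336, 5103, 5268]

n = len(unemployment)
x_bar = sum(unemployment) / n
y_bar = sum(offences) / n

# Least squares estimates of the slope and the intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(unemployment, offences))
s_xx = sum((x - x_bar) ** 2 for x in unemployment)
b1_hat = s_xy / s_xx
b0_hat = y_bar - b1_hat * x_bar
fitted = [b0_hat + b1_hat * x for x in unemployment]

tss = sum((y - y_bar) ** 2 for y in offences)                # total
ess = sum((f - y_bar) ** 2 for f in fitted)                  # explained
rss = sum((y - f) ** 2 for y, f in zip(offences, fitted))    # residual

print(round(tss), round(ess + rss))
```

The two printed values agree, up to floating-point rounding, which is exactly the statement TSS = ESS + RSS.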
20.3 Coefficient of determination, R²

We can assess the overall fit of a model using the coefficient of determination. This
measures the proportion of the total variability in the response variable explained by the
model.
Coefficient of determination
The coefficient of determination, denoted R², is defined as:
\[ R^2 = \frac{\text{ESS}}{\text{TSS}}. \]
Note that, as a proportion, 0 ≤ R² ≤ 1.² The closer R² is to 1, the better the
explanatory power of the model. Note also that R² = r² for a simple linear model
(only), where r is the sample correlation coefficient.
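The equality of the coefficient of determination and the squared correlation coefficient in simple linear regression can be checked with a short Python sketch. It reuses the unemployment/crime data from Example 19.1; the function names `r_squared` and `correlation` are illustrative, and ESS is taken to be the sum of squared deviations of the fitted values from ȳ.

```python
from math import sqrt

def r_squared(xs, ys):
    """ESS / TSS for the least squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1_hat = s_xy / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    ess = sum((b0_hat + b1_hat * x - y_bar) ** 2 for x in xs)
    tss = sum((y - y_bar) ** 2 for y in ys)
    return ess / tss

def correlation(xs, ys):
    """Sample correlation coefficient via S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

unemployment = [2614, 1160, 1055, 1199, 2157, 2305,
                1687, 1287, 1869, 2283, 1162, 1201]
offences = [6200, 4610, 5336, 5411, 5808, 6004,
            5420, 5588, 5719, 6336, 5103, 5268]

R2 = r_squared(unemployment, offences)
r = correlation(unemployment, offences)
print(round(R2, 3), round(r ** 2, 3))
```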
Given our objective is to explain as much of the variation in the response variable as
possible, we would like R² to be as close to 1 as possible. Ideally we would like R² to be
equal to 1. However, in practice we would not expect a perfect linear relationship to
exist between the variables – this is especially true when dealing with social science
variables due to the complex interdependencies which exist between such variables. In
fact, an R² of around 0.6 may be considered ‘good’.
When conducting simple linear regression, if we had a choice of candidates to use as the
explanatory variable then we would prefer the one which maximises R². We would then
have identified the simple linear regression model with the greatest explanatory power.
¹ For example, if all the response observations were the same, that is, y1 = y2 = · · · = yn, then they
are all equal to ȳ which means TSS = 0, and hence there is no variation in the response variable to
explain. Therefore, there would be no need for a regression model!
² Also, note that since TSS, ESS and RSS are ‘sums of squares’ they are each non-negative. Using
(20.2), it follows that ESS ≤ TSS, hence 0 ≤ R² ≤ 1.
However, how do we choose the candidate explanatory variables in the first place? Well, it
would be sensible to draw on our prior knowledge about the response variable and
general common sense to come up with some obvious choices. For example, if our
response variable is a macroeconomic variable (consumption, say) then we could use
basic economic theory to come up with something suitable as an explanatory variable
(income, say).
20.4 Computer output

Computer output usefully displays important regression results. We consider an
example where we attempt to explain Graduate Record Examination (GRE) scores by
the number of mathematics courses taken by students. Typical output has the following
form:
Regression equation: \( \widehat{\text{GRE}} = 812.988 + 50.479 \times \text{Courses} \)

Predictor   Coefficient   Std. error   t ratio   p
Constant    812.98763     70.73298     11.4938   0.000
Courses     50.478786     3.347518     15.0795   0.000

S = 120.37   R-sq = 92.3%   R = 0.9607
We now explore the output components in detail.
In the ‘Predictor’ column, the ‘Constant’ relates to the y-intercept and ‘Courses’ is
the explanatory variable.
In the ‘Coefficient’ column, the parameter estimates b̂0 and b̂1 (of the y-intercept
and slope of the regression line, respectively) are provided, yielding the fitted
regression line (using appropriate rounding) of:
\[ \widehat{\text{GRE}} = 812.988 + 50.479 \times \text{Courses}. \]
In the ‘Std. error’ column, the ‘Courses’ value (3.347518) is the standard error of
the slope, denoted sb̂1 , which measures the precision of the slope estimate.
Similarly, the ‘Constant’ value (70.73298) is the standard error of the y-intercept,
denoted sb̂0 , which measures the precision of the y-intercept estimate, although we
shall not consider this term any further in this course.
In the ‘t ratio’ column, the ‘Courses’ value (15.0795 = 50.478786/3.347518) is the t
statistic:
\[ t = \frac{\hat{b}_1}{s_{\hat{b}_1}} \]
which can be used to perform a statistical test to assess the significance of
‘Courses’ as an explanatory variable of ‘GRE’ – that is whether or not the true
slope b1 = 0. However, we shall perform the test using the ‘p-value’, discussed next.
Similarly, there is a t statistic, t = b̂0/sb̂0, for testing whether or not the true
intercept b0 = 0 – that is whether or not the true line passes through the origin –
but, again, we shall not consider this further in this course.
In the ‘p’ column, the ‘Courses’ value is the p-value for a ‘two-sided’ hypothesis
test of whether the true slope b1 = 0 or b1 ≠ 0. All p-values less than 0.05 suggest
that b1 ≠ 0 and hence that the explanatory variable is statistically significant.³
In the bottom row, ‘S’ is the standard error of the regression – the standard
deviation of the observed y values about the predicted ŷ values. It is an estimate of
the standard deviation of the model error term, ε, and tells us by how much the
observations typically vary about the regression line.
In the bottom row, ‘R-sq’ is the value of the coefficient of determination, R², which
is the proportion of the variation in y explained by x.
In the bottom row, R (where R = ±√R-sq, taking the sign of the slope estimate) is the
sample correlation coefficient, r, as defined in the previous unit.
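The relationships between the columns can be confirmed with a few lines of Python. The numbers are copied from the GRE output above; the dictionary layout and variable names are assumptions made for this sketch.

```python
from math import sqrt

# Values copied from the GRE regression output.
coefficients = {"Constant": 812.98763, "Courses": 50.478786}
std_errors = {"Constant": 70.73298, "Courses": 3.347518}
r_sq = 0.923  # R-sq = 92.3%

# Each t ratio is the coefficient divided by its standard error.
t_ratios = {name: coefficients[name] / std_errors[name]
            for name in coefficients}
for name, t in t_ratios.items():
    print(f"{name}: t ratio = {t:.4f}")

# R is the square root of R-sq, carrying the sign of the slope estimate.
R = sqrt(r_sq) if coefficients["Courses"] >= 0 else -sqrt(r_sq)
print(f"R = {R:.4f}")
```

Recomputing these values is a quick way to check that a table of output has been read correctly.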
Testing the statistical significance of x as an explanatory variable is worth further
discussion. Why are we interested in whether or not the true slope b1 = 0? Well, let us
revisit the simple linear regression model:
y = b0 + b1 x + ε.
If b1 = 0, then changes in x have no effect on y, whereas if b1 ≠ 0 then changes in x
have some effect on y. Clearly, if b1 > 0 then a one-unit increase in x leads to a b1-unit
increase in y, while if b1 < 0 then a one-unit increase in x leads to a |b1|-unit decrease in
y.
Recall that our objective is to explain as much of the total variation in y as possible
with our regression model. Therefore, concluding whether or not b1 = 0 is essential for
determining whether x is a true explanatory variable of y. If we concluded that b1 = 0
when using a particular explanatory variable x in our model, then this indicates x has
no effect on y and, therefore, the particular model is of no use in terms of explaining y.
Returning to our GRE example, our simple linear regression model can be written as:
GRE = b0 + b1 Courses + ε. (20.3)
We observe that the p-value associated with ‘Courses’ is 0.000 which is clearly below⁴
our threshold value of 0.05 and hence we conclude that b1 in (20.3) is not equal to 0.
Therefore, the number of mathematics courses taken by students, ‘Courses’, does help
to explain GRE scores. Indeed, we estimate that taking one additional mathematics
course will increase a student’s GRE score by 50.479 points, on average.
Example 20.1 A swimming pool construction company wondered whether jobs

can be completed in a shorter time if more workers are used. Data were collected on
the number of workers and the time of completion (in hours) for a sample of 27 pool
construction jobs.
³ Although hypothesis testing is beyond the scope of this course, we will consider the test of whether
b1 = 0 to determine the statistical significance of x as an explanatory variable of y. For our purposes, the
test simply involves comparing the p-value reported in the computer output with the threshold value of
0.05, with a p-value less than 0.05 indicating that the explanatory variable in the model is ‘statistically
significant’.
⁴ Note this p-value is not exactly zero; it is zero only to three decimal places, i.e. the p-value is less
than 0.0005.
The results of the regression analysis were:
Regression equation: \( \widehat{\text{Time}} = 141.57 - 6.725 \times \text{Workers} \)

Predictor   Coefficient   Std. error   t ratio    p
Constant    141.56994     6.26140      22.6099    0.000
Workers     −6.724680     0.54940      −12.2401   0.000

S = 15.41   R-sq = 85.7%   R = −0.9257
By looking at the sample correlation coefficient, −0.9257, we see that there is a very
strong, negative linear relationship between the number of workers and the
construction time. This means the more workers on a job, the shorter the completion
time.
Looking at the p-value of the ‘Workers’ explanatory variable, we see it is 0.000 (to
three decimal places) which is clearly below 0.05, indicating that the number of
workers is a highly significant explanatory variable so is useful in explaining the
response variable. The R² value tells us that this model is able to explain 85.7% of
the variation in pool construction time using the number of workers as the
explanatory variable.
The coefficient of ‘Workers’ is −6.725 which means each additional worker on a pool
construction job reduces the completion time by 6.725 hours.
Example 20.2 The coach of a school basketball team wanted to investigate

whether there was a relationship between the heights of players and the average
number of points scored per game. Data were collected from the 12 members of the
school team and a simple linear regression was performed with ‘Height’ as the
explanatory variable. The results were:
Regression equation: \( \widehat{\text{Average points}} = -40.36 + 0.706 \times \text{Height} \)

Predictor   Coefficient   Std. error   t ratio   p
Constant    −40.3606      33.50440     −1.2046   0.256
Height      0.706061      0.493902     1.4296    0.183

S = 5.84   R-sq = 16.9%   R = 0.4119
The correlation coefficient is 0.4119, indicating positive correlation between the heights of
players and the average number of points scored per game. However, the ‘small’
value suggests that the linear relationship between the variables is moderately weak.
Turning to the regression results, we note the p-value of ‘Height’ is 0.183, which exceeds
0.05, and so ‘Height’ is not a statistically significant explanatory variable.
This is also reflected in the coefficient of determination which tells us that only
16.9% of the variation in the average number of points scored per game can be
explained by the heights of players.
So, the given regression model is of no practical use. The coach might want to
consider alternative explanatory variables and see if one can be found which is
statistically significant and results in a reasonable $R^2$. We might recommend hours
of practice per week as a possible choice, or perhaps the number of years spent
playing basketball.
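The decision rule used in these examples, comparing the reported p-value with 0.05, can be
written as a one-line check. The sketch below is only an illustration of that rule; the function
name is our own invention, and the p-values are those reported for the ‘Workers’ and ‘Height’
explanatory variables.

```python
ALPHA = 0.05  # threshold used throughout this unit


def is_significant(p_value, alpha=ALPHA):
    """Return True when the p-value falls below the threshold."""
    return p_value < alpha


# Reported p-values of the slope coefficients. Note that 0.000 is a value
# rounded to three decimal places, not an exact zero.
reported = {"Workers": 0.000, "Height": 0.183}

verdicts = {name: is_significant(p) for name, p in reported.items()}
for name, significant in verdicts.items():
    label = "statistically significant" if significant else "not statistically significant"
    print(f"{name}: p = {reported[name]:.3f} -> {label}")
```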
Example 20.3 An estate agent recorded sales of houses according to the sale price,
in pounds, and the size of living space, in square metres. She was interested in
investigating the relationship between these two variables and wondered whether the
sale price could be predicted from the size of living space. A regression model was
estimated with the following results:
Regression equation: $\widehat{\text{Sale price}} = -1263752 + 11647 \times \text{Living space}$

Predictor      Coefficient   Std. error   t ratio   p-value
Constant       −1263752      332772       −3.798    0.002
Living space   11647         1244         9.362     0.000

S = 609400   R-sq = 86.2%   R = 0.9284
It is clear that there is a very strong positive correlation between the sale price and
amount of living space, with a sample correlation coefficient of 0.9284. This is
perhaps not too surprising since we would expect larger properties to be worth more.
‘Living space’ is statistically significant due to the small p-value and the coefficient
of 11647 can be interpreted by saying that for every extra square metre of living
space, the sale price increases by £11,647.
The coefficient of determination tells us that 86.2% of the variation in house sale
prices can be explained by living space alone. What might account for the other
13.8%? Perhaps the number of bedrooms, location, age of the property, allocated
parking etc.
The intercept of the model is negative. Is this reasonable? Well, the intercept gives
the predicted value of the response variable when the explanatory variable is zero.
Clearly, a sale price cannot be negative! However, we would expect a minimum
amount of living space for any house (perhaps 50 square metres?), so the model is
fine as we would never encounter properties with near-zero amounts of living space!
Finally, note the ‘large’ value of S, the standard error of the regression. This is
purely a consequence of the large values of the response variable, since sale prices are
in pounds, rather than hundreds of thousands of pounds. Always pay attention to
the units of measurement!
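To see how the estimated line is used for prediction, and why the units matter, consider the
short Python sketch below. It is illustrative only: the function name is our own and the
living-space values of 150 and 200 square metres are hypothetical inputs, assumed to lie within
the range of the sample data (which are not shown here).

```python
# Coefficients copied from the house price output (sale price in pounds).
intercept, slope = -1263752, 11647
s = 609400  # standard error of the regression, in pounds


def predicted_price(living_space_m2):
    """Predicted sale price (in pounds) from the fitted line."""
    return intercept + slope * living_space_m2


# Hypothetical property sizes, assumed to be within the sample range.
for size in (150, 200):
    print(f"{size} square metres: predicted price of {predicted_price(size):,} pounds")

# Living space at which the fitted line crosses zero; predictions for sizes
# near or below this value are not meaningful.
zero_crossing = -intercept / slope
print(f"Fitted line crosses zero at {zero_crossing:.1f} square metres")

# The same S expressed in hundreds of thousands of pounds looks much smaller.
s_in_hundred_thousands = s / 100_000
print(f"S = {s_in_hundred_thousands} hundred thousand pounds")
```

Changing the units of the response variable rescales S in the same way, which is why its
magnitude should always be read alongside the units of measurement.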
Activity 20.1 A retailer has asked you to develop a model which could be used to
predict total sales for some proposed new retail locations. As an expanding retailer,
it needs accurate predictions to determine whether it would be profitable to build
new stores at various locations. The company has obtained data from a household
survey on retail sales per household, y, and income per household, x. You run a
simple linear regression model and obtain the results below. Comment on the
adequacy of the model.
Regression equation: $\widehat{\text{Sales}} = 559.5 + 0.3815 \times \text{Income}$

Predictor   Coefficient   Std. error   t ratio   p-value
Constant    559.5         1145.1       0.386     0.704
Income      0.3815        0.02529      15.084    0.000

S = 147.7   R-sq = 91.9%   R = 0.9586
Activity 20.2 A company sets different prices for its DVD system in different
regions of its country of operation. Data on the number of units sold and the
corresponding prices were collected and a simple linear regression analysis
performed. The regression results are:
Regression equation: $\widehat{\text{Sales}} = 457.16 - 0.3331 \times \text{Price}$

Predictor   Coefficient   Std. error   t ratio   p-value
Constant    457.1648      33.2657      13.743    0.000
Price       −0.3331       0.2011       −1.656    0.149

S = 30.3   R-sq = 31.4%   R = −0.5604
What conclusions can you draw?
20.5 Several explanatory variables

Previously we saw simple linear regression which was characterised by one explanatory
variable. Often one explanatory variable is not enough to adequately explain the total
variation in the response variable, in which case we could add more explanatory
variables.
For example, absenteeism in the workforce could be due to:
• hours worked
• flexibility in work practice
• salary paid etc.
while the salary for managers could be related to:
• qualifications
• experience
• hours worked
• performance etc.
Remember the aim of statistics is decision-making and prediction. In order to make the
best decisions and predictions we need to use the best models. This often means making
the models more complex by adding more explanatory variables. However, the models
should not be too complex.5
It is very straightforward to extend the simple linear regression model to incorporate
several explanatory variables. Multiple linear regression is just a natural extension
of this framework, but with more than one explanatory variable.
Example 20.4 Suppose XYZ Catering sells catering goods and senior management
wants to know which factors affect sales. Your team has data on sales, clients,
suppliers etc. How does the management question translate into a model? What is
the dependent variable? What is (are) the independent variable(s)?
Clearly, here the dependent variable is sales (the variable we are trying to explain).
There could be several explanatory factors such as:
size of client company
type of client company
location
etc.
which we could take to be independent variables.
However, multiple linear regression is beyond the scope of this course so it will not be
discussed further, although many of you are likely to meet multiple linear regression
during your undergraduate studies.
20.6 Summary
In practice most datasets which are used for regression are large and the estimation of
regression parameters can be computationally intensive. Therefore, we tend to use
computers to perform regression analysis. In this final unit of the course, we have looked
at the interpretation of regression output. Specifically, we have seen how to obtain the
equation of the best-fitting line, how to determine how much of the total variation in
the response variable can be explained by the model (using $R^2$) and how to determine
whether the explanatory variable in our model is statistically significant. Finally, we
briefly looked at introducing more than one explanatory variable into the regression
model which has the advantage of being more realistic, but the disadvantage of leading
to a more complicated model.

5 A principle known as Occam’s razor says that simplicity is preferred to complexity, other things
equal.

Coefficient of determination
Computer output
Decision-making
Explained sum of squares
Explanatory power
Multiple linear regression
p-value
Prediction
Residual sum of squares
Standard error
Total sum of squares
Total variation
Learning outcomes
determine how good a regression model is at explaining the dependent variable
interpret the computer output of a regression model
assess the statistical significance of the explanatory variable
Exercises
Exercise 20.1
The head of a statistics department has taken data from her instructors to observe the
correlation between the number of homework assignments the instructors give for a
course and the average course grade for the students. The following is a plot of the
residuals, defined as $y_i - \hat{y}_i$, for this study.
[Figure: Plot of residuals. Vertical axis: Grade Average (residuals, from −0.2 to 0.2);
horizontal axis: Number of Homework Assignments (from 0 to 15). The plotted points are not
reproduced here.]
(a) Based on the plot of residuals, describe the strength of a linear relationship
between the number of homework assignments and grade average. What would be
a likely value for the correlation coefficient, r? Explain your answer.
(b) Based on the plot of residuals, describe the effect that the number of homework
assignments has on student grades.
(c) Based on the plot of residuals, about how many homework assignments should an
instructor give to maximise a student’s grade average? Explain your answer.
Exercise 20.2
A botanist is studying the relationship between the trunk circumferences of a species of
tree and the number of leaves it has. The scatter plot and regression results are given
here:
[Figure: Scatterplot with Least-Squares Regression Line. Vertical axis: Number of Leaves (from 0
to 3000); horizontal axis: Trunk Circumference (inches) (from 100 to 450). One point is labelled
A. The plotted points are not reproduced here.]

Predictor       Coefficient   Std. error   t ratio   p-value
Constant        934.43        298.850      3.2167    0.0074
Circumference   3.1815        1.06577      2.9852    0.0114

S = 363.09   R-sq = 42.6%   R = 0.6528
(a) What is the equation of the least squares regression line relating the number of
leaves to the trunk circumference in inches? Define any variables used.
(b) If the point A, as shown in the scatter plot above, represents a tree with a trunk
circumference of 350 inches, and 980 leaves, what is the residual, $y_i - \hat{y}_i$, for this
data point?
(c) If the data point A is removed from the sample, what effect will this have on the
correlation coefficient, r? Explain.
Part 3
Appendices
Appendix A
A sample examination paper
Important note: This Sample examination paper reflects the examination and
assessment arrangements for this course in the academic year 2013–2014. The format
and structure of the examination may have changed since the publication of this subject
guide. You can find the most recent examination papers on the VLE where all changes
to the format of the examination are posted.
Mathematics and Statistics
Time allowed: 2 hours.
Candidates should answer ALL questions. Section A (50 marks) covers the
Mathematics part of the course, Section B (50 marks) covers the Statistics part of the
course. Candidates are required to pass BOTH sections to pass the examination.
Candidates are strongly advised to divide their time accordingly.
A list of formulae and the table of cumulative Normal probabilities is provided at the
end of this paper.1
A calculator may be used when answering questions on this paper and it must comply
in all respects with the specification given with your Admission Notice. The make and
type of machine must be clearly stated on the front cover of the answer book.
1 The table is provided at the back of this subject guide in Appendix C.
Section A: Mathematics
Answer ALL questions (50 marks in total).
1. (a) The demand for a product, $q$, is related to its price, $p$, by the equation
\[ q = 10 - p, \]
while suppliers respond to a price of $p$ by supplying an amount, $q$, given by the
equation
\[ q = 4p - 30. \]
Find the equilibrium price and the corresponding level of production.
Write down the supply function. For which values of $p$ and $q$ is it economically
meaningful? (5 marks)
(b) Solve the equation $\log_2(8) - \log_3(9) = \log_{10}(x)$. (5 marks)
(c) Find the derivatives of the following functions. (5 marks)
i. $\cos(x^2)$.
ii. $e^x \cos(x^2)$.
(d) Suppose that you buy a car for £10,000 and its value depreciates continuously
at a rate of 25% per year. What is its value after three years? Explain why the
car’s value is halved after $4\ln(2)$ years. (5 marks)
[You may use the fact that, to 5dp, $e^{0.25} = 1.28403$.]
2. Consider the function $f(x) = x^3 - 2x^2 - 15x$.
(a) Find and classify the stationary points of $f(x)$. (5 marks)
(b) Sketch the curve $y = f(x)$. (5 marks)
(c) Find the area of the region bounded by the curve $y = f(x)$, the x-axis and the
vertical lines $x = -1$ and $x = 1$. (5 marks)
3. Consider an annuity which pays £100 every year. The first payment is to be made
now and further payments will be made at the end of each year for the next n years.
(a) Find the present value of this annuity, simplifying your answer as far as
possible, given that an interest rate of 5% per annum compounded annually is
available to you. (5 marks)
(b) If the annuity is to make eleven payments, what is the smallest lump sum
payment that will be worth more to you than the annuity? (3 marks)
(c) How many payments are needed if the annuity is to be worth more than a
lump sum of £2,000? (5 marks)
(d) If the annuity was a perpetuity, what would be its present value? (2 marks)
[You may use the facts that, to 5dp, $1.05^{11} = 1.71034$ and $\log_{1.05}(21) = 62.40033$.]
Section B: Statistics
Answer ALL questions (50 marks in total).
4. (a) Would the distribution of income (in pounds per year) in the UK most likely
be symmetrically distributed, skewed to the right, or skewed to the left?
Briefly explain why. Which measure of central tendency would you use to
describe income? Justify your choice.
(5 marks)
(b) Given events A and B where P (A) = 0.5 and P (A ∪ B) = 0.7, find P (B) in
the following three cases:
i. when A and B are mutually exclusive

ii. when A and B are independent
iii. when P (A | B) = 0.5.
(5 marks)
(c) In an examination, the scores of students who attend schools of type A are
approximately normally distributed about a mean of 61 with a standard
deviation of 5. The scores of students who attend type B schools are
approximately normally distributed about a mean of 64 with a standard
deviation of 4. Which type of school would have a higher proportion of
students with marks above 70?
(5 marks)
(d) You randomly select 1,000 names from the subscription list of a magazine
designed for hunters. You mail a questionnaire about gun control to these
readers and receive 700 responses. You randomly select 200 of the 700
responses for inclusion in your study. What forms of bias are evident in this
design?
(5 marks)
5. Three members of an exclusive country club, Mr Adams, Miss Brown and Dr

Cooper, have been nominated for the office of president. The probabilities of Mr
Adams and Miss Brown being elected are 0.3 and 0.5, respectively. If Mr Adams is
elected, the probability of an increase in membership fees is 0.8. If Miss Brown or
Dr Cooper is elected, the corresponding probabilities of an increase in membership
fees are 0.1 and 0.4, respectively.
(a) What is the probability that Dr Cooper is elected president?

(5 marks)
(b) What is the probability that there will be an increase in membership fees?
(5 marks)
(c) Given that membership fees have increased, what is the probability that Dr
Cooper was elected president?
(5 marks)
6. For a group of 15 students, the following table shows the average number of hours
per week spent on study and their final results in the corresponding examination.
Number of hours studied, x   16     17.5   11.5   13.5   15     12.5   20.5   14.5
Examination mark, y          77     85     48     59     75     41     95     72

Number of hours studied, x   16.5   13.5   22     18.5   17     19.5   19.5
Examination mark, y          80     70     99     85     83     97     89

Summary statistics for these data are:
\[ \sum_{i=1}^{15} x_i = 247.5, \quad \sum_{i=1}^{15} y_i = 1155, \quad \sum_{i=1}^{15} x_i^2 = 4218.75, \]
\[ \sum_{i=1}^{15} y_i^2 = 92999 \quad \text{and} \quad \sum_{i=1}^{15} x_i y_i = 19750.5. \]
(a) Calculate the sample correlation coefficient for these data and comment.
(5 marks)
(b) Calculate the least squares regression line of y on x.
(5 marks)
(c) Use the calculated line to predict examination marks for a student who studied
for 16 hours. Would you consider a prediction based on 20 hours to be more
accurate? Explain why/why not.
(5 marks)
[END OF PAPER]
Formula sheet
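The chain rule: If $f(x) = f(g)$ for some function $g(x)$, then
\[ \frac{df}{dx} = \frac{df}{dg}\,\frac{dg}{dx}. \]
The product rule:
\[ \frac{d}{dx}\bigl[f(x)\,g(x)\bigr] = \frac{df}{dx}\,g(x) + f(x)\,\frac{dg}{dx}. \]
The quotient rule:
\[ \frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{1}{[g(x)]^2}\left[\frac{df}{dx}\,g(x) - f(x)\,\frac{dg}{dx}\right]. \]
The sum of a finite geometric series is given by
\[ a + ar + ar^2 + \cdots + ar^{n-1} = a\,\frac{1 - r^n}{1 - r}. \]
The variances for a population and a sample are:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2 \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}. \]
For events $A$ and $B$, we have:
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]
\[ P(A \cap B) = P(A)\,P(B \mid A) \]
\[ P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}. \]
For a discrete random variable $X$, we have:
\[ \mu = E(X) = \sum_x x\,P(X = x) \]
\[ E(g(X)) = \sum_x g(x)\,P(X = x) \]
\[ \sigma^2 = \mathrm{Var}(X) = E(X^2) - (E(X))^2. \]
If $X \sim \mathrm{Bin}(n, \pi)$, then:
\[ P(X = x) = \binom{n}{x}\,\pi^x (1 - \pi)^{n-x}, \quad \text{where} \quad \binom{n}{x} = \frac{n!}{x!\,(n - x)!}, \]
\[ E(X) = n\pi \quad \text{and} \quad \mathrm{Var}(X) = n\pi(1 - \pi). \]
If $X \sim \mathrm{Poisson}(\lambda)$, then:
\[ P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad E(X) = \lambda \quad \text{and} \quad \mathrm{Var}(X) = \lambda. \]
The sample correlation coefficient is:
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \]
where:
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \]
\[ S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \]
\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}. \]
The regression slope and intercept are given by:
\[ \hat{b}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \quad \text{and} \quad \hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}. \]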
Appendix B
Solutions to the sample examination paper
The solutions to the sample examination paper provided below are to give guidance
about the level of detail required by the Examiners. In response to a ‘qualitative’
question, it is essential to directly address the areas of the syllabus which are being
assessed. For a ‘quantitative’ question, it is essential that you show all your working as
most of the credit will be for the method, rather than the final answer.
Question 1
(a) As in Section 2.3.3, given the demand equation $q = 10 - p$ and the supply equation
$q = 4p - 30$, the equilibrium price is given by
\[ 10 - p = 4p - 30 \implies 5p = 40 \implies p = \frac{40}{5} = 8. \]
The corresponding quantity is then given by, say, $q = 10 - 8 = 2$. (This is, of course, the
equilibrium quantity.)
Then, as in Section 4.1.4, using the supply equation, we can see that the supply function
is given by $q_S(p) = 4p - 30$. This is economically meaningful as long as $q \ge 0$ and
$p \ge 15/2$ since other values of $p$ or $q$ will make at least one of these quantities negative.
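As a quick check of this working, the equilibrium can be recomputed with exact fractions. The
Python sketch below is illustrative only; the function and variable names are our own.

```python
from fractions import Fraction


def demand(p):
    """Demand equation q = 10 - p."""
    return 10 - p


def supply(p):
    """Supply function q_S(p) = 4p - 30."""
    return 4 * p - 30


# Setting demand equal to supply gives 10 - p = 4p - 30, i.e. 5p = 40.
p_eq = Fraction(10 + 30, 1 + 4)
q_eq = demand(p_eq)

# Smallest price at which the quantity supplied is non-negative: 4p - 30 = 0.
p_min = Fraction(30, 4)

print(f"Equilibrium price: {p_eq}, equilibrium quantity: {q_eq}")
print(f"Supply is non-negative for p >= {p_min}")
```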
(b) As in Section 4.2.2, if we note that
\[ \log_2(8) = \log_2(2^3) = 3 \quad \text{and} \quad \log_3(9) = \log_3(3^2) = 2, \]
the given equation is just
\[ 3 - 2 = \log_{10}(x) \implies \log_{10}(x) = 1, \]
and so, using the definition of the logarithm, we see that $x = 10^1 = 10$.
(c) For (i), the function $f(x) = \cos(x^2)$ is the composition given by $f(g) = \cos(g)$ with
$g(x) = x^2$. Thus, using the chain rule from Section 6.1.3, we find that
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} = [-\sin(g)][2x] = -2x\sin(x^2). \]
For (ii), the function $h(x) = e^x \cos(x^2)$ is the product of the functions $e^x$ and $\cos(x^2)$.
Thus, using the product rule from Section 6.1.1, we find that
\[ \frac{dh}{dx} = [e^x][\cos(x^2)] + [e^x][-2x\sin(x^2)] = e^x[\cos(x^2) - 2x\sin(x^2)], \]
if we use our answer from (i).
(d) As the car is initially worth £10,000 and its value depreciates continuously at a rate
of 25% per year, we can use what we saw in Section 9.3 to see that its value is given by
\[ 10{,}000\,e^{-(0.25)(3)} \]
after three years. We are told in the question that, to 5dp, $e^{0.25} = 1.28403$ and so this
gives us
\[ \frac{10{,}000}{(e^{0.25})^3} \simeq \frac{10{,}000}{(1.28403)^3} = 4{,}723.615, \]
i.e. the car’s value is £4,723.61 after three years.
The car’s value is halved from £10,000 to £5,000 after $t$ years where
\[ 5{,}000 = 10{,}000\,e^{-0.25t} \implies e^{-0.25t} = \frac{1}{2}. \]
Using the definition of ‘ln’, as in Section 4.2.2, this then gives us
\[ -0.25t = \ln\left(\frac{1}{2}\right) \implies t = -4\ln\left(\frac{1}{2}\right) = 4\ln(2), \]
if we use the laws of logarithms. Consequently, as required, the car’s value is halved
after $4\ln(2)$ years.
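The same calculation can be reproduced numerically. The sketch below is illustrative only, with
variable names of our own choosing. It evaluates the value after three years both with the
rounded figure given in the question and with the full-precision exponential, so the two differ
slightly beyond the first decimal place.

```python
import math

initial_value = 10_000
rate = 0.25

# Value after three years using the rounded figure given in the question.
value_from_hint = initial_value / 1.28403 ** 3

# Value after three years using the full-precision exponential.
value_exact = initial_value * math.exp(-rate * 3)

# Time at which the value has halved.
half_life = 4 * math.log(2)
value_at_half_life = initial_value * math.exp(-rate * half_life)

print(f"Using the hint: {value_from_hint:.3f}")
print(f"Full precision: {value_exact:.3f}")
print(f"Value after 4 ln(2) = {half_life:.4f} years: {value_at_half_life:.2f}")
```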
Question 2
We are given the function $f(x) = x^3 - 2x^2 - 15x$.
(a) To find the stationary points of $f(x)$, as in Section 7.2.1, we find its derivative, i.e.
\[ f'(x) = 3x^2 - 4x - 15, \]
and solve the equation $f'(x) = 0$ and so, using factorisation, we can see that
\[ 3x^2 - 4x - 15 = 0 \implies (3x + 5)(x - 3) = 0 \implies x = -\frac{5}{3} \text{ or } 3. \]
Thus, the stationary points occur when $x = -5/3$ and $x = 3$.
To classify these stationary points, we see that the second derivative of $f(x)$ is given by
\[ f''(x) = 6x - 4, \]
and so we note that:
At $x = -5/3$, we have $f''(-5/3) = -14 < 0$ and so this is a local maximum.
At $x = 3$, we have $f''(3) = 14 > 0$ and so this is a local minimum.
Consequently, we see that the stationary points at $x = -5/3$ and $x = 3$ are a local
maximum and local minimum respectively.
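The classification can be double-checked with exact arithmetic. The following Python sketch is
illustrative only; the helper names f, f_prime and f_second are our own.

```python
from fractions import Fraction


def f(x):
    return x**3 - 2 * x**2 - 15 * x


def f_prime(x):
    return 3 * x**2 - 4 * x - 15


def f_second(x):
    return 6 * x - 4


classification = {}
for x in (Fraction(-5, 3), Fraction(3)):
    # Each candidate must satisfy f'(x) = 0 to be a stationary point.
    assert f_prime(x) == 0
    # The sign of the second derivative classifies the stationary point.
    classification[x] = "local maximum" if f_second(x) < 0 else "local minimum"
    print(f"x = {x}: f(x) = {f(x)}, f''(x) = {f_second(x)}, {classification[x]}")
```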
(b) To sketch the curve $y = f(x)$, as in Section 7.2.2, we note that:
The y-intercept, which occurs when $x = 0$, is given by $y = 0$.
The x-intercepts, which occur when $y = 0$, are given by
\[ x^3 - 2x^2 - 15x = 0 \implies x(x^2 - 2x - 15) = 0 \implies x(x - 5)(x + 3) = 0, \]
and so we get $x = -3$, $x = 0$ and $x = 5$.
The stationary points, as found in (a), are
• a local maximum when $x = -5/3$ and $y = f(-5/3) = 400/27$, and
• a local minimum when $x = 3$ and $y = f(3) = -36$.
Lastly, as the highest power of $x$ in $f(x)$ is $x^3$, we should find that $f(x) \to \infty$ as
$x \to \infty$ and $f(x) \to -\infty$ as $x \to -\infty$.
So, using this information, we can set up the sketch and finish it off as illustrated in
Figure B.1.
[Figure B.1: Sketching the curve $y = f(x)$ for Question 2(b). Two panels, ‘The set up’ and
‘The sketch’, marking the x-intercepts at −3, 0 and 5 and the stationary points at
$(-5/3, 400/27)$ and $(3, -36)$.]
(c) As in Section 8.2, to find the area of the region bounded by the curve $y = f(x)$, the
x-axis and the vertical lines $x = -1$ and $x = 1$, we observe that $f(x)$ is positive for
$-1 \le x \le 0$ and negative for $0 \le x \le 1$. This means that the area we need to find is
given by
\[ \int_{-1}^{0} f(x)\,dx + \left| \int_{0}^{1} f(x)\,dx \right|. \]
So, as the first of these integrals gives us
\[ \int_{-1}^{0} (x^3 - 2x^2 - 15x)\,dx = \left[ \frac{x^4}{4} - \frac{2}{3}x^3 - \frac{15}{2}x^2 \right]_{-1}^{0} = 0 - \left( \frac{1}{4} + \frac{2}{3} - \frac{15}{2} \right) = \frac{79}{12}, \]
and the second of these integrals gives us
\[ \int_{0}^{1} (x^3 - 2x^2 - 15x)\,dx = \left[ \frac{x^4}{4} - \frac{2}{3}x^3 - \frac{15}{2}x^2 \right]_{0}^{1} = \left( \frac{1}{4} - \frac{2}{3} - \frac{15}{2} \right) - 0 = -\frac{95}{12}, \]
we find that the required area is $\frac{79}{12} + \frac{95}{12} = \frac{29}{2}$.
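The two definite integrals can be verified exactly by evaluating the antiderivative at the
limits. The Python sketch below is illustrative only; the function name F for the antiderivative
is our own choice.

```python
from fractions import Fraction


def F(x):
    """An antiderivative of f(x) = x^3 - 2x^2 - 15x."""
    x = Fraction(x)
    return x**4 / 4 - 2 * x**3 / 3 - 15 * x**2 / 2


# Signed integrals over the two sub-intervals.
first = F(0) - F(-1)   # f is positive on [-1, 0]
second = F(1) - F(0)   # f is negative on [0, 1]

# The area adds the magnitudes of the two signed integrals.
area = abs(first) + abs(second)

print(f"First integral: {first}, second integral: {second}, area: {area}")
```

Note that simply integrating from −1 to 1 in one step would give the signed total, not the area,
because the negative part would cancel against the positive part.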
Question 3
This question uses the material from Unit 10.

(a) The annuity pays £100 every year. The first payment is made now and so its
present value is £100 whereas given that an interest rate of 5% per annum compounded
annually is available, the payments at the end of the first, second, . . . , nth years will
have present values given by
\[ \frac{100}{1.05}, \quad \frac{100}{1.05^2}, \quad \ldots, \quad \frac{100}{1.05^n} \]
respectively. As such, the present value of the annuity as a whole is the sum of the
geometric series
\[ 100 + \frac{100}{1.05} + \frac{100}{1.05^2} + \cdots + \frac{100}{1.05^n}, \]
which has a first term of 100, a common ratio of 1/1.05 and $n + 1$ terms. Consequently,
using the formula for the sum of a finite geometric series, we see that
\[ 100\,\frac{1 - \dfrac{1}{1.05^{n+1}}}{1 - \dfrac{1}{1.05}} = 2{,}100 \left( 1 - \frac{1}{1.05^{n+1}} \right), \]
is the present value of this annuity.
(b) If the annuity is to make eleven payments, so that $n + 1 = 11$, we see that its
present value is
\[ 2{,}100 \left( 1 - \frac{1}{1.05^{11}} \right) \simeq 2{,}100 \left( 1 - \frac{1}{1.71034} \right) = 872.17395, \]
to 5dp if we use the fact that $1.05^{11} = 1.71034$ to 5dp. Thus, the present value of the
annuity is £872.17 and, from this, we see that £872.18 is the smallest lump sum
payment that is worth more to you than the annuity.
(c) For the annuity to be worth more than a lump sum of £2,000, we need $n + 1$
payments where
\[ 2{,}100 \left( 1 - \frac{1}{1.05^{n+1}} \right) > 2{,}000 \implies 1 - \frac{1}{1.05^{n+1}} > \frac{20}{21} \implies \frac{1}{21} > \frac{1}{1.05^{n+1}}, \]
and so we need to solve the inequality $1.05^{n+1} > 21$. Indeed, given the fact that
$\log_{1.05}(21) = 62.40033$ (5dp), we can use logarithms to see that
\[ n + 1 > \log_{1.05}(21) \simeq 62.40033, \]
and so we would need 63 payments if we want the annuity to be worth more than the
given lump sum.
(d) If the annuity was a perpetuity, its present value would be given by the infinite
geometric series
\[ 100 + \frac{100}{1.05} + \frac{100}{1.05^2} + \cdots + \frac{100}{1.05^n} + \cdots, \]
whose sum is
\[ \frac{100}{1 - \dfrac{1}{1.05}} = 2{,}100, \]
if we use the formula for the sum of an infinite geometric series (or think about what we
found in (a) as $n \to \infty$). Thus, the present value of the corresponding perpetuity would
be £2,100.
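The results for this question can be reproduced numerically. The Python sketch below is
illustrative only: the function name annuity_pv and its parameters are our own, and it works in
floating-point arithmetic rather than with the rounded values quoted in the question, so the
final decimal places may differ slightly from the working above.

```python
import math


def annuity_pv(n_payments, payment=100, rate=0.05):
    """Present value of n_payments annual payments, the first made now.

    Uses the finite geometric series a(1 - r^n)/(1 - r) with first term
    a = payment and common ratio r = 1/(1 + rate).
    """
    r = 1 / (1 + rate)
    return payment * (1 - r**n_payments) / (1 - r)


# (b) Eleven payments.
pv_11 = annuity_pv(11)

# (c) Smallest whole number of payments exceeding log base 1.05 of 21.
payments_needed = math.floor(math.log(21) / math.log(1.05)) + 1

# (d) Perpetuity: sum of the infinite geometric series a/(1 - r).
perpetuity_pv = 100 / (1 - 1 / 1.05)

print(f"PV of 11 payments: {pv_11:.2f}")
print(f"Payments needed to exceed 2000: {payments_needed}")
print(f"PV of the perpetuity: {perpetuity_pv:.2f}")
```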
Question 4
(a) The distribution of income would be skewed to the right. Most people earn, say,
between £12,000 and £60,000, with few earning less. However, a relatively small
number earn a lot more, leading to a long ‘tail’ to the right. We would probably
not use the mode, instead preferring the mean or median. The mean, though, is
sensitive to outliers and so will be ‘pulled’ up due to the few high earners. As such,
it could be argued that the median would be the best measure of central tendency
for representing the ‘average’ income of a UK employee.
(b) i. We use the fact that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). If A and B are
mutually exclusive, then P (A ∩ B) = 0, so:
P (B) = P (A ∪ B) − P (A) = 0.7 − 0.5 = 0.2.
ii. If A and B are independent, P (A ∩ B) = P (A) P (B), so:
P (A ∪ B) = P (A) + P (B) − P (A) P (B).
Hence:
0.7 = 0.5 + P (B) − 0.5 × P (B).
Therefore, 0.5 × P (B) = 0.2, hence P (B) = 0.4.
iii. Again, P (A ∪ B) = P (A) + P (B) − P (A ∩ B) and also:
P (A ∩ B) = P (B) P (A | B)
so:
0.7 = 0.5 + P (B) − 0.5 × P (B).
Hence P (B) = 0.2/0.5 = 0.4.
Alternatively (and more elegantly), P (A) = P (A | B) implies A and B are
independent and so we can use the result of part ii.
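The three cases can be checked with exact fractions by rearranging the addition rule for P(B) in
each case. The following Python sketch is illustrative only; the variable names are our own.

```python
from fractions import Fraction

p_a = Fraction(1, 2)        # P(A)
p_union = Fraction(7, 10)   # P(A or B)

# i. Mutually exclusive: P(A and B) = 0, so P(B) = P(A or B) - P(A).
p_b_exclusive = p_union - p_a

# ii. Independent: P(A or B) = P(A) + P(B) - P(A) P(B),
#     so P(B) = (P(A or B) - P(A)) / (1 - P(A)).
p_b_independent = (p_union - p_a) / (1 - p_a)

# iii. P(A | B) = 0.5: P(A or B) = P(A) + P(B) - P(B) P(A | B),
#      so P(B) = (P(A or B) - P(A)) / (1 - P(A | B)).
p_a_given_b = Fraction(1, 2)
p_b_conditional = (p_union - p_a) / (1 - p_a_given_b)

print(p_b_exclusive, p_b_independent, p_b_conditional)
```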
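(c) The z-scores for types A and B are, respectively:
\[ z_A = \frac{70 - 61}{5} = 1.8 \quad \text{and} \quad z_B = \frac{70 - 64}{4} = 1.5. \]
Since $z_A > z_B$, a mark of 70 lies further into the upper tail for type A schools, and so type B
schools have a higher proportion of students with marks above 70.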
Alternatively, the actual proportions could be calculated. P (Z > 1.8) = 0.0359 and
P (Z > 1.5) = 0.0668, hence type B schools have the higher proportion.
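These tail probabilities can also be obtained without the table in Appendix C. The sketch below
is illustrative only and assumes Python 3.8 or later, where the standard library provides
statistics.NormalDist; the variable names are our own.

```python
from statistics import NormalDist

# Distributions of examination scores for the two types of school.
type_a = NormalDist(mu=61, sigma=5)
type_b = NormalDist(mu=64, sigma=4)

# Proportion of students with marks above 70 is the upper-tail probability.
tail_a = 1 - type_a.cdf(70)
tail_b = 1 - type_b.cdf(70)

print(f"Type A: {tail_a:.4f}, Type B: {tail_b:.4f}")
```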
(d) Several forms of bias exist in this design, including undercoverage bias, response
bias and non-response bias. Readers of a hunting magazine would probably share
positive views about gun ownership. This group of readers is not a representative
sample of the general public when it comes to gun control. Therefore,
undercoverage bias is suggested. Since this group of readers probably has strong
views about gun control, they might answer more often than the general public,
resulting in non-response bias. There is also an expectation among hunters that
gun control is not a good idea. This expectation might lead to a response bias not
found in the general population. In order to try to avoid this form of bias, a
magazine covering a topic unrelated to guns might be a more appropriate
population from which to select a sample.
Question 5
Define the following events:
X is membership fees increase
A is Mr Adams elected
B is Miss Brown elected
C is Dr Cooper elected.
(a) We have P (C) = 1 − P (A) − P (B) = 1 − 0.3 − 0.5 = 0.2.
(b) We have:
\[ P(X) = P(A)\,P(X \mid A) + P(B)\,P(X \mid B) + P(C)\,P(X \mid C) \]
\[ = (0.3 \times 0.8) + (0.5 \times 0.1) + (0.2 \times 0.4) = 0.37. \]
(c) Using Bayes’ theorem, we have:
\[ P(C \mid X) = \frac{P(C)\,P(X \mid C)}{P(X)} = \frac{0.2 \times 0.4}{0.37} = \frac{8}{37} = 0.2162. \]
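The total probability and Bayes’ theorem calculations can be checked with exact fractions. The
Python sketch below is illustrative only; the dictionary names are our own.

```python
from fractions import Fraction

# Probabilities of each candidate being elected.
prior = {"Adams": Fraction(3, 10), "Brown": Fraction(5, 10)}
prior["Cooper"] = 1 - sum(prior.values())

# Probability of a fee increase given each candidate is elected.
increase_given = {
    "Adams": Fraction(8, 10),
    "Brown": Fraction(1, 10),
    "Cooper": Fraction(4, 10),
}

# Total probability of a fee increase.
p_increase = sum(prior[c] * increase_given[c] for c in prior)

# Bayes' theorem: probability Cooper was elected given fees increased.
posterior_cooper = prior["Cooper"] * increase_given["Cooper"] / p_increase

print(prior["Cooper"], p_increase, posterior_cooper, round(float(posterior_cooper), 4))
```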
Question 6
(a) The sample correlation coefficient is:
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{15} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left( \sum_{i=1}^{15} x_i^2 - n\bar{x}^2 \right) \left( \sum_{i=1}^{15} y_i^2 - n\bar{y}^2 \right)}} = 0.9356. \]
This indicates (very) strong, positive correlation between examination marks and
hours of study.
(b) The least squares regression line parameters are estimated to be:
\[ \hat{b}_1 = \frac{\sum_{i=1}^{15} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{15} x_i^2 - n\bar{x}^2} = 5.1333 \]
and:
\[ \hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x} = -7.7000. \]
Hence the estimated regression line is:
\[ \hat{y} = -7.7000 + 5.1333x. \]
(c) For x = 16, the expected examination mark is:
−7.7 + 5.1333 × 16 = 74.43
which we may round to 74. We expect the predicted value for x = 16 to be more
accurate because the available x data cover a range of 11.5 to 22, hence 16 is near
the middle of the sample x values whereas 20 is toward the upper limit.
Interpolation is more accurate for values near the centre of the sample data.
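The intermediate quantities behind these answers can be recomputed from the summary statistics
given in the question. The Python sketch below is illustrative only; the variable names are our
own and follow the notation of the formula sheet.

```python
import math

# Summary statistics given in the question.
n = 15
sum_x, sum_y = 247.5, 1155
sum_x2, sum_y2, sum_xy = 4218.75, 92999, 19750.5

x_bar, y_bar = sum_x / n, sum_y / n

# Corrected sums of squares and cross-products from the formula sheet.
s_xx = sum_x2 - n * x_bar**2
s_yy = sum_y2 - n * y_bar**2
s_xy = sum_xy - n * x_bar * y_bar

# (a) Sample correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# (b) Least squares slope and intercept.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# (c) Predicted examination mark for 16 hours of study.
prediction_16 = b0 + b1 * 16

print(f"S_xx = {s_xx}, S_yy = {s_yy}, S_xy = {s_xy}")
print(f"r = {r:.4f}, b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"Predicted mark for 16 hours: {prediction_16:.2f}")
```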
Appendix C
Table of cumulative normal probabilities
The entries in this table are cumulative probabilities for the standard normal
distribution and give Φ(z) = P (Z ≤ z) for z ≥ 0. For example, P (Z ≤ 1.96) = 0.9750.
For values of z < 0, use P (Z ≤ z) = 1 − P (Z ≤ |z|) = 1 − Φ(|z|). For example,
P (Z ≤ −1) = 1 − P (Z ≤ 1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990