This book contains information obtained from authentic and highly regarded sources. Every effort has
been made to trace copyright holders and to obtain their permission for the use of copyright material.
Reprinted material is quoted with permission, and sources are indicated. A wide variety of references
are listed. Reasonable efforts have been made to publish reliable data and information, but the author
and the publisher cannot assume responsibility for the validity of all materials or for the consequences
of their use. All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system or transmitted in any form or by any means—graphic, electronic, or mechanical, including
photocopying, recording, taping, or information storage and retrieval systems—without permission
of the copyright holder.
ISBN 9780815344919
Published by Garland Science, Taylor & Francis Group, LLC, an informa business,
711 Third Avenue, New York, NY, 10017, USA, and 3 Park Square, Milton Park, Abingdon,
OX14 4RN, UK.
There is a scientific revolution happening in biomedical genetics. The new genetics does
not just apply to the well-known and well-described Mendelian diseases with clear pat-
terns of inheritance, nor is it limited to major chromosomal abnormalities. What makes
the revolution so exciting is that it includes all human diseases and all aspects of human
disease. Diseases that have been largely, but not entirely, ignored in the past are the main
focus of this revolution. The potential arising from this work is astounding. It is already
having an impact and the impact will only grow over time. There are many books on
genetics, but few concentrate on complex diseases—those that do not fit the simple pat-
terns of Mendelian disease and cannot be described as chromosomal abnormalities.
Over the past 15–20 years interest in these genetically complex diseases has taken full
flight. Though earlier studies had identified some important genetic links and associations,
many of the early studies had failed to be replicated and studies in this area of genetics had
developed a poor reputation. There were some good studies and many bad studies. The
difference between good and bad studies is quite well known. However, developments in
the last 20 years have restored interest and confidence in studies of complex disease.
A number of important developments were the keys to opening up this area for high-
quality research. The two most important developments have been the Human Genome
Mapping Project and the development of supercomputers along with the necessary systems
capable of handling the data that very-large-scale studies produce. These two developments
go hand in hand; one is not possible without the other. In 2015, we have the
human genome sequence, the SNP Map and the HapMap. Of course array platforms
for genotyping and application of this knowledge as well as more sophisticated statistical
analysis have also filled an essential gap. Indeed, the genetics of today is as much about sta-
tistics as it is about biology and there are Professors of Statistical Genetics in our academic
institutions who dedicate their research to extracting important facts from the mountains
of data that current studies can generate.
This book addresses the subject of genetics of complex disease and is designed in two parts.
The first part (Chapters 1–5) provides a basic background to genetically complex diseases,
and why and how we study them. The second part (Chapters 6–12) focuses on specific
sub-branches of genetics of complex disease and specific examples to highlight the applica-
tion of genetic data in complex disease and the extent to which this data is fulfilling the
promises of the Human Genome Project.
Chapter 1 covers the necessary background to genetic variation in the human population,
i.e. our evolutionary past and how genetic variation arises. Chapter 2 goes on to define
complex diseases and compare them with Mendelian and chromosomal diseases. Chapter
3 looks at how we investigate complex diseases, including the different plans and strategies
available to us. Do we choose a single gene or region to study, or do we throw the net
wider and investigate the whole genome in a genome-wide association or linkage study?
Chapter 4 considers why we are interested in complex diseases, focusing on the major
promises of the Human Genome Project in relation to complex disease. These suggested
that genetic testing will be used in disease diagnosis, patient treatment and management,
and in understanding disease pathology. Chapter 5 looks at how data from the studies
described below is handled in a range of different statistical tests.
Sufficient information is given in each of Chapters 1–5 to enable students to understand
the major points and, where appropriate, examples are used to illustrate the key con-
cepts (e.g. in Chapter 2, where Crohn’s disease and Hirschsprung’s disease are discussed as
two different models of genetically complex diseases). Chapter 4 uses quite a few disease
examples to illustrate how the genetic information may be used to meet the promises of
the Human Genome Project.
After Chapter 5, the book goes on to look at three specific areas: immunogenetics (Chapter
6), infectious disease (Chapter 7), and pharmacogenetics (Chapter 8).
Chapter 6 on immunogenetics deals with how common variation in genes that regulate
the immune response can increase or reduce susceptibility to common diseases. The chap-
ter concentrates on the major histocompatibility complex on chromosome 6p21.3. The
chapter includes a considerable number of recently studied examples and discusses the
different interpretations that can be applied to the data. In each case, the extent to which
these examples do or do not fulfill the promises of the genome project is considered. There
are positive examples of how genetics can be used as an aid to diagnosis (e.g. in ankylosing
spondylitis), and also how associations and linkage with certain risk alleles may be helping
us to understand disease pathogenesis (e.g. in autoimmune liver disease).
Chapter 7 on infectious disease looks at the past and the present considering how genetic
variations may influence the likelihood of infection per se and the outcome following
exposure to infectious agents. The discussion provides interesting links with mankind’s
early history. The chapter concentrates on a few selected examples to illustrate the con-
cept and demonstrate how the studies discussed are helping to fulfill the promises of the
Human Genome Project. Once again, there are clear examples of genetic investigations
impacting on our understanding of disease pathology.
Chapter 8 on pharmacogenetics discusses past and present developments in a fast-expand-
ing field that is at present providing some of the most promising results in complex disease
genetics. Studies have shown that responses to commonly used pharmacological agents
can be determined by common genetic variation. The impact of this variation ranges from
failure to respond to a drug to life-threatening toxic reactions. The potential to use genet-
ics to tailor therapy and also to develop new therapeutic agents is a real possibility in this
sub-branch of complex disease genetics, and one that major pharmaceutical companies
and academic institutions are aware of.
Chapters 9 and 10 focus on specific disease groups: cancer (Chapter 9) and diabetes
(Chapter 10). These two chapters stand alone because one group of diseases (cancer) has
a very significant impact in terms of morbidity and mortality in the developed and devel-
oping world and the other group (diabetes) is for the most part a perfect example of a
complex disease. The potential medical impact of genetic studies in these diseases is vast.
More rapid diagnosis, better patient care, personal life planning, and personal treatment
planning are all possible. As we gain a greater knowledge of the genetics of these diseases
we will start to have a better grasp on the underlying pathology of each disease, which will
open up doors for diagnosis, treatment, and management. In some cases, this will mean
simple things like changes to a person’s diet; in others, selecting the appropriate chemo-
therapeutic agent to use for a patient. To some extent some of these aims have already been
achieved, but as this book indicates, there is still much to be done.
In Chapter 9 (cancer), a selected number of examples are discussed. These include breast
cancer, prostate cancer, and lung cancer. The selection is based on the most common
cancers, which are also, to some extent, those about which we know the most. Links to
useful websites are given for further information and updates. Diabetes (Chapter 10) is
discussed in its various forms, with type 1 and type 2 diabetes used specifically to
illustrate the difference in their genetic portfolios. Here, the question is: why are two
diseases that have so much in common so different in terms of their genetic profiles?
The last two chapters deal with societal and ethical issues in the new genetic era and the
future of genetics in complex disease. This is a fast-moving area of science. The facts being
produced today will be marketed as diagnostic or prognostic indicators almost as quickly
as they are identified. Genetic testing for risk alleles will soon be normal practice, but this
will have ethical and social consequences. The potential for misuse of genetics is discussed
in Chapter 11, highlighting the importance of understanding what a genetic test in com-
plex disease is really telling you. You will need to know what a genetic profile is telling
you before getting tested. There is considerable commercial interest in genetic profiling
and this has ethical and societal impact. Other points discussed include who owns your
genome and who can access your genetic data.
Chapter 12 closes the book by looking at the techniques and technologies that have been
used and those that will be used in the future. The chapter reminds us that technologies
used in the past will also be used in the future, but it also highlights some fascinating new
possibilities. Most important will be direct sequencing either at the level of the exome (i.e.
protein-coding genes only) or the whole genome.
The structure of the book is designed to provide a basic platform on which students can
build their knowledge base. Each of the chapters (including the basic chapters) uses exam-
ples of disease to illustrate key specific points and provides a reasonable level of basic
current data on each example used. In particular, the book focuses on the promises of the
Human Genome Project that suggested genetics will be used to improve disease diagnosis,
to develop individual treatment and management plans for patients, and to inform the
debate on disease pathogenesis. At each stage and after each example, the text reflects on
the extent to which these promises have been or will be met, looking at both the present
and, if possible, the future. Links to the web are also provided for access to updates and
further information throughout the book. There is an extensive Glossary at the end of the
book.
These are very exciting times for genetics, especially in complex disease. They are also fast-
moving times. The book is written as a starting point (a first block) and for the most part it
is written in an historical style to ensure it remains in date whatever develops in the future.
This book provides a good starting point for anyone studying the genetics of so-called
complex diseases. It is written for the undergraduate student and early postgraduate stu-
dent alike. It is written for the medical and non-medically minded individual. This era is
one of the most exciting eras in modern genetics, perhaps as exciting as when the structure
of DNA was first revealed to the scientific community.
We would like to thank the staff at Garland Science, Liz Owen, David Borrowdale and
Deepa Divakaran, for their support and encouragement in producing this book.
As senior author I would like to give specific thanks to: Robert Taylor (Newcastle
University) who provided advice on the mitochondrial genome, John Mansfield (Newcastle
University) who provided necessary background on inflammatory bowel disease, Roger
Williams (King’s College, London) and Oliver James (Newcastle University) both of
whom provided a supporting environment within which to learn and develop a back-
ground in liver disease and genetics as well as the necessary skills to produce this book,
Derek Doherty (Trinity College, Dublin) who worked with me on the molecular genetics
of the MHC in liver disease at King’s College Hospital, London and the many members
of different research teams who have contributed to my research between 1982 and 2015.
In addition I would like to give special thanks to the hundreds of students who, through
their positive interaction and feedback, have encouraged the writing of this book. Finally,
I would like to give very special thanks to Carolyn Donaldson who encouraged and sup-
ported production of this book from start to finish, especially during difficult times.
Peter Donaldson
The authors and publisher would like to thank external advisers and reviewers for their
suggestions and advice in preparing the text and figures.
Geoffrey Bosson (Newcastle University, UK); Margit Burmeister (University of
Michigan, USA); Angela Cox (University of Sheffield, UK); Rachelle Donn (University
of Manchester, UK); Yalda Jamshidi (St George's, University of London, UK); Martin
Kennedy (University of Otago, New Zealand); Andrew Knight (Newcastle University,
UK); Hao Mei (Tulane University, USA); John Pearson (University of Otago, New
Zealand); Logan Walker (University of Otago, New Zealand); Kai Wang (University of
Iowa, USA); Yun Zhang (Oxford Brookes University, UK).
Preface
Acknowledgments
1 Genetic Diversity
1.1 Genetic Terminology
The use of the terms genes and alleles varies, though they do have precise definitions
1.2 Genetic Variation
Genetic variation can be measured by several methods
Alleles on the same chromosome are physically linked and inherited as haplotypes
Linkage disequilibrium promotes conservation of haplotypes in populations
1.3 Genetics and Evolution
Mutation is the major cause of genetic variation
Genetic variation caused by mutation alters allele frequencies in populations
Migration and dispersal cause gene flow
Allele frequencies can change randomly via genetic drift
The thrifty gene hypothesis
Natural selection acting on different levels of fitness affects the gene pool
1.4 Calculating Genetic Diversity: Determining Population Variability
Genotype and allele frequencies illustrate genetic diversity
Allele frequency refers to the numbers of alleles present in a population
Heterozygosity provides a quantitative estimation of genetic variation
The HWP is a complex but essential concept in population genetics
Calculating expected genotype frequencies using the HWP
Different populations may have different allele frequencies
1.5 Population Size and Structure
Breeding population size is important in evolution
Genetic variation is not always uniform in a population
Wahlund’s principle
1.6 The Mitochondrial Genome
1.7 Gene Expression and Phenotype
Genetic variation is manifested in the phenotype
Phenotypes are influenced by the environment
1.8 Epigenetics
Incidence and prevalence can be very different or very similar depending on the prognosis
for the disease
Incidence and prevalence of disease may vary in different populations
What is the evidence for a genetic component to the disease?
What is known about the disease pathology?
Before we get down to the hard business of study planning there are one or two other
questions that it is important to ask
3.2 Planning Stage 2: Choosing a Strategy
Two basic strategies for identifying risk alleles in complex disease
In terms of the history of genetic studies in complex disease there are two main periods:
pre- and post-genome
Each of these two strategies has a substrategy
3.3 Good and Bad Practice
Accurately identifying true disease susceptibility alleles in GWAS (and other association
studies) is dependent on sample size
Case selection can introduce bias into a study
It is important to consider whether we are studying a disease, a syndrome, or a trait within
a disease subgroup
Selection of appropriate controls is equally important in any study
Errors in the laboratory and in sample handling can also introduce bias into a study
Statistical analysis is the key in any study of complex disease
SNP chip selection is an important factor to consider in study design
Unfortunately publication bias does occur
Replication in an independent sample is crucial for all association studies, especially GWAS
3.4 New Technologies and the Future
The technological advances of the past decade have had a major impact on research into
the genetics of complex disease and the rate of change is going to increase
New developments will come from the ENCODE project, and will also involve more
epigenetics and imputation analysis
The real debate about the future of complex disease research lies not in the genetics itself,
but downstream from the genetics
Conclusions
Further Reading
Predicting disease severity through genetic analysis may have clinical significance in
terms of patient management
Common genetic variation may predict response to treatment and be critical in patient care
Onset, severity, and response to treatment are all part of patient management
4.4 Disease Pathogenesis
Early studies offered potential insight into the biology of ankylosing spondylitis
Later GWAS have offered even further insight into the biology of ankylosing spondylitis
Rheumatoid arthritis has many strong genetic associations, some of which can be used to
help us unravel the pathogenesis of this disease
Bipolar disease is a disease for which there are many weak genetic associations, but few
strong consistent associations
Coronary artery disease is the most common cause of death in the developed world
4.5 What about the Other Diseases?
Conclusions
Further Reading
Linkage disequilibrium is a useful tool in association studies provided you know how to
handle it
The ability to detect a significant association through linkage disequilibrium can increase
the power of an association study
Most association analyses identify multiple SNPs, other genetic variants, and haplotypes
Conclusions
Further Reading
8 Pharmacogenetics
8.1 Definition and a Brief History of Pharmacogenetics
8.2 Cytochrome P450
There is a clear relationship between genotype and phenotype for several forms of
cytochrome P450
The conversion of the analgesic drug codeine, which is administered as a pro-drug and
is activated to morphine by CYP2D6, is of clinical importance
The cytochrome P450 CYP2C9 metabolizes warfarin – a very widely used drug
CYP2C19 activates clopidogrel – a drug widely used to prevent strokes and heart attacks
8.3 Other Drug-Metabolizing Enzymes and Transporters
For phase II conjugation reactions, the UDP glucuronosyltransferase family makes the
largest contribution
Methyltransferases are also important in phase II drug metabolism
Polymorphisms in drug transporters also play a role in pharmacogenetics
8.4 Drug Targets
The relationship between VKOR and coumarin anticoagulants is one of the most consistently
reported genetic associations involving drug targets unrelated to cancer
The efficacy of β-adrenergic receptor agonists widely used in the treatment of allergies may
also be genetically determined
8.5 Adverse Drug Reactions
HLA genotype is a potent determinant of susceptibility to several different types of adverse
drug reactions
Glossary
Index
Color Inserts
The use of the terms genes and alleles varies, though they do have
precise definitions
The terms gene and allele are often used as though they are the same, but it is important to
note that this is incorrect and that the correct term to use when considering genetic varia-
tion is allele. A gene is, as stated above, the basic unit of inheritance. The scientific litera-
ture is peppered with examples of incorrect use of this terminology. Writers often refer to
the “cystic fibrosis gene” and the “hemochromatosis gene” as though only patients with
these diseases possess the gene, when actually all members of the population possess these
genes. In these two examples, which are both Mendelian autosomal recessive disorders,
the difference between affected patients and healthy members of the population is that
patients possess two copies of the disease-causing alleles. Unaffected population members
may have a single copy of the disease-causing allele or may not carry this allele at all.
Instead, they will have one or two copies of the non-disease-causing allele. Thus, it is the
possession of the requisite alleles that causes the disease and not the possession of the gene.
Finally, the term allele is sometimes used to include any genetic variation within a region,
whether or not it is part of the exome or intronic sequence. This would not be acceptable
to all readers of this book, but those with a focused interest in this area may consider this
correct. The use of terminology changes over time.
Figure 1.1: Karyotypes of human chromosomes. The figure illustrates the entire autosome showing banding
patterns for each chromosome in size order. Chromosomal banding was (and is) traditionally used to identify
chromosomes and chromosomal sites for clinical diagnosis. To obtain these patterns it is necessary to first
denature the DNA with enzymes, and then dye the sample to produce light and dark bands. Karyotypes are
assigned based on the chromosome length, banding pattern, and position of the centromere. Chromosome 1 is
the longest, and chromosome 22 is the shortest among the autosomal chromosomes. (From Strachan T & Read A
[2011] Human Molecular Genetics, 4th ed. Garland Science.)
[Figure 1.1 ideogram: numbered banding patterns for chromosomes 1–22, X, and Y. Key: centromere; rDNA; non-concentric heterochromatin.]
Most individuals have two copies of a given gene – one inherited from the mother and
one from the father. As a result, they are diploid. The genotype is the set of alleles that
an individual possesses. An individual may have two identical alleles, in which case they
are homozygous, or two different copies, in which case they are heterozygous (Figure 1.2).
When we consider the expression of a genetic variant we use the term phenotype. A phe-
notype can also be referred to as a trait or characteristic and may be either physical, physi-
ological, biochemical, or behavioral. Thus, the condition of having blue eyes or dark hair
is a phenotype, but so is having sickle cell anemia. Phenotypes are most often referred to
as traits or characteristics when they do not relate to an illness or disease.
In 2001, the first draft of the map of the human genome was published. Even though this
was not the complete sequence, it marked the beginning of a new era in genetics frequently
referred to as the post-genome era. The great advantage of working in the post-genome
era is that we have access to the genome map, and the majority of human genetic variation
is known and available through websites such as Human Genome Resources
(http://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml) and the SNP Map
(http://www.ncbi.nlm.nih.gov/SNP/).
Person A has the same allele at the marked gene on both chromosomes and is therefore homozygous
Person B has a different allele at the marked gene on the two chromosomes and is therefore heterozygous
Figure 1.2: Homozygous versus heterozygous. The figure illustrates a single pair of chromosomes in two
individuals (A and B). In contrast to the picture in Figure 1.1, the band represents a single gene. Individual A
inherits the same allele from both parents (i.e. both black) and is therefore homozygous for this genotype, while
individual B inherits different alleles (gray and black) for this gene and is therefore heterozygous.
[Figure 1.3 gel image: molecular weight ladder marked at 600, 500, 400, 300, and 200 bp.]
Figure 1.3: VNTRs used to genotype the interleukin (IL)-1 receptor antagonist gene (IL1RN). Genotyping
the IL-1 receptor antagonist 86 bp VNTR sequence using PCR and agarose electrophoresis. The figure shows the
five most common alleles (1–5) and the four most common genotypes (1/2, 1/3, 1/4, and 1/5). The molecular
weight markers indicate the approximate band sizes for each allele on the agarose gel as follows: allele 1, 410 bp;
allele 2, 240 bp; allele 3, 325 bp; allele 4, 500 bp; allele 5, 600 bp. Note the figure does not show the precise
position in the gel and the ladder is illustrative only. The genotypes for each sample can be assigned using the
band sizes obtained.
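The assignment of genotypes from band sizes described in the caption can be sketched in a few lines of Python. This is only an illustration: the allele sizes are the approximate values quoted in the caption, while the function names and the 30 bp tolerance are assumptions made for the example and are not part of any laboratory protocol.

```python
# Approximate band sizes (bp) for the five IL1RN VNTR alleles,
# taken from the Figure 1.3 caption.
ALLELE_SIZES_BP = {1: 410, 2: 240, 3: 325, 4: 500, 5: 600}


def call_allele(band_bp, tolerance_bp=30):
    """Return the allele whose expected size is closest to the observed band.

    Returns None when the closest allele is further away than tolerance_bp
    (the tolerance is a hypothetical value chosen for this sketch).
    """
    allele, size = min(ALLELE_SIZES_BP.items(),
                       key=lambda item: abs(item[1] - band_bp))
    return allele if abs(size - band_bp) <= tolerance_bp else None


def call_genotype(bands_bp, tolerance_bp=30):
    """Assign a genotype such as '1/2' from one or two observed band sizes.

    One band is read as a homozygote, two bands as a heterozygote.
    """
    if len(bands_bp) not in (1, 2):
        raise ValueError("expected one band (homozygote) or two bands (heterozygote)")
    alleles = [call_allele(band, tolerance_bp) for band in bands_bp]
    if None in alleles:
        raise ValueError("a band size does not match any known allele")
    if len(alleles) == 1:
        alleles = alleles * 2  # a single band means both copies are the same allele
    first, second = sorted(alleles)
    return f"{first}/{second}"


if __name__ == "__main__":
    print(call_genotype([405, 245]))  # bands near 410 bp and 240 bp
    print(call_genotype([410]))       # a single band near 410 bp
```

A lane with bands near 410 bp and 240 bp is therefore read as genotype 1/2, and a lane with a single band near 410 bp as 1/1.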
[Figure 1.4 diagram: the A–B–C–D haplotype shown in generation 1 and generation 2; a mutation converts a normal allele into a new disease-causing allele.]
Figure 1.4: The development of a new mutation or allele in the A–B–C–D haplotype. The figure
illustrates the same chromosome in two individuals in two different generations (generations 1 and 2). On
the two chromosomes there are two patterns for the haplotype A–B–C–D. One includes a black normal band
(representing the normal allele) and one includes a dark gray band (representing a “new” mutated allele). In this
illustration the mutation may have arisen through recombination in meiosis. The bands illustrated are not the
same as those seen in Figure 1.1, which are the bands based on chromosome staining for karyotyping.
African populations tend to have a greater variety of haplotypes in any given region than
other populations. This is expected for a population that is older than all others, and there-
fore has had more time to diversify and develop more haplotype variations (Figure 1.4).
In younger populations, such as those in Europe and Asia, fewer haplotypes would be
expected because these populations have descended from smaller founder populations in
which a small subset of the total available haplotypes were present and there has also been
less time for new combinations to develop.
Figure 1.5: Extreme linkage disequilibrium. The MHC illustrates extreme linkage disequilibrium whereby
alleles at closely linked gene loci are inherited together more often than expected by chance. One example of this
is the HLA 8.1 ancestral haplotype (shown above), which is associated with an increased risk of many different
autoimmune diseases, but may also convey some survival advantage. The individual alleles are all common in the
normal northern European population, but occur together at frequencies far greater than expected by chance.
Thus, 60% or more of HLA-B8-positives have HLA-A1, and 90% or more of HLA-B8-positives have HLA-DRB1*03
and DQB1*02. HLA-B8 is the least common of all of these alleles at around 16%, and if the assortment were
random then these pairings would be in equilibrium and the likelihood of finding HLA-B8 and HLA-DRB1*03
together would be the product of their individual frequencies. In this case, the values are approximately 20% for
HLA-DRB1*03 and 16% for HLA-B8. This would mean that instead of seeing approximately 14% of the population
with the combination HLA-DRB1*03–HLA-B8, we would see approximately 3%.
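The arithmetic in the Figure 1.5 caption can be checked with a short sketch. It assumes, as the caption does, that the quoted percentages can be treated as simple frequencies and multiplied together under random assortment. The function name and the returned field names are illustrative choices only.

```python
def linkage_summary(freq_a, freq_b, fraction_of_a_with_b):
    """Compare the expected and observed frequency of two alleles occurring together.

    freq_a, freq_b         -- population frequencies of the two alleles
    fraction_of_a_with_b   -- fraction of allele-A carriers that also carry allele B
    """
    expected = freq_a * freq_b                # random assortment: multiply frequencies
    observed = freq_a * fraction_of_a_with_b  # what is actually seen in the population
    return {
        "expected": expected,
        "observed": observed,
        "excess": observed - expected,        # positive values indicate disequilibrium
        "ratio": observed / expected,
    }


if __name__ == "__main__":
    # Values quoted in the caption: HLA-B8 about 16%, HLA-DRB1*03 about 20%,
    # and 90% or more of HLA-B8-positives carrying HLA-DRB1*03.
    summary = linkage_summary(freq_a=0.16, freq_b=0.20, fraction_of_a_with_b=0.90)
    for name, value in summary.items():
        print(f"{name}: {value:.3f}")
```

With the caption's figures the expected frequency is about 3% and the observed frequency about 14%, so the combination is seen roughly four to five times more often than random assortment would predict.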
when the loci are close together, crossover is less common and linkage disequilibrium is
more likely to persist. Linkage disequilibrium can be used to provide useful information
about the distance between genes. Where there is extreme linkage disequilibrium, hap-
lotypes may be conserved and in many cases these conserved haplotypes are common in
the population. The human MHC (Figure 1.5) illustrates these concepts well (see also
Chapter 6). Linkage disequilibrium is a major tool in understanding modern genome-
wide linkage/association studies (GWLS/GWAS).
Table 1.1 Mutation, migration, genetic drift, and natural selection have different effects on genetic
variation within populations and on genetic divergence between populations.
because blending the total gene pool makes populations similar in terms of their genetic
composition. Note that migration and genetic drift act in opposite directions: migration
increases genetic variation within populations and reduces divergence between populations,
whereas genetic drift reduces genetic variation within populations and increases divergence
among populations. Mutations mostly increase the genetic variability both within and
between populations, though they can occasionally restore the wild-type. Natural selec-
tion, by contrast, can either increase or reduce genetic variation within a population and
increase or reduce genetic divergence between populations.
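The statement that migration reduces divergence between populations can be illustrated with a minimal sketch. It assumes a simple deterministic model in which two populations exchange the same fraction m of their gene pools each generation; the starting frequencies, the value of m, and the function name are hypothetical and chosen only for illustration.

```python
def migrate(p1, p2, m, generations):
    """Track the frequency of one allele in two populations that exchange migrants.

    Each generation a fraction m of each population's gene pool is replaced
    by alleles drawn from the other population.
    Returns a list of (p1, p2) pairs, starting with the initial values.
    """
    history = [(p1, p2)]
    for _ in range(generations):
        p1, p2 = (1 - m) * p1 + m * p2, (1 - m) * p2 + m * p1
        history.append((p1, p2))
    return history


if __name__ == "__main__":
    # Hypothetical starting point: the allele is common in one population
    # and rare in the other, with 5% exchange per generation.
    for generation, (p1, p2) in enumerate(migrate(0.9, 0.1, 0.05, 10)):
        print(f"generation {generation}: p1={p1:.3f} p2={p2:.3f} difference={p1 - p2:.3f}")
```

In this model the difference between the two populations shrinks by a factor of (1 − 2m) every generation while the average frequency stays the same, which is the blending effect described in the text.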
Finally, before considering each of these processes in turn, it is important to make it clear
that populations are affected by many evolutionary forces acting at the same time and that
evolution results from the complex interplay of these processes.
it is much easier to estimate mutation rates than it is for Mendelian autosomal recessive
diseases or for non-Mendelian complex diseases. Estimates of mutation rates for a variety
of human genes lie between 10⁻⁶ and 10⁻⁵ mutations per locus per gamete per generation.
However, the estimated mutation rate is higher for some Mendelian diseases. For example,
in type 1 neurofibromatosis and Duchenne muscular dystrophy the estimated mutation
rate is as high as 10⁻⁴. This is 10–100 times greater than the general mutation rates.
The OMIM (Online Mendelian Inheritance in Man) database (http://www.ncbi.nlm.nih.gov/omim)
lists human genes and it is an excellent source for information on specific genetic
diseases. For many diseases, a larger number of genetic mutations have been
identified than those listed on OMIM, but this is a good starting point to catalog genetic
variations that are linked to or associated with a specific disease and it also has a very good
bibliography for each disease.
Introducing the Hardy–Weinberg Principle
The Hardy–Weinberg Principle (HWP) or Hardy–Weinberg Equilibrium (HWE) test
is one of the central pillars of statistical analysis in population genetics (Table 1.2). The
term equilibrium in population genetics refers to something (an allele or gene) that is
in a state of balance. Equilibrium arises when allele frequencies remain unchanged over time. The
HWE test assesses whether allele frequencies have changed from generation to generation.
The HWP states that in a large breeding population, provided none of the evolutionary
forces described below are operating, allele frequencies will remain the same from genera-
tion to generation. In practice, the HWE test can be used to understand the change in
allele frequencies over time and indicate whether evolution has taken place. HWE is also
used in studies of complex disease to determine whether there is bias in the study sample
and in the qualitative assessment of studies. The HWP is a complex principle and the basic
concept and its application are discussed in more detail in Section 1.4.
Table 1.2 The HWE (p² + 2pq + q² = 1) dictates that the sum of the genotype frequencies is always 100% and
this formula can be used to determine the expected frequency of the different genotypes in a
population.
Figure 1.6: Changes due to recurrent mutations slow as the frequency p of an allele drops. The figure
shows the influence of mutations on the frequency (p) of a single allele over a number of generations. The mutation rate in a single generation is
exceedingly small and because the frequency p of the allele drops with each mutation, the rate of change will become
even slower over time. Reverse mutations will increase the frequency p. Eventually the actions of the
opposing forces, i.e. forward and reverse mutations, will establish equilibrium in the allele frequencies p and q.
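The approach to equilibrium described in Figure 1.6 can be illustrated with a minimal sketch. It assumes a standard two-allele recurrent mutation model that the text does not spell out: allele A mutates to a at rate mu per generation and a mutates back to A at rate nu, so that the new frequency is p(1 − mu) + (1 − p)nu. The symbols mu and nu, the function names, and the rates used are hypothetical and chosen only to make the trend visible.

```python
def mutate_one_generation(p, mu, nu):
    """Return the frequency of allele A after one generation of mutation.

    Assumed model (not given in the text): A -> a at rate mu, a -> A at rate nu.
    """
    return p * (1.0 - mu) + (1.0 - p) * nu


def mutation_trajectory(p0, mu, nu, generations):
    """Return the frequencies of A from generation 0 onwards."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(mutate_one_generation(freqs[-1], mu, nu))
    return freqs


def equilibrium_frequency(mu, nu):
    """Frequency of A at which forward and reverse mutation balance."""
    return nu / (mu + nu)


if __name__ == "__main__":
    # Hypothetical, deliberately inflated rates so that change is visible.
    mu, nu = 1e-4, 1e-5
    traj = mutation_trajectory(1.0, mu, nu, 100_000)
    for gen in (0, 1_000, 10_000, 100_000):
        print(gen, round(traj[gen], 4))
    print("equilibrium:", round(equilibrium_frequency(mu, nu), 4))
```

Under these assumptions the change per generation is largest at the start and shrinks as p approaches the balance point nu / (mu + nu).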
population in each generation, can be sufficient to prevent the accumulation of high levels
of genetic differentiation between populations. Migration adds genetic variation to popu-
lations and increases genetic differences within the recipient population. However, genetic
diversification can also occur in spite of migration when other evolutionary forces such as
natural selection are sufficiently strong.
[Figure 1.7 plot: frequency of A (y-axis, 0.0 to 1.0) against generations (x-axis, 0 to 100) for Populations 1 to 5.]
Figure 1.7: Hypothetical model of genetic drift in five different populations. The figure illustrates the
potential influence of genetic drift in five populations. The model considers the different outcomes over 100
generations. In all cases the starting allele frequencies are 0.5 for A and 0.5 for a and each population is assigned
20 individuals (N = 20). In all cases the frequency of allele A only is considered. The simulations obtained over 100
generations indicate a variety of outcomes with peaks and troughs moving towards frequencies of 1 or 0 for A in
every case.
and the other will be fixed at 100%. As in this case the gene is now monomorphic (i.e.
there is only one allele), all individuals are homozygous for the predominant allele and
there can be no further fluctuation in that population. Genetic drift can lead to homozy-
gosity even in large populations, but this will take many more generations to occur.
Figure 1.7 also illustrates another effect of genetic drift. In the example all five popula-
tions begin with the same allele frequencies (50% or 0.5 for both alleles), but because
genetic drift is random, the frequencies in different populations do not change in the same
way and so populations gradually acquire genetic differences. Consequently genetic drift
will increase the genetic variation between different populations and there will be genetic
divergence over time. In contrast, the opposite effect may also be seen whereby there is
reduced genetic variation within populations. Through random change, an allele may
eventually reach a frequency of either 100% or 0, at which point all individuals in the
population are homozygous for one allele. When an allele has reached a frequency of 1,
we say that it has reached fixation. The other allele is lost (reaching a frequency of 0) and
can be restored only by migration from another population or by mutation. Fixation leads
to a loss of genetic variation within a population. Given enough time, all small popula-
tions will become fixed for one allele or the other. Which allele becomes fixed is random
in the absence of other forms of selection pressure, though it may be determined by the
initial frequency of the allele. If the initial frequency of two alleles is 0.5, both alleles have
an equal probability of fixation; however, if one allele is initially more common, it is more
likely to become fixed.
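The kind of simulation shown in Figure 1.7 can be sketched in a few lines. This is an illustration only, assuming a simple random-sampling model in which each generation's 2N gene copies are drawn at random, with replacement, from the previous generation; the text does not state which model was used for the figure, and the function name and seeds are hypothetical.

```python
import random


def simulate_drift(n_individuals=20, p0=0.5, generations=100, seed=None):
    """Simulate the frequency of allele A under genetic drift alone.

    Assumed model: each generation, 2N gene copies are drawn at random
    (with replacement) from the previous generation's gene pool.
    """
    rng = random.Random(seed)
    n_copies = 2 * n_individuals  # diploid: two alleles per individual
    p = p0
    freqs = [p]
    for _ in range(generations):
        copies_of_a = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = copies_of_a / n_copies
        freqs.append(p)
    return freqs


if __name__ == "__main__":
    # Five hypothetical populations, as in Figure 1.7 (N = 20, p0 = 0.5).
    for population in range(1, 6):
        freqs = simulate_drift(seed=population)
        print(f"Population {population}: final frequency of A = {freqs[-1]:.2f}")
```

Once the frequency reaches 0 or 1 in this sketch it cannot change again, which mirrors the statement that a lost allele can be restored only by migration or mutation. Increasing `n_individuals` slows the fluctuations, in line with the slower drift expected in larger populations.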
Genetic drift can lead to the fixation of deleterious, neutral, or beneficial alleles, but the
effect is greatly influenced by the population size. Allele loss and fixation due to genetic
drift occur more rapidly in small populations. Therefore, in nature, both population size
and geography can influence genetic drift, and consequently the genetic composition of
a population. Some human populations have settled on small islands or in geographically
isolated areas and the allele frequencies within these small isolated populations are more
susceptible to genetic drift. A population may be reduced in size for a number of genera-
tions because of epidemic disease, famine, or other natural or even man-made disasters. As
genetic drift is a random process, small isolated populations tend to be more genetically
dissimilar to other populations. Geography and population size can influence the effect of
genetic drift by creating either a bottleneck or a founder effect.
Bottleneck effect
Changes in population size may influence genetic drift via the bottleneck effect. Natural
and man-made disasters such as famines or war may reduce the size of the founder popula-
tion. Depending on the size of the effect and the original population, this can change the
degree of genetic variability within the population. Such events may eliminate most of the
members of the population, either at random with respect to genetic composition or
selectively, for example when an infectious epidemic favors those with specific alleles.
This can create a bottleneck effect within a population whereby the
level of genetic variation is extremely limited (Figure 1.8).
Figure 1.8: The bottleneck effect. The bottleneck effect can occur as a result of major environmental events
such as famine or plague whereby the founder or parent population is drastically reduced. This may affect the
degree of genetic variability within a population. Natural selection may also operate under these circumstances,
favoring those with specific alleles, especially when the disaster involves infectious disease.
Founder effect
Geography and population size may also influence genetic drift via the founder effect.
The founder effect involves migration, where a small group of individuals separate from a
larger population and establish a colony in a new location. For example, a few individuals
may migrate from a large continental population and become the founders of an island
population. The founding population is likely to have less genetic variation than the origi-
nal population from which it was derived and consequently the allele frequencies in the
founding population may differ markedly from those of their original population.
[Figure 1.9 panels: A, negative selection; B, positive selection; C, balancing selection.]
Figure 1.9: Three different models of natural selection. Rows A, B and C illustrate three patterns of natural
selection (selection signatures). The columns represent three generations: the first column shows the starting
group of four individuals looking at the same chromosome in each, the second column shows the first generation
with mutations, and the third column shows the final outcome for the chromosomes in the three different
patterns of natural selection. Each circle represents a polymorphism, within a haplotype. White circles represent
mutations under neutrality, black circles indicate deleterious mutations, and gray circles indicate advantageous
mutations. Pattern A illustrates genetic polymorphisms under negative selection. Deleterious mutations arise
(black dot) and they can be removed immediately (if severely deleterious, e.g. line 3 in column 3) or kept at low
frequencies (if weakly deleterious, e.g. line 2 in column 3). Linked neutral polymorphism will also disappear (or
be kept at low frequencies, e.g. line 3 in column 3). Pattern B illustrates genetic polymorphism under positive
selection. When a new advantageous mutation arises (shaded circle in line 2, column 2), the allele increases in
frequency (in the population) along with linked neutral polymorphisms (lines 3 and 4 in column 3, which now
resemble line 2, column 2). Pattern C illustrates balancing selection. Two new alleles are shown (shaded and black
circles) and, if they confer advantage in the heterozygous state, they will increase to intermediate frequencies.
Linked neutral polymorphisms will also increase to intermediate frequencies. (From Ermini L, Wilson IJ, Goodship
TH & Sheerin NS [2012] Immunobiology 217:265–271. With permission from Elsevier.)
Purifying selection
Purifying natural selection (also called negative selection) reduces the frequency of det-
rimental alleles in a population. New mutants often have detrimental effects on biological
fitness and purifying selection reduces the number of new mutations in the gene pool. In
humans, 38–75% of all new non-synonymous mutations are thought to be affected by
moderate to strong negative selection. Deleterious mutations are generally found at low
frequencies because of the adverse effect they may have on biological fitness. Negative
selection is responsible for the removal (or maintenance at low frequencies) of mutations
associated with severe Mendelian disorders. Mendelian disease genes come under wide-
spread purifying selection, especially when the disease mutations are dominant.
Positive Darwinian selection
Some mutant alleles introduced to a population by gene flow may be advantageous. In
this case a directional genetic change may allow a population to adapt to its environment
and new, better adapted alleles may replace old, less well adapted alleles. Such selection of
alleles that are advantageous is called adaptive Darwinian selection or positive Darwinian
selection. Under the action of positive selection advantageous alleles rapidly achieve high
frequencies within the population. This occurs at a rate much faster than that of a
neutral allele. As a consequence of this rapid increase few recombination events
will take place and any neutral variation linked to selected variants will also increase in
frequency within the population. This process often results in a transitory increase in the
strength of linkage disequilibrium between alleles on the same haplotype.
Balancing selection
A third form of natural selection is balancing selection, whereby heterozygotes show a
higher level of biological fitness than homozygotes. This leads to the maintenance of two
or multiple alleles in a population at a given locus. Polymorphisms are maintained in the
population for a longer period of time than expected. Balancing selection is often referred
to as heterozygote advantage, especially in cases where a mutant allele known to cause a
disease in homozygotes is found at a high frequency in heterozygous healthy members of
the population. Genome scans suggest that balancing selection is less extensive than positive
selection. However, balancing selection does occur. The two examples below illustrate
heterozygote advantage in two autosomal recessive Mendelian diseases.
Cystic fibrosis is one of the most common autosomal recessive diseases in Northern
European populations, affecting approximately 1:2500 newborn children. The causative
gene in cystic fibrosis is the cystic fibrosis transmembrane conductance regulator
gene (CFTR) and there are currently 1910 mutations on the CFTR mutation database
(http://www.genet.sickkids.on.ca/StatisticsPage.html) (Figure 1.10). Carriers of the
CFTR mutations (heterozygotes) appear to have, or have had in the past, some repro-
ductive advantage over wild-type normal homozygotes. There has been debate over what
this advantage might be. The CFTR gene encodes a membrane chloride channel pro-
tein that is required by some bacteria such as those belonging to the genus Salmonella
[Figure 1.10 chart labels: Δ508, W1282X, G542X, G551D, N1303LYS, Rare.]
Figure 1.10: The five most common CFTR gene mutations. The five mutations listed account for over 70% of
the overall mutations and Δ508 is the most common of all, accounting for approximately two-thirds of all cases.
All of the other mutations, of which there are at least 1905, are found at frequencies of less than 1% and together
these account for approximately 30% of all CFTR mutations.
Figure 1.11: Red blood cells in sickle cell disease. Sickle cells are shaped like a harvesting sickle and, unlike the
normal doughnut-shaped red blood cells, these cells can be hard with sharp edges that can damage the wall of
small blood vessels as they pass through the body. They will often clog the flow of blood and break up as they
pass through the small blood vessels.
(e.g. Salmonella typhi) to enter into epithelial cells. One explanation is that carriers of a
mutant CFTR allele may be more resistant to infection by such bacteria than those with
two copies of the wild-type gene.
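As a worked example, the incidence of cystic fibrosis quoted above can be turned into an approximate carrier frequency. The sketch assumes Hardy–Weinberg proportions and pools all disease-causing CFTR mutations into a single allele a, neither of which is stated in this passage; the function name is illustrative.

```python
import math


def carrier_frequency(incidence):
    """Estimate heterozygote (carrier) frequency for a recessive disease.

    Assumes Hardy-Weinberg proportions, so incidence = q**2 and
    carriers = 2pq, with all disease alleles pooled into one allele.
    """
    q = math.sqrt(incidence)  # frequency of the disease allele
    p = 1.0 - q               # frequency of the wild-type allele
    return 2.0 * p * q


if __name__ == "__main__":
    carriers = carrier_frequency(1 / 2500)
    print(f"carrier frequency = {carriers:.4f} (about 1 in {1 / carriers:.1f})")
```

With an incidence of 1:2500 the disease allele frequency is 0.02, which under these assumptions gives a carrier frequency of roughly 1 in 25. Carriers are therefore far more common than affected individuals, which is why even a modest advantage in heterozygotes could matter at the population level.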
Sickle cell anemia is another example. This is a genetic autosomal recessive blood disorder
that is characterized by red blood cells that occasionally assume an abnormal, rigid, sickle
shape (Figure 1.11). The β-globin allele variant, called HbS, is responsible for the sickling
of red blood cells seen in the disease. Despite the high mortality associated with homo-
zygosity the sickling allele HbS is found at high frequencies in Africa (up to 30%). One
possible explanation for the abundance of the HbS allele in Africa is that heterozygosity
confers some resistance to malaria.
\[ f(AA) = \frac{\text{number of } AA \text{ individuals}}{N} \]
\[ f(Aa) = \frac{\text{number of } Aa \text{ individuals}}{N} \]
\[ f(aa) = \frac{\text{number of } aa \text{ individuals}}{N} \]
The sum of all the genotype frequencies always equals 1 (or 100%).
Genotypes are not permanent. They are disrupted in the processes of segregation and
recombination that take place when individual alleles are passed to the next generation
through the gametes. Alleles, in contrast, are not broken down and the same allele may be
passed from one generation to the next. For this reason the calculation of allele frequencies
is often the preferred choice when determining the genetic variability of a population. In
addition, there are always fewer alleles than genotypes, e.g. for the gene with two alleles A
and a above, there are two alleles, but there are three genotypes. By using alleles, popula-
tion diversity can be described in fewer terms than by using genotypes. Finally, by using
allele frequencies in case control population studies rather than genotype frequencies, no
assumptions about the impact of homozygosity or of heterozygote advantage are being
made. This is especially important in the context of complex disease where in the absence
of a clear pattern of inheritance it would not be appropriate to make any such assumption,
at least in the initial stages of analysis.
If we consider a gene with only two alleles A and a and we suppose the frequencies are p
for allele A and q for allele a; then p and q can be calculated as:
\[ p = f(A) = \frac{2n_{AA} + n_{Aa}}{2N} \]
\[ q = f(a) = \frac{2n_{aa} + n_{Aa}}{2N} \]
In these equations n_AA, n_Aa, and n_aa represent the numbers of AA, Aa, and aa individuals, and
N represents the total number of individuals in the sample. It is necessary to divide by 2N
because being diploid means each individual has two alleles for each gene (one from the
maternal locus and one from the paternal locus).
The sum of the allele frequencies is always 1 (100%) ( p + q = 1); therefore where there are
only two alleles, q can be determined by simple subtraction after p has been calculated:
q = 1 − p
These calculations apply only where there are two alleles. In cases where there are several
different alleles at a locus the calculation used is based on the same principle, but is more
complicated. Statistical software will usually be used to perform complex calculations, but
it is important to understand the underlying principles in any analysis.
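The two-allele calculation above can be written as a short function. This is a minimal sketch; the function name and the genotype counts in the example are hypothetical.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q), the frequencies of alleles A and a, from genotype counts."""
    n_total = n_AA + n_Aa + n_aa           # N, the number of individuals
    if n_total == 0:
        raise ValueError("at least one individual is required")
    p = (2 * n_AA + n_Aa) / (2 * n_total)  # each individual carries two alleles
    q = (2 * n_aa + n_Aa) / (2 * n_total)
    return p, q


if __name__ == "__main__":
    # Hypothetical sample: 50 AA, 40 Aa, and 10 aa individuals.
    p, q = allele_frequencies(50, 40, 10)
    print(p, q)
```

For the hypothetical sample, 140 of the 200 alleles are A, so p is 0.7 and q is 0.3.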
\[ H_E = 1 - \sum_{i=1}^{n} p_i^2 \]
In this equation n is the number of alleles and p_i is the frequency of the ith allele at a locus.
The value of this measure ranges from 0 for no heterozygosity to nearly 1 (i.e. 100%) for
a system with a large number of equally frequent alleles.
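The range of this measure is easy to check numerically. The sketch below is illustrative only; the function name and the example allele frequencies are hypothetical.

```python
def expected_heterozygosity(allele_freqs):
    """Return H_E = 1 - sum(p_i**2) for the allele frequencies at one locus."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


if __name__ == "__main__":
    print(expected_heterozygosity([1.0]))         # a single (fixed) allele
    print(expected_heterozygosity([0.5, 0.5]))    # two equally frequent alleles
    print(expected_heterozygosity([0.01] * 100))  # 100 equally frequent alleles
```

A fixed allele gives 0, two equally frequent alleles give 0.5, and 100 equally frequent alleles give 0.99, which is the sense in which the measure approaches but never reaches 1.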
The HWP depends on certain assumptions, of which the most important are:
• Mating in a population is random – there are no subpopulations that differ in allele
frequency.
• Allele frequencies are the same in males and females.
• All the genotypes are equal in viability and fertility – selection does not operate.
• Mutation does not occur.
• Migration into the population is absent – gene flow does not occur.
• Genetic drift does not occur.
• The population is sufficiently large that the frequencies of alleles do not change from
generation to generation.
The HWP states that after one generation of random mating, in a large breeding popula-
tion, where the restrictions listed above all apply, single-locus genotype frequencies can be
presented as a binomial function (where there are only two alleles) or multinomial func-
tion (where there are multiple alleles). Under the above conditions and over time, allele
frequencies will reach equilibrium and remain constant from generation to generation.
[Figure 1.12 plot: genotype frequency (y-axis, 0 to 1) of p², 2pq, and q² against allele frequency p (lower x-axis, 0 to 1) and q (upper x-axis, 1 to 0).]
Figure 1.12: A plot of the HWE-based genotype frequencies (p², 2pq, and q²) as a mathematical function
of allele frequencies (p and q). The plot illustrates the influence of allele frequencies on genotype frequencies
and shows what we can expect from the HWE test. The plot shows how the two allele frequencies p and q determine
genotype frequencies and how these change as the allele frequencies change. For example the closer q is to 1, the
lower the value of p and the higher the value for the q² genotype (homozygous for the allele with frequency q). When p and q are both set at
0.5 (50%), the frequency of heterozygotes (2pq) reaches its maximum of 0.5.
Then we can use a simple calculation based on the Punnett Square illustrated in Table
1.2. The top side of the square is divided into proportions p and q representing the
frequencies of the male alleles A and a, respectively. The left side represents the same
proportions, but for the female alleles. If we assume there is random union of gam-
etes we can apply the product rule of probabilities. Imagine a pool with male and
female gametes, p with A alleles and q with a alleles, and where zygote formation
occurs by random union. The upper-left square represents the frequency of the homo-
zygous genotype AA. The expected frequency is simply the product of the separate
allele frequencies.
Frequency of AA = p × p = p² (Homozygous for A)
The frequency of homozygous genotype aa is shown in the lower-right square:
Frequency of aa = q × q = q² (Homozygous for a)
The other two squares illustrate the third possibility, i.e. Aa heterozygotes. The total proportion
of Aa heterozygotes can be calculated:
Frequency of Aa = (p × q) + (q × p) = 2pq (Heterozygous)
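The Punnett square reasoning can be applied to a sample of genotype counts to obtain the counts expected under Hardy–Weinberg proportions. The sketch below also compares observed with expected counts using a chi-square goodness-of-fit statistic; that choice of statistic is an assumption, since the text does not say which test it has in mind, and the function names and counts are hypothetical.

```python
def hwe_expected_counts(n_AA, n_Aa, n_aa):
    """Return expected (AA, Aa, aa) counts under Hardy-Weinberg proportions."""
    n_total = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n_total)  # frequency of allele A
    q = 1.0 - p                            # frequency of allele a
    return (p * p * n_total, 2 * p * q * n_total, q * q * n_total)


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing observed with expected genotype counts.

    Assumption: a goodness-of-fit chi-square is used here; the text does not
    specify a test statistic.
    """
    observed = (n_AA, n_Aa, n_aa)
    expected = hwe_expected_counts(n_AA, n_Aa, n_aa)
    # Skip genotypes that are impossible given the allele frequencies.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    # Hypothetical sample of 100 individuals.
    print(hwe_expected_counts(50, 40, 10))
    print(round(hwe_chi_square(50, 40, 10), 3))
```

For the hypothetical sample the allele frequencies are 0.7 and 0.3, so the expected counts are close to 49, 42, and 9. With two alleles the statistic is conventionally compared against a chi-square distribution with one degree of freedom.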
Probabilities
Probability represents the chance that a given event (or outcome) will occur. It is a measure
of the uncertainty and can be a number between 0 and 1. There are different schools
of thought regarding the concept of probability. The use of probability is illustrated in
Box 1.2. In the case of mutually exclusive events (e.g. heads or tails on a single flip
of a coin), the combined probabilities of the outcomes (i.e. heads or tails)
can be calculated by summing the individual probabilities of each event. This is known as
the sum rule. When two (or more) independent outcomes can occur simultaneously (i.e.
they are not mutually exclusive), the joint probability of the outcomes is expressed by the
product rule. The sum rule and product rule are also illustrated in Box 1.2.
Probability
Probability represents the chance of a given event occurring. It is a measure of the uncer-
tainty and can be a number between 0 and 1. If probability is equal to 0 the event cannot
take place; in contrast, if it is 1 the event must occur. Although there are different schools
of thought regarding the concept of probability, we prefer to define the probability as
belief in future events. For example, when we flip a coin, if we state that the probability
of observing heads (A) is 0.5 and the probability of observing tails (B) is 0.5, we believe
heads and tails have equal chance in the next flip. The probability of heads [P(A)] can be
calculated using:
P(A) = A/N
where A is the number of outcomes that count as heads and N is the total number of
equally likely outcomes of a single flip. A coin has two sides, only one of which is heads, and thus
the probability of heads [P(A)] is 1/2 = 0.5. Instead of flipping a coin, we may want to throw
a die and estimate the probability of throwing a 5. In this case, the number of throws
is 1. If there is only one 5 on the die and we are using a six-sided die, the probability of
throwing a 5 is 1/6 ≈ 0.167.
Sum rule
In the case of mutually exclusive events (e.g. heads or tails on a single flip of a coin),
the combined probabilities of the events can be calculated by summing
the individual probabilities of each event. This is known as the addition law of probabil-
ity, or the sum rule:
P(A or B) = P(A) + P(B)
where A and B are two mutually exclusive events, and P represents the probability.
If we flip a coin once, the probability of observing either heads or tails is:
P(heads or tails) = P(heads) + P(tails) = 0.5 + 0.5 = 1
Product rule
When two (or more) independent events can occur simultaneously, meaning that they
are not mutually exclusive, the joint probability of the events is expressed by the product
rule. The product rule states that the joint probability of two independent events occur-
ring is the product of the individual probabilities:
P(A and B) = P(A) × P(B)
where A and B are two events, and P represents the probability.
If we flip two different coins, and the probability of a head from coin 1 is 0.5 and
the probability of a head from coin 2 is 0.5, the joint probability of two heads is:
P(two heads) = P(head coin 1) × P(head coin 2) = 0.5 × 0.5 = 0.25
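The coin and die examples in this box can be reproduced with exact fractions. This is a minimal sketch; the function names are illustrative.

```python
from fractions import Fraction


def sum_rule(*probabilities):
    """P(A or B or ...) for mutually exclusive events."""
    return sum(probabilities, Fraction(0))


def product_rule(*probabilities):
    """P(A and B and ...) for independent events."""
    result = Fraction(1)
    for prob in probabilities:
        result *= prob
    return result


if __name__ == "__main__":
    half = Fraction(1, 2)
    print(sum_rule(half, half))      # heads or tails on one flip
    print(product_rule(half, half))  # heads on each of two coins
    print(float(Fraction(1, 6)))     # a 5 on one throw of a six-sided die
```

The same product rule gives the homozygote frequencies p × p and q × q when gametes unite at random, and the sum rule combines the two ways of forming a heterozygote.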
Migrations from one population into another may also be responsible for stratification.
Thus, a population is considered stratified if:
• Genetic drift occurs in some, but not all, subpopulations.
• Migration does not happen uniformly throughout the population.
• Mating is not genetically random throughout the population.
All of the evolutionary factors that we have already discussed can contribute to the struc-
ture of a population. This structure affects the extent of genetic variation and the pattern
of distribution.
Wahlund’s principle
A population may appear to be homogeneous, but this may be a deception. This can lead
to false-positive associations, as we will see in Chapter 5. Subpopulations and population
stratification may not be obvious in studies of populations and population structure, and
as a result the study samples may include heterogeneous subsamples or clusters from the
study population. When data from subpopulations that differ in allele
frequencies are grouped together, a deficiency of heterozygotes and an
excess of homozygotes will be found, even if Hardy–Weinberg proportions exist within
each subsample. This effect is known as Wahlund’s principle or the Wahlund effect and
is one of the major problems in genetic association studies.
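A small numerical sketch makes the heterozygote deficit concrete. It assumes subpopulations of equal size, each in Hardy–Weinberg proportions internally; the function name and the allele frequencies used are hypothetical.

```python
def wahlund_heterozygosity(subpop_freqs):
    """Compare heterozygosity in a pooled sample of equal-sized subpopulations.

    subpop_freqs: frequency p of allele A in each subpopulation, each assumed
    to be in Hardy-Weinberg proportions internally.
    Returns (observed, expected) heterozygote frequencies for the pooled sample.
    """
    k = len(subpop_freqs)
    # Observed: average of the within-subpopulation heterozygosities 2pq.
    observed = sum(2 * p * (1 - p) for p in subpop_freqs) / k
    # Expected: 2pq computed from the pooled allele frequency.
    p_bar = sum(subpop_freqs) / k
    expected = 2 * p_bar * (1 - p_bar)
    return observed, expected


if __name__ == "__main__":
    # Two hypothetical subpopulations with very different allele frequencies.
    observed, expected = wahlund_heterozygosity([0.2, 0.8])
    print(observed, expected)
```

With frequencies of 0.2 and 0.8, the pooled sample has an allele frequency of 0.5 and would be expected to contain 50% heterozygotes, yet only 32% are observed. When the subpopulations share the same allele frequency the deficit disappears.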
include mtDNA from a number of different populations and also from some ancient
individuals of whom the first three were sequenced in 2008, including one Neanderthal
and two Homo sapiens (the Neolithic Tyrolean Iceman and a Paleo Eskimo).
The substitution rate for the entire mitochondrial genome has been estimated as
1.665 × 10⁻⁸ (±1.479 × 10⁻⁹) substitutions per nucleotide per year, or about one substitution every
3624 years across the whole genome. Though the overall substitution rate is low, substitutions at
some positions occur more frequently than at others. These are referred to as hot-spot
positions and are mostly located in the control regions (positions 16,362, 16,311, 16,189,
16,129, 16,093, 195, 152, 150, and 146). The mutation rate has also been estimated for
both the control and coding regions, and this higher rate of substitution makes mtDNA
particularly valuable in studying relationships in recently diverged lineages.
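A quick arithmetic check relates the per-nucleotide rate to the waiting time between substitutions. It assumes a mitochondrial genome length of 16,569 bp, a figure that is not given in the text; the constant and function names are illustrative.

```python
MTDNA_LENGTH_BP = 16_569           # assumed genome length, not stated in the text
RATE_PER_SITE_PER_YEAR = 1.665e-8  # substitution rate quoted in the text


def years_per_substitution(rate_per_site_per_year, n_sites):
    """Average waiting time, in years, between substitutions over n_sites."""
    return 1.0 / (rate_per_site_per_year * n_sites)


if __name__ == "__main__":
    whole_genome = years_per_substitution(RATE_PER_SITE_PER_YEAR, MTDNA_LENGTH_BP)
    single_site = years_per_substitution(RATE_PER_SITE_PER_YEAR, 1)
    print(f"whole genome: one substitution every {whole_genome:.0f} years")
    print(f"single nucleotide: one substitution every {single_site:.3g} years")
```

Under this assumption the interval of a few thousand years applies to the genome taken as a whole, whereas any single nucleotide is expected to change only about once in 60 million years.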
When applied to modern populations, mtDNA studies clearly support the theory that
modern humans originated in Africa and spread from that continent approximately
56,000–73,000 years ago. This fits with the observation of greater genetic diversity in
African populations and can be observed through the construction of a phylogenetic tree
based on mtDNA variations from populations from all parts of the world. By applying
the molecular clock to the tree, it has been possible to demonstrate that the ancestral
mtDNA, i.e. the one from which all modern mitochondrial genomes are descended,
existed between 140,000 and 290,000 years ago. Phylogenetic reconstruction shows that
this mitochondrial genome was located in Africa and the person who possessed it must
have been African. Thus, we can conclude that the most recent common ancestor for
modern humans is African and because of the matrilineal inheritance of mtDNA, she has
been called the mitochondrial Eve (Figure 1.13). This theory also relies on the observation
that less variation in mtDNA occurs among humans than would be expected. Perhaps this
reflects the importance of mitochondrial gene function, which may drive conservation of
the mitochondrial genome. Alternatively, genetic variation may have been reduced by genetic
drift: a small population size 50,000 years ago could have the same effect as the
bottleneck from a single common ancestor 200,000 years ago.
[Figure 1.13 map: mtDNA haplogroups (L0, L1, L2, L3, M, N, and derived lineages) annotated with approximate dates ranging from 130,000–200,000 to ~3000.]
Figure 1.13: Mitochondrial genotypes and human migration. The figure illustrates human migration out of
Africa and subsequent migrations based on mtDNA genotypes. The “out of Africa” principle of human evolution
and the idea of the mitochondrial Eve are important in understanding human diversity. Based on this figure,
expansion began around 120,000–150,000 years ago in Africa and 56,000–73,000 years ago out of Africa, but
estimates vary and the dates quoted depend on the hominid species being discussed. (From the MITOMAP database;
http://www.mitomap.org.)
1.8 Epigenetics
Epigenetics means above or in addition (“epi-”) to genetics and it may be typically defined
as the study of heritable changes in gene expression that are not due to changes in the
DNA sequence. Epigenetics can involve chemical modifications of DNA or proteins that
are closely associated with DNA (e.g. histones) and a prominent role for RNA is also
emerging. The structure of the chromosome in the eukaryotic cell is highly ordered and
it undergoes a process of compaction. To achieve these compact structures DNA is com-
bined with various proteins, and then coiled and super-coiled to form chromatin. The
basic unit of chromatin [the nucleosome core particle (NCP)] is composed of 147 bp of
DNA wrapped in 1.7 left-handed superhelical turns around an octamer
composed of two copies each of four different histones (Figure 1.14). The fundamental
unit of this packaging, the nucleosome, involves a complex interaction with
histone proteins. This coiling to form chromatin and the interactions with histone
proteins are key elements in epigenetics.
DNA is also modified by biochemical processes such as methylation and two alleles
with the same sequence may have different states of methylation that confer a dif-
ferent phenotype. Though these changes do not alter the DNA sequence, they may
have major effects on the expression of the gene. Some of these changes are heritable,
though they do not affect the DNA structure. Methylation is an important factor for
post-translational modification of histones and subsequent formation of nucleosomes
for packing DNA in the nucleus. It is thought that methylation establishes epigenetic
[Figure 1.14 labels: core DNA (1.7 turns), linker DNA, histones H1, H2A, H2B, H3, and H4, nucleosome; dimensions 5.5 nm and 11 nm.]
Figure 1.14: The NCP: the basic unit of chromatin. The figure shows the structure of the histone octamer
with the N-terminal tails and DNA wrapped around the histone structure. This core protein comprises four
different histones: H2A, H2B, H3, and H4. The NCP organizes 147 bp of DNA in a 1.7 left-handed helical coil.
Small DNA sections called linker DNA help to stabilize the structure by association with a linker histone H1. (From
Armstrong L [2014] Epigenetics. Garland Science.)
inheritance as long as the maintenance methylase acts to restore the methylated state
after each cycle of replication. Thus, a methylated state can be perpetuated through an
indefinite series of somatic mitoses. Methylation is an important factor in histone formation
for packing DNA in the nucleus. A useful site for epigenetics can be found at
http://www.ncbi.nlm.nih.gov/epigenomics/.
Not all geneticists interested in epigenetics are focused on biochemical processes—some
are concerned with the wider meaning of the word epigenetics as “epi-,” i.e. outside genet-
ics, in more general terms. This group concerns themselves with the environment, diet,
lifestyle, and the interaction with our genes. There is significant evidence to link these
phenotypic characteristics with genetic variation, as noted in Section 1.7.
Conclusions
This chapter illustrates some of the basic science and statistical concepts that are a prereq-
uisite for understanding genetic variations in human populations. Human evolution has
created a diverse and complex gene pool, and a number of different interacting evolution-
ary forces have played a key role in the generation of this diversity. These include muta-
tion, migration, genetic drift, and the bottleneck and founder effects, to name but a few.
Considering genetic diversity is important in the study of complex diseases. The level of
genetic diversity in a population varies depending on the population size and structure.
Genetic diversity is not confined to the autosome, but is also found in the mtDNA.
Other factors that are important in considering complex disease are gene expression and
phenotype. Epigenetics is important both in the narrow sense, where we consider such
things as methylation of genes, and in a broad sense, where we consider the impact of the
environment (e.g. nutrition) on disease.
Statistics plays an increasingly important and complex role in modern genetics. Though
there are many excellent statistical programs to use for analysis of data, all those with an
interest in this field are advised to have a basic understanding of the statistical concepts
that apply to studies of complex disease.
Diversity in the gene pool confers both advantages and disadvantages to individuals within
a population. Diversity through modification of common genes can lead to a variety of
diseases, such as those described by the classic genetic models: chromosomal, mitochon-
drial, and Mendelian traits. However, these are rare. Diversity also leads to increased risk
of more common diseases (genetically complex diseases), whereby the inheritance of an
allele confers an increased or reduced risk. This is important in evolution because a diverse
gene pool increases the likelihood that some of the population will survive even the most
severe disease.
The research that is being applied to complex disease will be increasingly applied in medical
practice, but there are also societal and ethical issues arising from our increased knowledge
of the genome and the increasing role it is beginning to play in modern medical practice.
Further Reading
Books
Armstrong L (2014) Epigenetics. Garland Science. This book is a useful new guide to epigenetics.
Cavalli-Sforza LL & Bodmer WF (1971) The Genetics of Human Populations. W. H. Freeman.
Collins F (2010) The Language of Life: DNA and the Revolution in Personalised Medicine. Harper Collins.
Crawford MH (2007) Foundations of Anthropological Genetics. In Anthropological Genetics: Theory, Methods and Applications (Crawford MH ed.), pp 1–16. Cambridge University Press.
Darwin C (1859) The Origin of Species: By Means of Natural Selection or the Preservation of Favoured Races in the Struggle for Life. J Murray. This is perhaps the most important book in the history of genetics.
Hedrick PW (2011) Genetics of Populations, 4th ed. Jones & Bartlett.
Jones S (2000) The Language of the Genes, 2nd ed. Harper Collins.
Lewis R (2011) Human Genetics – The Basics. Routledge. An alternative very short textbook on genetics that covers the basics quite well. This is a good source for students wishing to extend their knowledge to include more on classical genetics in a short time.
Strachan T & Read AP (2011) Human Molecular Genetics, 4th ed. Garland Science. This is an excellent basic genetics textbook that goes into more depth on many of the issues in this chapter of this book. Chapters 2 and 3 and also chapters 15 and 16 of HMG are particularly useful.
Articles
Barreiro LB, Laval G, Quach H et al. (2008) Natural selection has driven population differentiation in modern humans. Nat Genet 40:340–345.
Cann RL, Stoneking M & Wilson AC (1987) Mitochondrial DNA and human evolution. Nature 325:31–36.
Ermini L, Olivieri C, Rizzi E et al. (2008) Complete mitochondrial genome sequence of the Tyrolean Iceman. Curr Biol 18:1687–1693.
Ermini L, Wilson IJ, Goodship TH & Sheerin NS (2012) Complement polymorphisms: geographical distribution and relevance to disease. Immunobiology 217:265–271.
Ewing B & Green P (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat Genet 25:232–234.
Gilbert MT, Kivisild T, Grønnow B et al. (2008) Paleo-Eskimo mtDNA genome reveals matrilineal discontinuity in Greenland. Science 320:1787–1789.
Green RE, Malaspinas AS, Krause J et al. (2008) A complete Neanderthal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134:416–426. This is a fascinating report of mtDNA sequencing in an ancient mtDNA sample.
Hales CN & Barker DJP (1992) Type 2 (non-insulin dependent) diabetes mellitus: the thrifty phenotype hypothesis. Diabetologia 35:595–601. This is an original paper that contradicts Neel’s thrifty gene hypothesis; see Neel (1962).
Hardy GH (1908) Mendelian proportions in a mixed population. Science 28:49–50. This is the original report of the Hardy–Weinberg Principle.
Mayo O (2008) A century of Hardy–Weinberg equilibrium. Twin Res Hum Genet 11:249–256. This is a review of this difficult concept, and is of particular use for students and researchers needing to understand the principle in more detail.
Mueller JC (2004) Linkage disequilibrium for different scales and applications. Brief Bioinform 5:355–364.
Nachman MW & Crowell SL (2000) Estimate of the mutation rate per nucleotide in humans. Genetics 156:297–304.
Neel JV (1962) Diabetes mellitus: a “thrifty” genotype rendered detrimental by “progress”? Am J Hum Genet 14:353–362. This is an original paper with an interesting original hypothesis. The hypothesis is contested by Hales and Barker (1992).
Reich DE, Cargill M, Bolk S et al. (2001) Linkage disequilibrium in the human genome. Nature 411:199–204.
Soares P, Ermini L, Thomson N et al. (2009) Correcting for purifying selection: an improved
overestimate the impact of our genes on disease. In the current millennium it is hard to
disagree with the concept that our genes play a major role in disease and disease susceptibility.
Even if not all of the assigned heritability is genetic, it is likely that a great deal of this
heritability will be genetic and, since most diseases are not Mendelian or chromosomal, a
very significant proportion will fit into the classification of genetically complex disease. In
fact, chromosomal abnormalities account for less than 1% and Mendelian disease for no
more than 4% of human disease. Does this mean that 95% of human diseases are geneti-
cally complex diseases? Geneticists do not agree on how many diseases may have a genetic
component. However, all of the evidence suggests it may be a significant number of differ-
ent diseases, and these include some infectious, immune and metabolic diseases as well as
some cancers and some forms of toxicity.
In this chapter, we will consider the definition of genetically complex disease. We will com-
pare complex diseases to diseases that arise either as a result of major chromosomal abnor-
malities, single inherited genetic mutations (Mendelian diseases), or genetic abnormalities
in the mitochondrial genome. In addition, we will consider in detail three different models
for genetically complex disease using key examples to illustrate how each model is different.
a particular risk allele may affect disease outcome or phenotype, e.g. signs, symptoms, or
response to treatment. Genetic variations and the terminology that is applied to them are
discussed at length in Chapter 1 and also in the Glossary at the end of the book, which
lists some of the appropriate terminology that applies to this area of genetic research.
[Figure 2.1 panel labels: One allele; One allele; Environment; One or more alleles. Panel note: one allele alone leading to a disease, but the environment may play a role in determining the phenotype, so there can still be more than one phenotype.]
Figure 2.1: One or more alleles acting alone or in concert illustrates the key concept in the genetics of
complex disease. The first three patterns illustrate one allele alone, more than one allele acting independently,
and several alleles acting in concert. In each case the final outcome is complicated because the phenotype
is determined by both the genotype and the environment in most cases, and so there can be more than one
phenotype as illustrated by the different facial expressions. The final example illustrates several different
frequency in the population. Most studies chose SNPs with a frequency of no less than
5% for the rarest alleles. However, as study groups are getting larger, the thresholds for the
lowest frequency SNPs are being set at lower levels (i.e. 1% or less). This is because greater
sample numbers are now available for testing which increases the statistical power of the
study enabling less common SNPs to be included. Assessing less common SNPs may pick
up hitherto missed genetic associations.
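The link between sample size and the lowest usable allele frequency can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name and the sample sizes are hypothetical, and it simply counts how many copies of the rarer allele we would expect to see, remembering that each diploid individual contributes two alleles.

```python
def expected_minor_allele_copies(n_individuals, minor_allele_frequency):
    """Expected number of copies of the rarer allele in a diploid sample."""
    return 2 * n_individuals * minor_allele_frequency


# Hypothetical study sizes, chosen only to show the trend.
for n in (500, 5000, 50000):
    for maf in (0.05, 0.01):
        copies = expected_minor_allele_copies(n, maf)
        print(f"n={n:>6}  frequency={maf:.0%}  expected copies={copies:.0f}")
```

With only a few hundred individuals, a 1% allele is represented by a handful of copies, which is too few for a reliable comparison between cases and controls. With tens of thousands of individuals the same allele is seen hundreds of times.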
Other forms of genetic variation have been investigated including insertions, deletions,
and copy number variations (CNVs). Only a few complex diseases arising from insertions
and deletions have been identified so far. One major example where CNVs have been
reported is in autism.
It should be noted that statistical significance is based on a threshold model whereby the
statistical value obtained is expected to equal or exceed the threshold set prior to any inves-
tigation. This is discussed at greater length in Chapter 5.
The inheritance of a specific risk allele or group of alleles does not necessarily cause the
disease. Inheriting the risk allele or alleles simply increases the likelihood (risk) of disease.
We may say that inheritance of the risk allele is neither necessary nor sufficient for the dis-
ease to occur. Note also that the definition used by Haines and Pericak-Vance includes the
word “decrease.” This indicates the possibility that inheritance of some alleles may protect
from disease. Considering genetic protection is as important as considering susceptibil-
ity, especially when we consider infectious disease (see Chapter 7). This protective effect is
seen quite frequently with different human leukocyte antigen (HLA) alleles and haplo-
types. Many autoimmune diseases have genetic associations with groups of HLA alleles or
haplotypes, some of which increase disease susceptibility, while others reduce susceptibil-
ity. Comparison of the susceptibility versus the protective alleles has enabled investigation
of the molecular mechanisms that underpin these HLA associations (see Chapter 6). This
latter development is one of the keys in moving from simple genetic investigation to
understanding basic disease pathology.
It is also essential to dispel any idea of good and bad alleles. This idea oversimplifies the
concept of complex disease. Many of the known risk alleles that have been identified as
potent inducers of disease for some clinical syndromes have also been identified as protective
for other diseases. This is illustrated by the example of the impact of the C-C chemokine
receptor 5 Δ32 deletion (CCR5-Δ32), which protects from human immunodeficiency
virus (HIV) infection, but predisposes to infection with West Nile virus (see Chapter 7).
Genetic variation also affects the outcome of disease
Finally, genetic variation does not only determine susceptibility and resistance to dis-
ease per se, but also determines various outcomes following disease onset. Therefore, it
is appropriate to use the term “trait” in place of the word “disease” in this definition. By
restricting the definition to disease we narrow the field, and exclude the possibilities that
common genetic variations influence disease severity and outcome both before and fol-
lowing treatment. However, it is quite clear that morbidity, mortality, and response to
certain commonly used pharmacological agents (including adverse drug reactions) are all
influenced by common genetic variation. The use of the word trait as opposed to disease is
important throughout the study of complex disease, but it is particularly important when
considering the role of genetic variation in infectious diseases (see Chapter 7) and also in
pharmacogenetics (see Chapter 8).
Figure 2.1: (Continued) phenotypes resulting from the same risk portfolio. Phenotypes are identified by different
facial expressions: smile, open mouth, and star decoration on temple (in the last case only). Variable phenotype
and pleiotropy are similar. Note that some authors use alleles and genes as interchangeable terms. However,
when talking about variation of a gene at a locus, the correct term is allele. Phenotype is a complex issue because
although each of these models indicates a single disease, the clinical phenotype is the product of both genetic
and environmental factors. Therefore, the phenotype can vary even with a single allele.
The take-home message from all of the points above is that the simple definition of a
genetically complex disease as a disease or trait where “alterations in more than one gene that
acting alone or in concert either increase or reduce the risk” needs to be deconstructed and
considered word for word. It is essential to remember we have, in the past, been dealing
with alleles that for the most part are common. However, more recently we have started
to investigate risk alleles with low frequencies. Developments in technology, statistical
analysis, and study design have all played important roles in focusing our attention on
these less common alleles. Finally, before we move on to concentrate on complex dis-
ease, let us first consider genetic diseases that are not classified as genetically complex
diseases: chromosomal, Mendelian, and mitochondrial diseases.
Figure 2.2: Numerical chromosomal abnormalities. The top of this figure illustrates three different models of
triploidy: (1) dispermy, (2) diploid egg, and (3) diploid sperm. At the bottom of this figure, (4) illustrates tetraploidy
where DNA is duplicated, but there is no cell division. (Adapted from Strachan T & Read A [2011] Human
Molecular Genetics, 4th ed. Garland Science.)
Tetraploidy occurs when offspring inherit four copies of the whole genome. Tetraploidy is
rare and is most commonly caused by the failure of the cell to divide after the first DNA
duplication. Tetraploidy is not compatible with life.
Triploidy can occur through several mechanisms, the most common of which is two
sperm fertilizing a single egg (dispermy) and, as with tetraploidy, triploidy is not compat-
ible with life.
Aneuploidy
Individuals with aneuploidy have one or more chromosomes present either with an extra
copy or with a missing copy. Aneuploidy involving the autosomes is usually lethal, but
it can be compatible with life. Key examples include: trisomy 13 associated with Patau
syndrome; trisomy 18 associated with Edwards syndrome; and trisomy 21 associated with
Down syndrome. Aneuploidy associated with additional copies of the sex chromosomes
(e.g. XXX, XXY, XYY) is mostly associated with a normal lifespan and relatively few minor
clinical problems. By contrast, aneuploidy associated with missing sex chromosomes is fre-
quently lethal, though monosomy X (associated with Turner’s syndrome) is an exception.
Ninety-nine percent of all cases involving missing sex chromosomes abort spontaneously
(Table 2.1).
Numeric abnormalities
  Polyploidy    Tetraploidy   Four copies of whole genome
                Triploidy     Three copies of whole genome
                Monosomy      One copy of whole genome
                Mosaicism     Variable number of copies in different cells
  Aneuploidy    Trisomy       Three copies of one or more chromosomes
                Monosomy      Only one copy of one or more chromosomes
Structural abnormalities
  Deletions         Unbalanced abnormalities
  Inversions        Paracentric or pericentric balanced abnormalities
  Duplications      Unbalanced abnormalities
  Insertions        Unbalanced abnormalities
  Ring formation
  Translocations    Balanced abnormalities
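The numerical categories in the table can be illustrated with a short sketch that labels a total chromosome count. This is a deliberate simplification and not a cytogenetic method: it assumes the human haploid number n = 23, looks only at the total count, and so cannot tell which chromosome is gained or lost. The function name is hypothetical.

```python
HAPLOID_NUMBER = 23  # human haploid chromosome number (n)


def classify_chromosome_count(count, n=HAPLOID_NUMBER):
    """Label a total chromosome count using the numerical categories above."""
    if count == 2 * n:
        return "diploid (normal count)"
    if count > 2 * n and count % n == 0:
        # Whole extra sets of chromosomes: 3n, 4n, ...
        names = {3: "polyploidy: triploidy", 4: "polyploidy: tetraploidy"}
        return names.get(count // n, f"polyploidy: {count // n}n")
    if count == 2 * n + 1:
        return "aneuploidy: trisomy"
    if count == 2 * n - 1:
        return "aneuploidy: monosomy"
    return "other numerical abnormality"


for total in (46, 69, 92, 47, 45):
    print(total, classify_chromosome_count(total))
```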
and can occur at different phases of the cell cycle. There are natural mechanisms to prevent
cells with unrepaired chromosomes from entering mitosis. Structural abnormalities occur
when breaks are repaired incorrectly. For example, the broken ends of the chromosome
are sometimes joined incorrectly and this can lead to chromosomes without a centromere
(acentric) or with two centromeres (dicentric). Such abnormal chromosomes will not
segregate stably in mitosis and are eliminated. However, chromosomes with a single cen-
tromere carrying structural abnormalities can be propagated through mitosis.
Balanced abnormalities
When structural abnormalities are associated with no overall gain or loss of chromo-
somal material they are said to be balanced. Provided the balanced structural abnor-
malities do not result in disruption of the function of a gene through interrupted
expression, control, or activation of the gene, they are unlikely to have an adverse
effect on the phenotype. Balanced abnormalities are those involving inversions and
translocations.
Unbalanced abnormalities
When structural abnormalities are associated with gains or losses of chromosomal mate-
rial they are said to be unbalanced. In contrast to balanced abnormalities, unbalanced
chromosomal abnormalities are more likely to give rise to problems in meiosis, and they
result from deletions, insertions, and duplications. As a consequence, these abnormalities
are likely to have adverse effects on the phenotype.
As stated above, chromosomal abnormalities are rare. It is important for the reader to
note that, unlike complex disease, all carriers of these abnormalities are affected and
symptoms most often occur from birth. However, these genetic illnesses are not the
subject of this book.
[Figure 2.3 panels: (a) Two breaks in the same arm, giving a deletion or an inversion. (b) Two breaks in two different arms, giving an inversion or, when the broken ends join at a fusion point, a ring chromosome. Segments are labeled a to g.]
Figure 2.3: Outcomes after incorrect repair of two breaks on a chromosome. (a) Incorrect repair of two
breaks in the same arm of the chromosome can lead to either a paracentric inversion (such as e and f inversion)
or interstitial deletion (i.e. deletion of e and f) of the fragments. In the former case, the inversion is paracentric
as it does not involve the centromere. (b) Incorrect repair occurs following breaks in two different arms of the
same chromosome. Here, the fragments coded from b–f are inverted, and in this case the inversion involves the
centromere and is referred to as a pericentric inversion. As the fragment involves the centromere, the fragment
could also rejoin to form a stable ring chromosome. In this latter model, the fragments b–f are included in the ring
structure, but fragments a and g are missing. The banding patterns represent the karyotype shown in Figure
1.1. (From Strachan T, Goodship J & Chinnery P [2014] Genetics and Genomics in Medicine. Garland Science.)
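The outcomes described in Figure 2.3 can be mimicked with simple string operations. The sketch below is illustrative only: the chromosome is written as the segments a to g, and the centromere is marked with "*" between c and d. That position is an assumption made for this example, chosen so that the e and f inversion is paracentric and the b to f inversion is pericentric, as in the caption. All function names are hypothetical.

```python
CENTROMERE = "*"  # hypothetical marker; its position between c and d is assumed


def invert(chrom, start, end):
    """Rejoin the fragment from start to end in reverse orientation."""
    i, j = chrom.index(start), chrom.index(end)
    return chrom[:i] + chrom[i:j + 1][::-1] + chrom[j + 1:]


def delete(chrom, start, end):
    """Lose the fragment from start to end (interstitial deletion)."""
    i, j = chrom.index(start), chrom.index(end)
    return chrom[:i] + chrom[j + 1:]


def ring(chrom, start, end):
    """Join the two ends of the fragment; segments outside it are lost."""
    i, j = chrom.index(start), chrom.index(end)
    return chrom[i:j + 1]


def inversion_type(chrom, start, end):
    """Pericentric if the inverted fragment contains the centromere."""
    i, j = chrom.index(start), chrom.index(end)
    return "pericentric" if CENTROMERE in chrom[i:j + 1] else "paracentric"


chromosome = "abc*defg"
print(invert(chromosome, "e", "f"), inversion_type(chromosome, "e", "f"))
print(delete(chromosome, "e", "f"))
print(invert(chromosome, "b", "f"), inversion_type(chromosome, "b", "f"))
print(ring(chromosome, "b", "f"))
```

The ring keeps the centromere and so can segregate in mitosis, whereas a fragment without the "*" marker would correspond to an acentric chromosome that is lost.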
Gregor Mendel’s research, or more specifically the rediscovery of his papers in 1900, is
rightly credited as the time at which modern genetics was born. However, people were
aware of the passage of traits from generation to generation long before the discovery
of the papers reporting Mendel’s experiments with pea plants. Indeed, the Bible gives
the example of Jacob’s sheep. Gregor Mendel was born Johann Mendel in a village in
Moravia (later Czechoslovakia) in 1822; he entered an Augustinian Monastery in 1843.
He was a gifted teacher, though he failed his teaching certificate. Mendel’s experiments
with peas enabled him to identify different patterns of inheritance for specific selected
traits. The terminology applied to his work post-dates his discoveries because, despite the
ground-breaking significance of his work, initial interest in his papers rapidly faded.
Disappointed and burdened with administration, he retired into monastic life and died in
1884. Nevertheless, he is and was the founding father of modern genetics, and we refer
to his name every time we look at a disease with a known pattern of inheritance.
called Mendelian patterns of inheritance, and today these patterns are used in classify-
ing genetic diseases and traits. Over 4609 traits are currently identified on the OMIM
(Online Mendelian Inheritance in Man) database for which the molecular basis of the risk
gene is known (http://www.ncbi.nlm.nih.gov/omim). Over 12,000 genes are also listed
on OMIM. Not all of these traits and genes are for Mendelian diseases; many are for
complex diseases, but those which are Mendelian disorders all involve a single gene (i.e.
they are monogenic) and all conform to one of five readily identifiable patterns: autosomal
dominant, autosomal recessive, X-linked dominant, X-linked recessive, and Y-linked
(illustrated by the pedigrees in Figure 2.4). Sex-linked disorders may also be referred to as
gonosomal as opposed to autosomal.
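The expected proportions behind the autosomal patterns can be worked out by listing every combination of parental alleles. The following is a minimal sketch, assuming a single autosomal gene with two alleles written "A" and "a". The notation and the function name are illustrative and are not taken from the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Expected offspring genotype proportions for one autosomal gene.

    Each parent is a two-letter genotype such as "Aa"; every pairing of one
    allele from each parent is treated as equally likely.
    """
    counts = Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Autosomal recessive: two unaffected carriers ("Aa"), affected offspring are "aa".
print(cross("Aa", "Aa"))
# Autosomal dominant: here "A" is taken as the disease allele, so "Aa" is an
# affected heterozygote and "aa" is an unaffected parent.
print(cross("Aa", "aa"))
```

In the first cross one quarter of offspring are expected to be homozygous for the recessive allele. In the second, half of the offspring are expected to inherit the dominant allele, which is why a dominant trait does not skip generations.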
In addition, occasionally geneticists will use terms such as semi-dominant to describe
a family where heterozygotes have a phenotype. In many cases it may not be clear if the
heterozygote has a less severe phenotype than the homozygote, as would be expected in
a truly semi-dominant disease. However, for the most part, Mendelian diseases conform
to the expected patterns. Mendelian diseases can be described as those where a particular
genotype at one locus is both necessary and sufficient for the character to be expressed.
Mendelian genotypes have variable phenotypes
It is important to understand that biological systems are complex and interactive.
Therefore, though these Mendelian diseases are considered simple in as much as they
involve a single dysfunctional genetic character at a single gene locus, the biological
character affected by this disability is likely to be programmed by a large number of
interacting genes and also by environmental factors. This complexity may explain the dif-
ferent levels of phenotypic variation in individuals and families with the same Mendelian
genotype. Expression of the trait even in Mendelian disease is not always absolute. There
are variable levels of penetrance, whereby possession of the mutant gene does not always give
rise to the disease.
An extreme example of this incomplete penetrance is found in the autosomal recessive
disease hemochromatosis. This disease is characterized (as the name suggests) by iron over-
load in the liver and other tissues. The gene responsible for the disease has been identified
and is referred to as HFE. The causative gene is located on chromosome 6p21.3 at the
extreme telomeric end of the major histocompatibility complex (MHC) and was for a
short while referred to as HLA-H. A number of studies indicate that the penetrance in
those with two copies of the HFE mutation can be extremely low, with the majority of
homozygous carriers being asymptomatic and possibly unaffected. This pattern of homozygosity
for the disease-causing mutation with extremely low levels of penetrance (disease
expression) is unusual. Of course, in a late-onset disease such as this one, we cannot be
certain that some of the homozygotes will not develop the disease at a later date.
[Figure 2.4 key: Male, Female; Affected, Carrier, Unaffected. Panels (a) to (e).]
Figure 2.4: Genetic pedigrees in Mendelian diseases. The five different patterns displayed represent three
generations all modeled on the same family structure. (a) Autosomal dominant: both males and females are
affected (marked in black) and the trait does not skip generations. (b) Autosomal recessive: both males and females
are affected, but here not all generations are affected (e.g. the grandparents are unaffected carriers); carriers are
marked in gray. (c) X-linked recessive: as with autosomal recessive, the trait is not present in all generations; mostly males are
affected as they carry only one copy of X, but females can be affected. (d) Y-linked: males only are affected and as
they inherit a single Y chromosome only, all Y-linked Mendelian disease assumes a dominant pattern. (e) X-linked
dominant: passed from fathers to daughters in the first generation (males receive the Y chromosome from the
affected father). When these daughters have offspring, the trait is passed to both sons and daughters, but sons
and daughters are equally likely to inherit the wild-type gene from their mothers and not develop the trait.
is in stark contrast to complex disease where penetrance is mostly low. In many cases of
complex disease the majority of those who inherit the risk allele do not express the disease
phenotype and some members of the population develop the disease in the absence of the
risk allele. In complex disease the presence of an allele confers risk, but the risk allele is
neither necessary nor sufficient for the disease to occur. Examples of this are to be found
throughout this book. One key example is ankylosing spondylitis—a degenerative disease
of the lower spine that occurs in young adults and is associated with the HLA-B*27 family
of HLA alleles. In some populations it has been found that up to 94% of patients with
this illness have HLA-B*27. The odds ratio (OR) (a crude measure of the population
risk; see Box 2.2) for ankylosing spondylitis in those with HLA-B*27 compared with those
without HLA-B*27 is estimated to be as high as 171. From a personal
point of view this increased risk seems to be high until we consider that 6% of ankylosing
spondylitis patients do not have HLA-B*27 and that on average only one out of every 25
HLA-B*27-positive individuals will develop the disease (Figure 2.5). Interestingly, recent
genome-wide association studies (GWAS) have identified a range of other risk alleles for
ankylosing spondylitis outside the MHC (see Chapter 3).
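The figures quoted above can be tied together with a short consistency check. This is a sketch under stated assumptions, not a result from the text: it assumes that the OR of 171 was calculated on carrier status rather than allele counts, that healthy controls resemble the general population, and that the 94%, the OR, and the one-in-25 penetrance all describe the same population. The function names are hypothetical.

```python
def implied_carrier_frequency(carrier_fraction_in_cases, odds_ratio):
    """Carrier frequency in controls implied by the case frequency and the OR."""
    odds_in_cases = carrier_fraction_in_cases / (1 - carrier_fraction_in_cases)
    odds_in_controls = odds_in_cases / odds_ratio
    return odds_in_controls / (1 + odds_in_controls)


def implied_prevalence(penetrance, carrier_frequency, carrier_fraction_in_cases):
    """P(disease) = P(disease | carrier) x P(carrier) / P(carrier | disease)."""
    return penetrance * carrier_frequency / carrier_fraction_in_cases


# Figures taken from the paragraph above; everything derived from them is
# illustrative only.
carrier_frequency = implied_carrier_frequency(0.94, 171)
prevalence = implied_prevalence(1 / 25, carrier_frequency, 0.94)
print(f"implied HLA-B*27 carrier frequency: {carrier_frequency:.1%}")
print(f"implied disease prevalence: {prevalence:.2%}")
```

The point of the exercise is that a very large OR and a low penetrance are compatible: the allele is far more common in patients than in controls, yet most carriers never develop the disease because the disease itself is uncommon.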
The example above is an illustration of incomplete penetrance at work in complex
diseases. Penetrance is the key to understanding the difference between complex and
Mendelian disease. Whereas in Mendelian disease the presence of the allele (whether
a mutation, polymorphism, or deletion) is considered to be necessary and sufficient to
cause the disease, this is not true for complex disease. This difference is at the heart of
understanding the genetics of complex disease and the potential applications of this
evolving branch of biomedical science.
The odds ratio (OR) is a frequently used, simple statistical measure of risk based on the
odds of carrying an allele in affected cases compared with the odds in healthy controls.
It is usually calculated from the figures entered in the Punnett Square (Table 1.2) as
(A × D)/(B × C). To understand this test, consider a population in which we test a single
gene with two alleles (α and β), in 100 cases and in 100 healthy controls. First remember
that in a diploid species we have two copies of each chromosome, therefore in a group
of 100 individuals there are 200 alleles for this single gene. We perform a genetic test on
this population and the results of our testing show that the frequency of the α allele is
higher in patients than in healthy controls, e.g. there are 150 α alleles in the patients and
only 100 α alleles in the healthy controls. Because we already know that each group of 100
individuals carries 200 alleles, we can deduce that the numbers of β alleles in the
two groups are 50 for patients and 100 for healthy controls. The OR is calculated
as above with (A × D)/(B × C), where the A box value is the number of α alleles in patients
and the D box value is the number of β alleles in controls, while the B box value is the
number of α alleles in healthy controls and the C box value is the number of β alleles
in patients. Numerically, (A × D)/(B × C) = (150 × 100)/(100 × 50); this can be simplified by
cancelling the common factor of 100 in the numerator and denominator to give 150/50, and
then dividing both values by 10 to give a simple calculation of 15/5, from which the
OR value of 3 is derived.
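The calculation in this box can be written as a few lines of code. The sketch below uses the same four cell labels (A, B, C, D) and the same allele counts as the worked example. The function name is hypothetical, and no confidence interval or significance test is attempted.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from allele counts in a 2 x 2 table: (A x D) / (B x C).

    a: risk alleles in patients       b: risk alleles in controls
    c: other alleles in patients      d: other alleles in controls
    """
    if b == 0 or c == 0:
        raise ValueError("the odds ratio is undefined when B or C is zero")
    return (a * d) / (b * c)


alleles_per_group = 2 * 100  # 100 diploid individuals in each group
alpha_patients, alpha_controls = 150, 100
beta_patients = alleles_per_group - alpha_patients   # 50
beta_controls = alleles_per_group - alpha_controls   # 100

print(odds_ratio(alpha_patients, alpha_controls, beta_patients, beta_controls))  # 3.0
```

An OR above 1 indicates that the allele is more common in patients than in controls, an OR below 1 indicates the reverse, and an OR of 1 indicates no difference.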
[Figure 2.5 labels: General population; HLA-B*27+; AS-positive.]
Figure 2.5: HLA-B*27 and ankylosing spondylitis. The figure illustrates incomplete penetrance for HLA-B*27
in ankylosing spondylitis (AS). Most B*27-positives are ankylosing spondylitis-negative even though most
ankylosing spondylitis-positives have B*27.
disease, early onset increases the likelihood of availability of family members for genetic
testing and also influences the likelihood of family members engaging in genetic research.
Families are a rich source of information and traditional (parametric) linkage-based anal-
ysis can be applied. Of course not all families are informative. There may be too few fam-
ily members available for testing or too few affected family members. Other options in
Mendelian disease include non-parametric linkage analysis or case control population-
based association analysis. However, these other options are rarely used in Mendelian dis-
ease, except in the more common Mendelian diseases and when parametric linkage testing
is not possible. In contrast, these latter options are widely used in complex disease studies,
as discussed later in this chapter.
Locus heterogeneity
Locus heterogeneity refers to the occurrence of the same trait arising from genetic varia-
tion at different loci. It is not uncommon for the same Mendelian disease to result from
mutations in different genes in different families.
Allelic heterogeneity
Allelic heterogeneity refers to the occurrence of the same trait/phenotype from differ-
ent mutations at the same locus. In these cases, a single gene is involved, but there are
multiple potential disease-causing mutations. An example of allelic heterogeneity is the
autosomal recessive disease cystic fibrosis, for which there are more than 1910 disease-
causing mutations listed on the CFTR mutation database
(http://www.genet.sickkids.on.ca/StatisticsPage.html). Therefore, each patient may have a different CFTR genotype
and still have the same phenotype. Note, however, as with many of these Mendelian
diseases where there are multiple disease-causing mutations at a single locus, in cystic
fibrosis, there is one very common mutation, the Δ508 mutation, that accounts for the
majority of cases and a host of less common mutations that account for the remainder
(Figure 2.6).
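To see how one common mutation and many rare ones translate into patient genotypes, the sketch below splits patients into three classes. It is an illustration under a simplifying assumption that is not made in the text: the two disease alleles carried by a patient are treated as independent draws from the pool of disease alleles. The share of approximately 66% is taken from Figure 2.6, and the function name is hypothetical.

```python
def genotype_classes(common_share):
    """Return the expected fractions of patients with two, one, or no copies
    of the common mutation, assuming independent disease alleles."""
    p = common_share
    q = 1 - p
    return p * p, 2 * p * q, q * q


two_copies, one_copy, no_copies = genotype_classes(0.66)
print(f"two copies of the common mutation: {two_copies:.0%}")
print(f"one common and one rarer mutation: {one_copy:.0%}")
print(f"two rarer mutations:               {no_copies:.0%}")
```

Under this assumption, more than half of patients would carry at least one of the rarer mutations, which is consistent with the statement that patients with the same phenotype often have different CFTR genotypes.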
Clinical heterogeneity
Most human diseases are very variable. Even in Mendelian disease patients with the same
genetic mutation display variation in clinical symptoms. This is referred to as clinical
heterogeneity. It should not be assumed that this is entirely due to genetics. The environ-
ment is likely to play a significant role in the determination of clinical outcomes.
Genetic anticipation
Genetic anticipation describes the tendency of some Mendelian dominant diseases to
become more severe in successive generations. Examples of this phenomenon include
fragile X syndrome, myotonic dystrophy, and Huntington’s disease. All three of these
diseases arise from the inheritance of specific unstable trinucleotide repeats. In each case,
severity or age of onset is associated with the number of repeat sequences. The number of
repeats has been shown to increase in successive generations. Genetic anticipation in other
diseases remains controversial.
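Since severity or age of onset tracks the number of repeats, counting repeats in a sequence is the basic measurement. The following is a minimal sketch: the repeat unit "CAG" is used only as an illustrative default, the sequences are invented for the example, and no clinical thresholds are implied. The function name is hypothetical.

```python
import re


def longest_repeat_run(sequence, unit="CAG"):
    """Length, in repeat units, of the longest uninterrupted run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


# Invented sequences: the same flanking bases with a growing repeat tract.
generations = {
    "grandparent": "GG" + "CAG" * 20 + "TT",
    "parent": "GG" + "CAG" * 32 + "TT",
    "child": "GG" + "CAG" * 45 + "TT",
}
for person, seq in generations.items():
    print(person, longest_repeat_run(seq))
```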
[Figure 2.6 labels: Cystic fibrosis; ∆508 approx. 66%; G542X; G551D; N1303LYS; W1282X; another 1905 alleles, all at <1%.]
Figure 2.6: Mutations in the CFTR gene that cause cystic fibrosis. There are over 1910 listed mutations in the
CFTR gene. Of these, one accounts for approximately two-thirds of all mutations, while most others (with a few
exceptions) are found at frequencies of less than 1%.
Genomic imprinting
Genomic imprinting is considered to be an epigenetic phenomenon and is more com-
plex than any of the above. In some families with autosomal dominant disease the trait
is only expressed if inherited from a parent of one particular sex. This occurs despite
the fact that the disease-causing mutation can be present in both male and female parents.
An example of genomic imprinting is Beckwith–Wiedemann syndrome, which
is only expressed in children who inherit the mutated gene from their mother. There
are other examples of human disease where imprinting has been identified as a key
factor, including Prader–Willi syndrome, Angelman syndrome, and Silver–Russell syndrome
(Table 2.2). Research has identified approximately 100 different imprinted loci on 11
mouse chromosomes (http://www.mousebook.org/mousebook-catalogs/imprinting-resource).
Even so, imprinting is the least well understood of the many complications
that apply to Mendelian disease.
An example of genomic imprinting is given by the IGF2 gene encoding insulin-like growth
factor 2. Offspring inherit one IGF2 allele from their mother and one from their father,
but the paternal allele of IGF2 is expressed, while the maternal one is completely silent. If
both alleles begin to be expressed in a cell, that cell may develop into a cancer.
Children with Prader–Willi syndrome have small hands and feet, short stature, poor sexual
development, and mental retardation, while children with Angelman syndrome exhibit
frequent laughter, uncontrolled muscle movement, a large mouth, and unusual seizures.
Imprint control has been found to be restricted to small areas known as imprinting
centers. In the case of the disorders above, the imprinting centers have been identified at
the 5′ end of the SNURF–SNRPN gene. Both syndromes are associated with the loss of
the same small region on the long arm of chromosome 15.
The most common maternal form of Angelman syndrome is a 4 kb mutation on chro-
mosome 15q11–q13 that contains the UBE3A gene encoding for E6AP ubiquitin ligase.
The mutation prevents methylation of the UBE3A gene and also silences the SNRPN
Table 2.2 Some examples of diseases thought to involve genomic imprinting and some of the genes involved.
Chromosomal location of gene | Candidate gene or genes if known | Disease/syndrome/phenotype
11q33 | PGL1 | Paraganglioma 1: hearing loss
7p11.2; 11p15.5 | H19 | Silver–Russell syndrome
5q35 | NSD1/ARA263/STO | Beckwith–Wiedemann syndrome (also: Sotos syndrome, Weaver syndrome, acute myeloid leukemia)
11p15.5 | H19: cyclin-dependent kinase inhibitor 1C (p57Kip2) | Beckwith–Wiedemann syndrome
11p15.5 | KCNQ1 overlap transcript 1 | Beckwith–Wiedemann syndrome
11p15.5 | H19 | Wilms’ tumor 2
13q14.1–q14.2 | RB1 | Retinoblastoma
Microdeletion of 15q11–q13 | SNURF–SNRPN | Prader–Willi syndrome: hypotonic, male hypogonadism, mental retardation, and severe obesity
Microdeletion of 15q11–q13 | UBE3A | Angelman syndrome: small size, severe retardation, lack of speech, and characteristic behavioral pattern
20q31 |  | Pseudohypothyroidism type 1A
20q31 |  | Pseudohypothyroidism type 1B
gene. If the deletion of this region is inherited from the mother, the offspring will develop
Angelman syndrome, because the paternal gene is silenced and the maternal copy, which is normally
almost exclusively expressed, has been lost.
Prader–Willi syndrome involves a similar mechanism, but in these cases the maternal
genes are silenced and not expressed, and the paternal genes are mutated. There may be a
number of different genes involved in cases of Prader–Willi syndrome; however, the same
pathway is affected, and the key to understanding the mechanism of the disease rests with
SNRPN and UBE3A.
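The parent-of-origin logic described above can be sketched in a few lines of Python. This is a deliberately simplified toy model, not a clinical rule: it assumes that only UBE3A and SNRPN matter, that UBE3A is expressed from the maternal copy and SNRPN from the paternal copy (as the text describes), and that the whole region is lost from one parental chromosome. The names `EXPRESSED_COPY`, `SYNDROME_WHEN_LOST`, and `syndrome_from_deletion` are hypothetical and introduced only for illustration.

```python
# Toy model of parent-of-origin effects at 15q11-q13 (illustrative assumptions only).

# Parental copy that is normally expressed; the other copy is silenced by imprinting.
EXPRESSED_COPY = {
    "UBE3A": "maternal",  # paternal copy silenced
    "SNRPN": "paternal",  # maternal copy silenced
}

# Syndrome associated with loss of the only expressed copy of each gene.
SYNDROME_WHEN_LOST = {
    "UBE3A": "Angelman syndrome",
    "SNRPN": "Prader-Willi syndrome",
}


def syndrome_from_deletion(deleted_copy: str) -> str:
    """Return the syndrome expected when the region is deleted on one parental chromosome."""
    if deleted_copy not in ("maternal", "paternal"):
        raise ValueError("deleted_copy must be 'maternal' or 'paternal'")
    # The gene whose only active copy sits on the deleted chromosome is no longer expressed.
    lost_genes = [gene for gene, copy in EXPRESSED_COPY.items() if copy == deleted_copy]
    return SYNDROME_WHEN_LOST[lost_genes[0]]


if __name__ == "__main__":
    for origin in ("maternal", "paternal"):
        print(origin, "deletion:", syndrome_from_deletion(origin))
```

The point of the sketch is that the same deletion gives a different outcome depending only on which parent transmitted it, because the remaining copy of one gene is already silenced.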
It is tempting to ask why genomic imprinting occurs. This type of regulation may have
significance in evolutionary terms, and there may be conflicting pressures acting on maternal
and paternal alleles for genes, such as those that affect fetal growth (see Section 1.9).
proportion of our DNA is inherited from the mitochondria, the so-called mitochon-
drial genome (mtDNA). The human mitochondrial genome is extremely small com-
pared with the nuclear genome (i.e. the autosome). The mitochondrial genome is a
circular molecule of 16,569 bases in length (Figure 2.7). Mitochondrial genetics is
very different to Mendelian genetics. The mitochondrial genome is passed from gen-
eration to generation down the maternal line only and each cell may contain several
thousand copies.
The mitochondrial genome encodes a total of 37 genes: 13 encoding proteins that function
within mitochondria, two ribosomal RNA (rRNA) genes (12S and 16S), and 22 transfer
RNA (tRNA) genes necessary for the synthesis of the 13 encoded polypeptides. There are
two nucleotide strands of the mtDNA: a heavy strand (H) rich in guanine and a light strand
(L) rich in cytosine. The H strand provides a template coding for the two rRNAs, 14 out of
the 22 tRNAs, and 12 out of the 13 proteins. The L strand is the template for the remain-
ing eight tRNAs and one protein. The mitochondrial genome is extremely compact with
approximately 93% of the DNA sequence representing coding sequences. Consequently, all
37 mitochondrial genes lack introns and are tightly packed within the coding region. The
remaining 7% of the mitochondrial genome comprises the displacement (D) loop or control
region, where the replication of both the H and L strands begins. This region also contains
the promoters for transcription of both H and L strands. The primary role of mitochondria
is to provide cells with the bulk of their adenosine triphosphate (ATP), which is used as an
energy source to drive cellular reactions. The proteins produced by the mitochondrial genes
work together to create this energy by turning complex molecules such as sugars into simple
substances such as carbon dioxide and water.
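The gene counts and proportions quoted above can be checked with a short piece of arithmetic. The sketch below uses only the figures given in the text (16,569 bases, the per-strand gene counts, and the approximate 93% coding fraction). The variable names are hypothetical, and because the text gives the coding fraction only approximately, the base counts derived from it are rough estimates.

```python
# Bookkeeping check of the mitochondrial genome figures quoted in the text.
GENOME_LENGTH_BP = 16_569
CODING_FRACTION = 0.93  # approximate value from the text

# Genes templated by each strand, as described in the text.
genes_by_strand = {
    "H": {"rRNA": 2, "tRNA": 14, "protein": 12},
    "L": {"rRNA": 0, "tRNA": 8, "protein": 1},
}

# Sum each gene class over both strands.
totals = {
    kind: sum(strand[kind] for strand in genes_by_strand.values())
    for kind in ("rRNA", "tRNA", "protein")
}
total_genes = sum(totals.values())

# Approximate split between coding sequence and the control (D loop) region.
coding_bp = round(CODING_FRACTION * GENOME_LENGTH_BP)
noncoding_bp = GENOME_LENGTH_BP - coding_bp

if __name__ == "__main__":
    print("Gene totals by class:", totals)
    print("Total genes:", total_genes)
    print("Approximate coding bases:", coding_bp)
    print("Approximate non-coding bases:", noncoding_bp)
```

The per-strand counts add up to the 2 rRNA, 22 tRNA, and 13 protein-coding genes stated in the text, giving 37 genes in total.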
[Figure 2.7 map labels: OH; F; T; RNR1; V; CYTB; RNR2; L1; ND5; ND1; I/M; H, S2 and L2; ND2; ND4L; W; R; ND3; G; COI; COIII; D; COII; K; ATP8 and ATP6]
Figure 2.7: An abridged map of the human mitochondrial genome. The genome is 16.6 kb and encodes 13
essential polypeptides of the oxidative phosphorylation (OXPHOS) system (ND1–ND6, ND4L, the cytochrome
c oxidases COI–COIII, cytochrome b, and the subunits of ATP synthase ATPase 6 and 8) and the necessary
RNA components (two rRNAs and 22 tRNAs) for their translation.
Variation in the mtDNA has been widely associated and linked with
many different diseases
The mitochondrial genome has been associated with many different diseases in humans
(Table 2.3). The majority of early studies concentrated on rare diseases caused by inher-
ited mutations at specific sites, such as Kearns–Sayre syndrome where there are large-scale
deletions in the mtDNA, Pearson’s syndrome, Leber hereditary optic neuropathy, mito-
chondrial myopathy, and many others, most of which occur in infants. More recently,
studies have begun to identify genetic associations with common diseases and illnesses,
including migraines, epilepsy, dementia, cardiomyopathy, renal tubular defects, adrenal
failure, and liver failure, which occur both in infants and adults (Figure 2.8). Diagnosing
and managing mtDNA disease is a major clinical challenge in the second decade of the
new millennium. One strategy that is being applied is to manipulate the mtDNA muta-
tion rates through exercise. A lack of exercise leads to a reduction in mitochondrial enzyme
activity. In contrast, endurance training improves enzyme activity. It has been suggested
that increased exercise may help to reduce the proportion of sporadic mutations, and may
be used to combat some of the clinical symptoms and illness caused by mtDNA muta-
tions. Identification of the dysfunctional mtDNA genes and sequences in mitochondrial
syndromes enables the development of new therapies based on targeting specific gene
Table 2.3 Some of the diseases that are caused by mutations and variations in the human
mitochondrial genome.
[Figure 2.8 labels: the mitochondrial genome (center); cardiomyopathy; Parkinson’s; myopathy; epilepsy and dementia; deafness; neuropathy]
Figure 2.8: Mitochondrial genome and common diseases. Some of the medical conditions that have been
claimed to be related to variation in the mitochondrial genome.
healthy octogenarian in an HGPS patient often before the age of 10 years. This is a cruel
condition for parents and affected children alike.
Genetic testing has shown that the major mutation for this syndrome is a substitution of
thymine for cytosine at position 1824 (C1824T) in the lamin A gene
(LMNA), which gives rise to a shortening of the lamin A protein. The gene is located on
chromosome 1q22. There is a degree of clinical heterogeneity in HGPS and this may be
in part due to genetic variation. Other, less frequent mutations have also been identified
in LMNA and these have been associated with longer life in some cases. LMNA mutations
are also associated with other genetic conditions, such as Charcot–Marie–Tooth disease
and Emery–Dreifuss muscular dystrophy.
Lamin A is an important structural protein and one of a family of intermediate filament
proteins. Intermediate filaments provide stability and strength to all our cells. Lamin A is
a scaffolding component of the nuclear envelope that surrounds the nucleus in cells, cre-
ating a mesh-like layer of intermediate filaments attached to the inner membrane of the
nuclear envelope of the cell.
In HGPS, the C1824T mutation encodes a short version of the lamin A protein called
progerin. This protein cannot be processed correctly within the cell. This can disrupt the
nuclear envelope, and over time damage the structure and function of the nucleus, caus-
ing cells to die prematurely. However, there is hope in HGPS and clinical trials have been
started, though it is early days.
Table 2.4 Some of the major differences between genetically complex diseases and
Mendelian disease.
use of a genetic (Mendelian) model. Therefore, there are no preset parameters as there
are in more conventional parametric linkage analysis. Non-parametric linkage analysis
can be based on families with multiple cases of disease or on affected sibling pairs. Case
control population-based association analysis is a quite different method that simply
compares the frequency of genetic variations between groups of patients or cases and
healthy (matched) individuals from the same population. Until recently, most genetic
studies in complex disease were based on association analysis and this was thought to
be a poor mechanism for the identification of disease genes. However, for various rea-
sons, opinions have changed about the value of association studies in complex disease
(see Chapter 3). A number of other differences between studies of Mendelian versus
complex disease are listed in Table 2.4.
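As a minimal sketch of what a case control association analysis computes, the Python below compares carriage of a single allele between cases and controls using an odds ratio and a Pearson chi-square test on a 2 × 2 table. The counts are invented for illustration and do not come from any study, the function name `allele_association` is hypothetical, and real studies add steps not shown here (matching, correction for multiple testing, and so on).

```python
from math import erfc, sqrt


def allele_association(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Return (odds ratio, chi-square statistic, P value) for a 2 x 2 case control table."""
    a, b, c, d = case_carriers, case_noncarriers, control_carriers, control_noncarriers
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive in this simple sketch")
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    # Pearson chi-square for a 2 x 2 table (no continuity correction), 1 degree of freedom.
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi_square / 2))
    return odds_ratio, chi_square, p_value


if __name__ == "__main__":
    # Hypothetical counts: 200 cases (60 carriers) and 200 controls (30 carriers).
    odds_ratio, chi_square, p_value = allele_association(60, 140, 30, 170)
    print(f"OR = {odds_ratio:.2f}, chi-square = {chi_square:.2f}, P = {p_value:.2g}")
```

An odds ratio above 1 indicates that the allele is more frequent in cases than in controls; it says nothing by itself about whether the allele is necessary or sufficient for disease.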
with the risk allele, because expression is modified by another gene causing incomplete
penetrance. Second, the presence of the disease in some without the risk allele may simply
indicate a different disease-causing mutation is active in those individuals. Taken altogether
this could explain why the disease-causing mutation is neither sufficient nor necessary for
the disease to occur and may explain a large proportion of monogenic complex disease.
risk allele is likely to be the total genetic contribution in some of the latter (i.e. they
are monogenic). In HSCR, not all cases are complex—a significant proportion are
due to chromosomal abnormalities or are familial with clear Mendelian inheritance
and only 70% of cases fit the pattern of oligogenic complex diseases. In Crohn’s dis-
ease, approximately 10% of cases may have a familial pattern of inheritance and some
of these are likely to be Mendelian cases, but the majority of cases are sporadic and
genetic susceptibility is determined by a large number of different risk alleles, indicat-
ing a polygenic model.
[Figure 2.9 label: dementia]
Figure 2.9: The mechanisms of Alzheimer’s disease. (a) A healthy brain compared with a brain with advanced
Alzheimer’s disease. (b) The processes and some of the genes involved in the formation of fibrils, plaques, DNA
damage, and neurofibrillary tangles. (From Armstrong L [2014] Epigenetics. Garland Science.)
PSEN1, and PSEN2. In LOAD, it is the apolipoprotein E4 (APOE4) allele that confers the
major risk. There is clear evidence for a causative link with APOE4 and LOAD. Prince et al.
(2004) found that the presence of the APOE4 allele was strongly associated with reduced
levels of the β-amyloid-42 protein in cerebrospinal fluid. However, there are multiple
other risk alleles that have been identified for LOAD; OMIM currently lists at least 16
possible genetic sites and many more are listed in the literature that have yet to be added to
this list. There are individual cases of LOAD where the APOE4 allele is the only identified
risk allele. APOE4 may account for as much as 50% of the genetic risk in LOAD. However,
new research may at any time identify hitherto unknown risk alleles. In addition, we cannot
ignore the potential effects of the environment. For example, Evans et al. (2004) reported
an association between total serum cholesterol and disease progression in patients with
LOAD who did not have the APOE4 allele.
[Figure 2.10 labels: chromosomal; Mendelian; complex (sporadic)]
Figure 2.10: HSCR is either a chromosomal, Mendelian, or a complex disease. The majority of cases of HSCR
are genetically complex (70%); however, a substantial number are familial cases (inherited in a Mendelian pattern,
12%) and some occur from chromosomal abnormalities (18%).
identified. These include a deletion in the RET transforming sequence (oncogene RET)
that encodes a receptor tyrosine kinase on chromosome 10q in the area 10q11–q21
(HSCR1) which is found in 50% of familial and 15–35% of sporadic cases, the gene
encoding endothelin receptor type B (EDNRB) on chromosome 13q (HSCR2), as well
as 11 other gene loci. Three of these 11 other genes appear to have a minor function and
six of the 11 appear to be modifier genes. In addition, there are five identified chromo-
somal locations for which the candidate genes have yet to be identified (summarized in
Table 2.5).
The RET oncogene deletion is the major genetic risk factor for the isolated non-
syndromic form of the disease, but does not account for all cases. In addition, there
is a male bias for cases with this deletion (65% males). Linkage analysis indicates that
several other genes interact with the RET deletion to increase the risk of the disease
as noted above.
The EDNRB gene was identified as a risk factor for HSCR through a series of linkage studies
in a Mennonite kindred that identified deletions and mutations in chromosome 13q.
Independent work in mouse models suggested a relationship between HSCR and EDNRB, which is
located within this region. As with the RET mutation, there is incomplete penetrance, and
penetrance may also be sex-specific.
Both RET and EDNRB encode receptors that are active in the process of neural crest
stem cell regulation. It is therefore easy to understand how mutations in these genes can
give rise to HSCR. As there are two pathways in the process, it is also possible to envis-
age why a mutation in one pathway does not always lead to the clinical syndrome being
expressed.
Other susceptibility alleles for HSCR have been proposed and research suggests that
mutations in the genes encoding the ligands for RET and EDNRB may be involved as
well as a number of genes encoding for other components of the biological pathways
that are involved in neural crest stem cell development (Figure 2.11). Thus far, studies
have identified mutations in GDNF and EDN3, as well as ECE1 and NTN. An asso-
ciation with SOX10 (which may interact with ZFHX1B and SOX8) has been reported,
though this remains controversial and may relate to other similar developmental disor-
ders. Another gene implicated in HSCR is the PHOX2B gene, which interacts with the
RET gene.
Some of the associated risk alleles above and a number of other more recently identified
potential risk alleles (L1CAM, which interacts with EDNRB, and NRG1, BBS4, BBS5,
and BBS6, which all interact with RET) may act as modifiers in the presence of the major
risk alleles, which may explain why they are only seen to increase disease risk in those
carrying the major risk allele.
Inheritance of multiple risk alleles may explain clinical heterogeneity and in a complex
model may also explain incomplete penetrance. HSCR genetics illustrates the importance
of considering the whole genome and studying complete biological systems. Recent
developments have made both of these objectives possible, though much more work
remains to be done before the work on HSCR genetics is complete.
[Figure 2.11 labels: RET pathway; endothelin pathway; ganglia cell colonization in the GI tract]
Figure 2.11: Major pathways in neural crest stem cell development with links to HSCR. This figure
illustrates the major pathways involved in neural crest stem cell development implicated by linkage and
association with susceptibility to HSCR. The major contributing genes are RET and EDNRB. Polymorphisms
of other genes whose products are important in these RET and endothelin pathways have been identified as
potential risk factors for HSCR. GI, gastrointestinal. (Adapted from Strachan T & Read A [2004] Human Molecular
Genetics, 3rd ed. Garland Science.)
Table 2.6 Major susceptibility regions for IBD identified by early linkage analysis and GWAS.
Region | IBD label | Crohn’s disease | Ulcerative colitis | Putative candidate gene
16q12 | IBD1 | Strong | None | CARD15 (2001)
12q13.2–q14.1 | IBD2 | Moderate | Moderate |
6p21.3 | IBD3 | Moderate | Moderate | MHC/HLA
14q11–q12 | IBD4 | Moderate | Moderate |
5q31 | IBD5 | Strong | Moderate | Cytokine cluster
19p13 | IBD6 | Moderate | Moderate |
1p36 | IBD7 | Moderate | Moderate |
[Figure 2.12 timeline labels: 1988–1996, twin concordance and ethnic diversity; 2001, NOD2/CARD15 (Hugot/Ogura); 2007, WTCCC1 GWAS; 2008–2009, Th17, endoplasmic reticulum stress, and gut barrier integrity; 2010, Crohn’s disease meta-analysis, 71 alleles]
Figure 2.12: Time-line of genetic investigations in Crohn’s disease. The figure shows the timing of major
developments in our understanding of the genetics of Crohn’s disease from 1988 to recent.
Table 2.7 CARD15 mutations and variations encode major risk alleles in Crohn’s disease – estimates
from case control cohort studies.
NOD2 is a NACHT-LRR family member with two N-terminal caspase activation and
recruitment domains (CARDs), a nucleotide-binding domain, and a series of leucine-rich
repeats (LRRs) at the C-terminal. NOD2 expression is mostly restricted to the cytosol in
monocytes and Paneth cells. Paneth cells, named after Joseph Paneth, are specialized cells
found in the small intestine and appendix. Their precise function is unknown, though
they are thought to have antibacterial properties, and may be important in immune
regulation and responses to bacteria in the gut. NOD2 activates the nuclear factor NF-κB
through interaction between the LRR domain and muramyl dipeptide (MDP) on
Gram-positive bacteria. This interaction enables the CARD domains of NOD2 to interact with
the protein kinase RICK (also RIP or TRAF). RICK induces K63-linked ubiquitinylation
of the signaling molecule IKKγ (NEMO), which in turn causes phosphorylation of the
NF-κB inhibitors, allowing NF-κB to initiate production of a series of pro-inflammatory
and regulatory cytokines, including interleukin (IL)-1β, IL-8, IL-10, and tumor
necrosis factor (TNF)-α (Figure 2.13). Under normal circumstances this process leads to host
recognition and destruction of bacterial pathogens in the gut. However, when CARD15
is present in the mutant form, the normal mechanism of signaling is interrupted and
there is decreased NF-κB production. This is not the whole story, however. Crohn’s
disease has been associated with infection with Mycobacterium tuberculosis, Pseudomonas
[Figure 2.13 labels: commensal bacteria; gastric epithelium; NOD2; NF-κB]
Figure 2.13: NOD2 immune interactions in maintenance of homeostasis for gut commensals. NOD2
polymorphism is the major susceptibility determinant for Crohn’s disease. NOD2 acts through a series of
pathways to influence the immune response to gut commensals. The TLR family are major components of the
immune system involved in immune homeostasis of the gut. Missense mutations in TLR4 (D299G and T399I) are
associated with susceptibility to Crohn’s disease. TLR4 expression is increased in the intestinal epithelial cells in
active disease, but is found only at low levels in healthy intestine. The figure illustrates bacterial invasion (shown
as tadpoles or small figures with tails) through the gut wall. Both TLR2 and TLR4 interact with NOD2 and each will
cause up- or downregulation of pro-inflammatory and regulatory cytokines. Polymorphisms that create higher or
lower activity of these immune compounds may have a significant influence on bacterial homeostasis in the gut.
This figure represents only part of what is beginning to look like a very complex picture for Crohn’s disease. NOD2
also interacts with NF-κB, and NF-κB promotes apoptosis and stimulates the production of both pro-inflammatory
and regulatory cytokines.
species and Listeria species. Crohn’s disease patients have exaggerated immune responses
in the gut and earlier genetic studies identified several other areas of significant linkage
for Crohn’s disease (Tables 2.6 and 2.7), suggesting other factors and pathways may also
play a role in susceptibility, in particular genes that regulate the immune response to
bacteria.
• ATG16L1 on chromosome 2q37 (previously IBD10, which had the strongest reported
association for Crohn’s disease in this study, P < 7.1 × 10⁻¹⁴).
• An unknown locus in the chromosome 10q21 region around rs10761659 (previously
IBD15, which is 14 kb telomeric of the ZNF365 gene and 55 kb centromeric of the
pseudogene antiquitin-like 4).
• An area around rs17234657 within a 1.2 Mb gene desert on chromosome 5q13.1
(previously IBD18).
In addition, four novel strong associations were identified:
• Chromosome 3p21 (previously IBD12). There are a number of candidate genes in
the area, of which MST1 (macrophage stimulating 1) is the most plausible. This gene
encodes a protein that affects macrophage function and macrophages have a major role
in phagocytosis – a key process in maintaining defense against bacterial invasion.
• The IRGM (immunity-related guanosine triphosphatase) gene on chromosome 5q33
(previously IBD19). IRGM encodes a GTP-binding protein that induces autophagy
and reduced IRGM activity could lead to persistence of intracellular bacteria, which is
consistent with the function of other genes identified as those carrying risk alleles for
Crohn’s disease.
• A cluster on chromosome 10q24.2 (previously IBD20). The likely candidate from
this cluster is NKX2-3 (NK2 transcription factor-related locus 3). Disruption of this
gene homolog in mice leads to abnormal development of the intestine and secondary
lymphoid organs. It has been suggested that mutations in this gene may disrupt gut
migration of antigen-responsive lymphocytes and influence the inflammatory response
in Crohn’s disease.
• Close to the PTPN2 gene on chromosome 18p11 (previously IBD21). PTPN2
(protein tyrosine phosphatase non-receptor type 2) is a key negative regulator of the
immune response. Members of this family have been associated with increased risk of
both rheumatoid arthritis and type 1 diabetes.
In addition to these 10 highly significant loci, there are a further eight chromosomal
regions identified as having moderate associations: 1q24, 5q23, 6p21, 6p22, 6q23, 7q36,
10p15, and 19q13. Most of these associations are unsurprising, e.g. the association with
the region of 6q23 which encodes the TNFAIP3 gene (TNF-α-induced protein 3). The
product of the TNFAIP3 gene inhibits TNF-α-induced NF-κB-dependent gene expres-
sion through RIP and TRAF-2-mediated transactivation signaling in the same way that
NOD2 works on NF-κB. This illustrates the common link between many of these genes
and autophagy and phagocytosis, both of which are key processes in the defense of the
gut from pathogenic bacteria and maintenance of homeostasis for normal gut bacteria.
Even the association with the 6p21 region is not surprising. 6p21 is the home of the
human MHC within which the HLA genes are to be found. HLA antigens are key players
in both adaptive and innate immunity, and therefore an association between an immune
inflammatory disease like Crohn’s disease and genetic variation in the MHC is highly
likely. Interestingly, the genetic association between Crohn’s disease and HLA has been a
matter of dispute, and is considerably weaker than that seen for many autoimmune and
viral diseases as well as some adverse drug reactions. Some have suggested this association
is due to the close proximity of the gene for TNF-α (TNFA), which maps within the
MHC and not the HLA genes.
[Figure 2.14 labels: biological pathways in Crohn’s disease; CARD15; TLR4; ATG16L1; IRGM; LKKR2; MTMR3; XBP1; CAPN10; ORMDL3; ERAP2]
Figure 2.14: Genetically identified pathways associated with increased risk of Crohn’s disease. The
five arrows point to five different gene groups: innate immune signaling genes (CARD15 or NOD2 and TLR4),
autophagy genes (ATG16L1, etc.), genes involved in endoplasmic reticulum biology (XBP1, etc.), genes involved
in Th17 T regulatory cell balance (IL23R, etc.), and a group of genes involved in the secondary immune response
(e.g. MHC, IL12B, etc.). The figure indicates the complexity of biological pathways and the likely interactive nature
of susceptibility alleles.
The association of such a large number of susceptibility alleles, at so many loci, most with
low relative risks, suggests that Crohn’s disease is a polygenic complex disease rather than
an oligogenic complex disease. The inherited component of the disease is low in Crohn’s
disease and it is possible that many cases involve the cumulative interaction of several
mutant alleles. In contrast, in HSCR, the inherited component is greater, there are fewer
risk alleles or mutations, and most cases involve only one major risk mutation. In some
cases of HSCR, however, the effect of the major mutation can be modified by inheritance
of other risk alleles.
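The contrast between a polygenic and an oligogenic model can be illustrated numerically. The sketch below assumes a simple multiplicative model in which each copy of a risk allele multiplies the odds of disease by that allele's odds ratio. This model is an assumption made here for illustration and is not claimed by the text; the odds ratios are invented, and the function name `combined_odds_ratio` is hypothetical.

```python
from math import prod


def combined_odds_ratio(per_allele_odds_ratios, risk_allele_counts):
    """Combine per-allele odds ratios under an assumed multiplicative model.

    per_allele_odds_ratios: odds ratio per copy of the risk allele at each locus.
    risk_allele_counts: copies of the risk allele carried at each locus (0, 1, or 2).
    """
    if len(per_allele_odds_ratios) != len(risk_allele_counts):
        raise ValueError("one allele count is needed for each locus")
    if any(count not in (0, 1, 2) for count in risk_allele_counts):
        raise ValueError("allele counts must be 0, 1, or 2")
    return prod(odds ** count for odds, count in zip(per_allele_odds_ratios, risk_allele_counts))


if __name__ == "__main__":
    # Polygenic pattern: many loci, each with a low (hypothetical) odds ratio of 1.2.
    polygenic = combined_odds_ratio([1.2] * 10, [1, 2, 0, 1, 1, 0, 2, 1, 0, 1])
    # Oligogenic pattern: one major (hypothetical) risk mutation plus one modifier.
    oligogenic = combined_odds_ratio([8.0, 1.5], [1, 1])
    print(f"Polygenic example: combined OR = {polygenic:.2f}")
    print(f"Oligogenic example: combined OR = {oligogenic:.2f}")
```

Under this assumption, no single low-risk allele contributes much, but carrying several of them together can raise the combined odds appreciably, which is the sense in which many cases may involve the cumulative effect of several alleles.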
Conclusions
One of the key concepts in this chapter and throughout the book is that complex diseases
are not simple and do not lend themselves to simple analysis. They are best defined using
the modified definition of Haines and Pericak-Vance (1998) as those where one or more
genetic variations (alleles) may increase or decrease the risk of a trait or disease. This
book is concerned with disease, but in discussing complex disease we use the terms trait and
disease interchangeably from time to time. This is not accepted by all, but doing so
occasionally enables us to be flexible when considering overlapping diseases (syndromes) and
when considering different subgroups within a disease population. It is therefore more useful
to be flexible than prescriptive in the application of terminology. Chromosomal and Mendelian
diseases are not genetically complex diseases; however, there are some gray areas where simple
definitions cannot be so easily applied.
Dismantling the definition produces a clearer understanding of what this book is all about.
The most important concept throughout the book is that the inheritance of an allele at a
locus increases or reduces the risk of the disease. This means that the allele is neither nec-
essary nor sufficient to cause the disease. Inheritance of an allele is simply associated with
either an elevated or a reduced risk. In this definition, the possibility of protective as well
as susceptibility alleles is considered. Not all risk alleles are associated with the occurrence
or initial pathogenesis of a disease. Alleles may be associated with the risk of a specific
disease subgroup within a population, e.g. severity defined by age of onset, disease
progression (including morbidity and mortality), or response to therapy. This illustrates how not
all genetic associations can be interpreted as tags (or markers) for disease-causing genes.
Complex diseases are varied, and we can consider three potential models: monogenic,
involving a single gene but not conforming to a Mendelian pattern; oligogenic, involving
several genes but a limited group; and polygenic, involving a larger group. These models
clearly overlap, but it is still useful to categorize disease like this if possible. The latter two
models not only illustrate the basic concepts of the genetics of complex disease, but also
indicate why we seek to identify risk alleles in complex diseases. Understanding the basic
principles set out here is essential in order to understand current research and the
potential applications of genetic investigations, such as those described in this book, in modern
medical practice now and in the future.
Further Reading
Books
Collins F (2010) The Language of Life: DNA and the Revolution in Personalised Medicine. Harper-Collins.
Haines JL & Pericak-Vance MA (1998) Overview of mapping common and genetically complex human disease traits. In: Approaches to Gene Mapping in Complex Human Disease (Haines JL & Pericak-Vance MA eds), pp 1–16. Wiley. This book provides the best definition of complex disease and is a good starter text on major issues in complex disease.
Lewis R (2011) Human Genetics – The Basics. Routledge.
Parham P (2009) The Immune System, 3rd ed. Garland Science. This is an excellent textbook providing a good background on human immunology and some examples of immunogenetics.
Strachan T & Read AP (2011) Human Molecular Genetics, 4th ed. Garland Science. This is an excellent basic genetics textbook that goes into more depth on many of the issues in this chapter, especially in chapters 2 and 3 and also in chapters 15 and 16.
Articles
Amiel J & Lyonnet S (2001) Hirschsprung disease, associated syndromes and genetics: a review. J Med Genet 38:729–739. This is an excellent review of HSCR and, together with Puffenberger et al. (1994), provides a basic starting point for students investigating this disease.
Beutler E, Felitti VJ, Koziol JA et al. (2002) Penetrance of 845G-A (C282Y) HFE hereditary haemochromatosis mutation in the USA. Lancet 359:211–218. This paper is a landmark paper on HFE and raises a number of important questions.
Brown MA, Pile KD, Kennedy LG et al. (1996) HLA class I associations of ankylosing spondylitis in the white population in the United Kingdom. Ann Rheum Dis 55:268–270. This paper is the gold standard paper for ankylosing spondylitis and HLA-B*27.
Cario E (2005) Bacterial interactions with cells of the intestinal mucosa: Toll-like receptors and NOD2. Gut 54:1182–1193. This paper provides some essential insight into the relationship between the genetics of Crohn’s disease and the developing understanding of the disease pathogenesis that has followed from these genetic studies.
Collins FS & McKusick VA (2001) Implications of the human genome project for medical science. JAMA 285:540–544. This is a very important review that provides the reader with a guide to and an illustration of the potential for understanding complex disease in the immediate post-genome period.
Cooney R, Baker J, Brain O et al. (2010) NOD2 stimulation induces autophagy in dendritic cells influencing bacterial handling and antigen presentation. Nat Med 16:90–97. This paper relates genotype to phenotype – an essential next step in complex disease.
Cuthbert AP, Fisher SA, Muddassar MM et al. (2002) The contribution of NOD2 gene mutations to the risk and site of disease in inflammatory bowel disease. Gastroenterology 122:867–874. This is a good study for students to look at for critical analysis and to understand basic pre-GWAS association work in Crohn’s disease.
Evans RM, Hui S, Perkins A et al. (2004) Cholesterol and APOE genotype interact to influence Alzheimer disease progression. Neurology 62:1869–1871.
Franke A, McGovern DP, Barrett JC et al. (2010) Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat Genet 42:1118–1125. This paper introduces meta-analysis and summarizes a series of GWAS of Crohn’s disease.
Henderson P & Satsangi J (2011) Genes in inflammatory bowel disease: lessons from complex diseases. Clin Med 11:8–10.
Hugot J-P, Chamaillard M, Zouali H et al. (2001) Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease. Nature 411:599–603. This, together with the paper by Ogura et al. (2001), was the first to identify the IBD1 gene as CARD15 on 16q12 – a major step forward in our understanding of the genetics of Crohn’s disease.
Inohara N, Ogura Y, Fontalba A et al. (2003) Host recognition of muramyl dipeptide mediated through NOD2. J Biol Chem 278:5509–5512.
Jostins L, Ripke S, Weersma RK et al. (2012) Host–microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature 491:119–124. This paper identifies 63 risk alleles for IBD.
Kullberg BJ, Ferwerda G, De Jong DJ et al. (2008) Crohn’s disease patients homozygous for the 3020insC NOD2 mutation have a defective NOD2/TLR4 cross-tolerance to intestinal stimuli. Immunology 123:600–605.
Lala S, Ogura Y, Osborne C et al. (2003) Crohn’s disease and the NOD2 gene: a role for paneth cells. Gastroenterology 125:47–57.
Lees CW, Barrett JC, Parkes M & Satsangi J (2011) New IBD genetics: common pathways with other diseases. Gut 60:1739–1753. This paper provides a current summary of the genetics of Crohn’s disease and the wider connotations of this type of research.
Lesage S, Zouali H, Cezard J-P et al. (2002) CARD15/NOD2 mutational analysis and genotype-phenotype correlation in 612 patients with inflammatory bowel disease. Am J Hum Genet 70:845–857. This paper opened the door to further exploration of the CARD15 gene in Crohn’s disease and marked an important development in the study of Crohn’s disease genetics.
Ogura Y, Bonen DK, Inohara N et al. (2001) A frameshift mutation in NOD2 associated with susceptibility to Crohn’s disease. Nature 411:603–606. This, together with the paper by Hugot et al. (2001), was the first to identify the IBD1 gene as CARD15 on 16q12 – a major step forward in our understanding of the genetics of Crohn’s disease.
Sennvik K, Fastbom J, Blomberg M et al. (2000) Levels of alpha- and beta-secretase cleaved amyloid precursor protein in the cerebrospinal fluid of Alzheimer’s disease patients. Neurosci Lett 278:169–172.
Stokin GB, Lillo C, Falzone TL et al. (2005) Axonopathy and transport deficits early in the pathogenesis of Alzheimer’s disease. Science 307:1282–1288.
Stoll M, Corneliussen B, Costello CM et al. (2004) Genetic variation in DLG5 is associated with inflammatory bowel disease. Nat Genet 36:476–480.
Taylor RW & Turnbull DM (2005) Mitochondrial DNA mutations in human disease. Nat Rev Genet 6:389–402. This review is highly informative for the student or clinician trying to understand the nature of mitochondrial disease and the mitochondrial genome.
The Thousand Genomes Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.
The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678. This is another landmark paper highlighting the development and application of new technologies (GWAS) in complex disease research, and it provides a mass of useful information for those studying complex disease. Together with the other papers on Crohn’s disease, this study provides some of the basic information for students studying Crohn’s disease as well as a further six diseases not discussed in this chapter.
Puffenberger EG, Hosada K, Washington Online sources
SS et al. (1994) A missense mutation of the http://hapmap.ncbi.nlm.nih.gov
endothelin-B receptor gene in multigenic This is the site of the results of the International
Hirschsprung’s disease. Cell 79:1257–1266. HapMap project.
Prince JA, Zetterberg H, Andreasen N et al. http://pngu.mgh.harvard.edu/~purcell/plink
(2004) APOE (epsilon)4 allele is associated
with reduced cerebrospinal fluid levels of A PLINK is a freely available, open-source,
(beta) 42. Neurology 62:2116–2118. whole-genome association analysis toolset
designed for quality control and analysis of
Schreiber S, Rosenstiel P, Albrecht M et al. GWAS data.
(2005) Genetics of Crohn disease, an arche-
typal inflammatory barrier disease. Nat Rev http://www.genet.sickkids.on.ca/StatisticsPage.
Genet 6:376–388. This is an excellent review html
of the genetics of Crohn’s disease. The informa- http://www.mousebook.org/mousebook-
tion and discussion provides insight not only catalogs/imprinting-resource
into Crohn’s disease, but is also a useful model This is an essential database for mouse imprint
to understand other similar conditions. genes and current research on imprinting.
In this chapter, we will consider the questions of how to identify risk alleles in genetically
complex disease, starting with the basic knowledge that informs study design and the dif-
ferent options that are available to us. We will look at study design from the past to the
present day and consider some of the advances. Having already seen how complex diseases
do not lend themselves to simple analysis and having been made aware in Chapter 2 that
we are dealing with mutations and polymorphisms that increase or reduce the risk of a
disease, it is possible to see how hard this task may be. In addition, it is also clear that there
may be complex interactions between different risk alleles, and between risk alleles and the
environment. Altogether this adds up to a difficult task and yet it is one that has attracted
considerable attention in medical research. We shall see how study design has advanced to
reach previously impossible heights.
sizes do not facilitate the identification of low-risk alleles. Good planning begins with a
few simple questions:
• How common is the disease?
• What is the evidence for a genetic component to the disease?
• What is known about the disease pathology that may indicate associations or linkage
with specific biological pathways?
Thus, in a population of 1000 with 10 new cases per year the incidence = 1/100.
Prevalence can be calculated from the simple equation:
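One widely used form of this relationship, given here as an assumption rather than as the author's own equation, applies when the population is stable and both the incidence and the average duration of disease are roughly constant:

```latex
\[
\text{prevalence} \approx \text{incidence} \times \text{average duration of disease}
\]
```

For instance, with the incidence of 1/100 per year from the example above and a purely illustrative average duration of 5 years, the prevalence would be approximately 5/100.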
Thus, if the incidence in siblings is 1/50 and the incidence in the population is 1/1000,
then λ = 0.02/0.001 = 20. Unless tested in very large populations, it is appropriate to
consider more than one estimate of λ and quote a range. When calculating this value for
the first time, we need to be aware of the problems associated with small numbers, which
often lead to high estimates that may prove false on further scrutiny.
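The calculation above can be sketched in a few lines of Python. This is a minimal illustration only: the function names (relative_risk, lambda_range) and the three study estimates are hypothetical, and exact fractions are used simply to avoid rounding noise.

```python
from fractions import Fraction


def relative_risk(incidence_in_relatives, incidence_in_population):
    """Return lambda: risk in relatives of cases divided by population risk."""
    if incidence_in_population <= 0:
        raise ValueError("population incidence must be positive")
    return Fraction(incidence_in_relatives) / Fraction(incidence_in_population)


def lambda_range(estimates):
    """Summarize several independent lambda estimates as a (low, high) range."""
    values = list(estimates)
    return min(values), max(values)


# Worked example from the text: siblings 1/50, population 1/1000.
lambda_s = relative_risk(Fraction(1, 50), Fraction(1, 1000))
print(float(lambda_s))  # 20.0

# Hypothetical estimates from three separate studies, quoted as a range.
print(lambda_range([12, 20, 35]))  # (12, 35)
```

Quoting the range rather than a single value reflects the caution urged above about estimates based on small numbers.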
Figure 3.1: Incidence and prevalence. The gray dots represent individual cases. The horizontal arrows are
time-lines. The figure illustrates the difference between incidence and prevalence. Incidence is a measure of
the number of new cases in a set period of time (often 1 year). Prevalence refers to the total number of cases at
a particular time. Incidence and prevalence can be very different or very similar depending on the severity of
disease (prognosis). When the disease is severe, survival times can be low and prevalence will match incidence;
where survival times or prognosis is good, then prevalence is often greater than incidence.
Prevalence
Prevalence refers to the number of cases at a given time point. On a specific date the num-
ber of reported cases may be 1000. This is different from incidence because it includes all
cases irrespective of when they were reported.
[Bar chart: frequency (y-axis, 0–100) of X-positive and X-negative individuals in the UK, France, and USA]
Figure 3.2: Incidence and prevalence in different populations. The figure illustrates three populations (e.g. UK,
France, and USA) with different frequencies of X disease. These values could have been measured as either incidence
(number of new cases in a set period) or prevalence (total number of cases at a time). The important point is that
incidence and prevalence for a given trait may vary between populations. The further apart these populations are
located geographically, the greater these differences are likely to be. Thus, the greatest difference in the frequency of
X disease is between the USA and the UK, while France, which is closer to the UK, has a smaller difference.
any informative (i.e. multicase) families with the disease. At this stage it is also important
to remember missing heritability. This reminds us that not all heritability is due to our
genes—the proportion of our heritability that is not determined by our genes is referred to
as missing heritability. The extent to which missing heritability occurs varies between dis-
eases. Inflated values for some of the data collected in the tests referred to in this book (see
Chapter 4 in particular) are the result of this missing heritability. For example, a sibling
relative risk of 50 may not all be attributed to sharing of the same genetic variation—the
shared environment can also play a significant role.
Family information
Any good geneticist will always look for informative families. To be informative, a family
must have more than one affected member (Figure 3.3). If there are enough informative
families, family studies can be performed. Linkage analysis based on multicase families
has been and still is considered the gold standard for genetic studies; however, multicase
families are rare in complex disease. In the absence of families, affected sibling pairs can be
studied. However, even if there are families or large numbers of sibling pairs and linkage
analysis is to be performed, the non-familial (sporadic) cases should not be overlooked.
Linkage analysis in families and sibling pairs is mostly parametric linkage analysis. This
means the parameters have to be known at the outset. One of these parameters is the pat-
tern of inheritance. Therefore, this suits Mendelian disease, but not complex disease where
the key parameter, i.e. pattern of inheritance, is by definition unknown. Non-parametric
linkage analysis can be used instead for linkage studies in complex disease.
It is important to remember that there are a number of caveats with respect to families in
genetic studies. Familial history is not always a sign of a genetic effect. Families share
environments, which can have a strong effect on susceptibility to a variety of different diseases.
The best example of this can be seen in an analysis of the likelihood of attending medical
school in the USA. This study found that attendance appeared to be inherited as an autosomal
recessive trait, when in fact the real determining factors were social status and parental income.
In a sense, the trait has a significant heritable component, but that heritability is not genetic.
The issue of the term heritability has been extensively discussed in Chapter 2.
Figure 3.3: Informative families. This simple comparison between two families (a and b) illustrates the
importance of there being multiple cases in a family if linkage analysis is to be applied. Here, family (a) is
informative and family (b) is uninformative because for linkage there needs to be more than one affected family
member in each generation. In practice, the picture may be more complicated and the decision of whether or
not to use families may rest on the actual number of affected cases and the number of cases in each generation.
The second caveat regarding family studies is that genetic risk in families may be attribut-
able to different risk alleles compared with those involved in sporadic disease. However,
caveats aside, families do represent the genetic high ground in such studies and for many
diseases, especially near-Mendelian and oligogenic complex disease, identifying risk
alleles in families has led to significant findings. These findings have in some, but not
all cases, also led to findings in sporadic non-familial cases. Linkage analysis is no longer
restricted to assessing single genes or chromosomal regions as in the past. In the
present decade, linkage studies can include the whole genome. Indeed, many genome-
wide linkage studies (GWLS) have been performed. However, this does require large
family pools. It is also important to remember that not all families are informative
(e.g. Figure 3.3b) and even where there are multiple cases in diseases with late onset or
severe prognosis, recruiting family members for studies is often difficult.
Relative risk (λ)
Apart from using families in linkage analysis, we can perform a few basic tests to provide
an indication of the size of the heritable component of a disease. One such test is based
on relative risk, which can be calculated by comparing the known incidence or prevalence
in families, with the incidence or prevalence in the population as a whole. The risk (λ) can
be calculated for specific family members, for example λs is a measure of the risk for sib-
lings of the affected (and this is the most commonly calculated risk). The risk may be dif-
ferent in different family members. Some diseases are known to be more common in the
female members of the population, such as in many, though not all, autoimmune diseases.
In these cases, calculating the risk for daughters of affected mothers may be more appro-
priate. The λ value gives a good, but crude measure of the relative risk within a population
and in practice when looking at published data it is always better to quote ranges from sev-
eral studies rather than from individual studies. Values quoted for Mendelian disease may
be as high as several thousand for dominant conditions and several hundred for recessive
diseases. Figures for complex diseases vary from zero to the low hundreds (see Table 3.1).
It is also important to remember that this is a measure of heritability and genetic variation
may only make up part of this.
Twin studies
Twins represent a very useful study group in genetics. Twins may be either identical or
non-identical. Identical twins are the product of a single fertilization and share the same
genome, i.e. they are monozygotic twins. Non-identical twins occur when two fertil-
izations of different gametes occur at approximately the same time giving rise to two
individuals who, though born at the same time and of the same parents, are unlikely to
have a larger degree of shared genetic identity than any two other offspring of the same
parents conceived on different occasions (Figure 3.4). Non-identical twins are referred to
as dizygotic twins.
Twins can provide very useful information about the genetic component for a disease.
Simple comparison of disease concordance rates for monozygotic versus dizygotic twins
indicates the extent to which the genome plays a role in disease. Twin data has to be handled
with caution because concordance data for large disease populations is more reliable
than for small patient populations. However, large numbers of twins are unlikely to be
available in all but the most common of complex diseases because twins are themselves
quite rare. Approximately one in every 200 pregnancies gives rise to twins and most of
these are dizygotic twins. Some examples of concordance data for different diseases are
given in Table 3.2 and elsewhere throughout this book.
Figure 3.4: Monozygotic and dizygotic twins. Twinning occurs in one of two ways. Fraternal (non-identical
or dizygotic) twins occur when two eggs are independently fertilized at approximately the same time. Dizygotic
twins develop in the womb at the same time, but have separate embryonic membranes. Dizygotic twins have
the same level of genetic identity to each other as their siblings. Monozygotic twins arise from the same single
fertilization event and are produced by the division of the embryo very early in gestation. The illustration shows
this division after the two-cell stage, but it may occur anytime before, such as splitting at the morula or in the
early blastocyst stage. Monozygotic twins are essentially genetically identical.
                         Concordance (%)
Disease                  Dizygotic twins    Monozygotic twins
Type 1 diabetes          15                 30–40
Crohn’s disease          7                  37
Type 2 diabetes          10                 90
Rheumatoid arthritis     3.6                15.4
Multiple sclerosis       5.4                25.3
Bipolar disorder         6                  43
Autism                   0–21               ~60
Schizophrenia            4                  15–33
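The comparison of monozygotic and dizygotic concordance described above can be sketched as follows. This is an illustrative summary only: the ratio is a crude indicator, not a formal heritability estimate, and the names CONCORDANCE, mz_dz_ratio, and summarize are hypothetical. The values are copied from the concordance table; diseases quoted as ranges are omitted to keep the sketch simple.

```python
# Concordance values (%) copied from the table above; diseases whose
# values are quoted as ranges are left out to keep the sketch simple.
CONCORDANCE = {
    "Crohn's disease": (7.0, 37.0),       # (dizygotic, monozygotic)
    "Type 2 diabetes": (10.0, 90.0),
    "Rheumatoid arthritis": (3.6, 15.4),
    "Multiple sclerosis": (5.4, 25.3),
    "Bipolar disorder": (6.0, 43.0),
}


def mz_dz_ratio(dizygotic, monozygotic):
    """Ratio of monozygotic to dizygotic concordance (a crude indicator only)."""
    if dizygotic <= 0:
        raise ValueError("dizygotic concordance must be positive")
    return monozygotic / dizygotic


def summarize(table):
    """Return {disease: MZ/DZ ratio}, rounded to one decimal place."""
    return {d: round(mz_dz_ratio(dz, mz), 1) for d, (dz, mz) in table.items()}


for disease, ratio in summarize(CONCORDANCE).items():
    print(f"{disease}: MZ/DZ = {ratio}")
```

A ratio well above 1 is consistent with a genetic contribution, but, as noted above, ratios based on small twin collections should be treated with caution.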
[Diagram: a healthy and a diseased liver, with contributing factors labeled Unknown A, Unknown B, and Genetics (known factor C)]
Figure 3.5: Disease pathology. The figure illustrates a healthy (light gray) and an unhealthy, diseased liver
(dark gray with scarring). The figure shows that many of the factors that cause early disease are often
unknown to us, and reminds us that we tend to see patients (represented by the organs here) with specific signs
and symptoms when they present at a clinic late in disease progression. This means the opportunity to gain
knowledge of the early biological processes through which the disease arises is lost to the medical community,
preventing earlier intervention. However, our inherited genes do not change and knowledge of the risk alleles,
shown as factor C (the known factor), in complex disease is potentially the key to unraveling this conundrum.
group would report a significant association with a specific allele and another would fail to
replicate the original findings. This happened because many studies were poorly designed
using small numbers, poorly matched or inappropriate controls, and badly defined patient
groups. In general, much of this has now been dealt with. Large-size, multicenter studies
using well-defined patient and control groups have been able to identify significant risk
alleles for many diseases and many of these reports have been replicated. However, studies
still report weak associations that are not replicated by others. The reasons why this still
happens vary. Some studies are still poorly designed, but in many cases the differences
found reflect the different populations studied.
It is essential to understand the history of these studies in order to critically evaluate the
literature prior to undertaking any study of the genetics of a complex disease. Questions
about quality and reliability of previous studies are crucial if we intend to build on pre-
vious research. It is also important not to waste research resources by simply repeating
previous studies unless there is a bona fide reason to do so. Information regarding previ-
ous studies and potential risk haplotypes is essential in study planning. There may be very
little mileage in reconsidering serially confirmed associations and it may be more fruitful
to look at previously ignored chromosomal regions or genes.
Disease pathology as a means to identifying candidate pathways
Though our knowledge of early disease pathology is often incomplete there is a great deal
of information about disease pathology in the literature. Our task is to sift through the data
and identify key biological pathways that, when stressed, may contribute to the disease
expression. Biological pathways are the key to unpacking disease pathology and this process
is itself the key to identifying potential candidate genes in complex disease. For example,
pathways associated with bone formation and with inflammation may be critical in rheu-
matoid arthritis. Where there are well-established hypotheses, it may be appropriate to
consider the constituent pathways that have led to the development of that hypothesis. For
example, early hypotheses suggested that Crohn’s disease may arise from either an abnor-
mal immune inflammatory response to gut bacteria or as a food allergy. Recent evidence
favors the former and indeed appropriate risk alleles have been identified that indicate the
immune response to gut bacteria is a key process in Crohn’s disease. Not all disease-related
genes are involved in immunity. Genes impacting on lipid metabolism are known to be
important in coronary artery disease and some of the same genes may also influence risk in
Alzheimer’s disease. In neurological diseases and some mental disorders (bipolar disorder
and schizophrenia), genetic associations have been identified with genes such as DISC1,
GRM7, and GABRB1, all of which are important in brain cell activity and neurotransmis-
sion (especially GABRB1). A more detailed consideration of these is given in Chapter 4.
When considering the pathology of a specific disease we should also consider what is
known about the pathology of other similar diseases. Many diseases are grouped into syn-
dromes or collections. Sharing of susceptibility alleles between diseases is not unusual, and
therefore looking at related diseases with similar signs and symptoms may help to identify
important pathways. A primary example of the latter is autoimmune disease where many
of the major autoimmune diseases have associations with the HLA 8.1 haplotype or with
members of the HLA DRB1*04 family of alleles. Examples include type 1 diabetes mellitus,
the autoimmune liver diseases autoimmune hepatitis (AIH) and primary sclerosing cholangitis,
rheumatoid arthritis, and systemic lupus erythematosus (SLE).
Before we get down to the hard business of study planning there are one or two other questions that it is important to ask
The remaining information required for planning a study is illustrated in the organizational
chart (Figure 3.6) and includes:
• What is the average (mean) age of onset of the disease?
• Are there any other centers investigating the genetics of the disease?
• What is the likelihood of obtaining research funding to study the disease?
Age of onset
Age of onset is an important statistic in studying any disease. Working with children
presents a number of moral and ethical issues (discussed in Chapter 11). However, with
young families, recruiting parents and other family members for studies can be easier than
working with older families where family members are more likely to be diversely spread
and less likely to be in contact. In many cases, early-onset diseases are also more likely to
have a Mendelian or near-Mendelian pattern of inheritance. However, there are exceptions:
several common complex disorders, such as autism, attention deficit hyperactivity disorder
(ADHD), childhood obesity, and asthma, present in childhood.
Other centers
It is essential to consider the competition when planning studies. There is very little to be
gained from competing with a rival who has access to larger patient numbers and better
resources. It is more useful in such a scenario to collaborate. However, collaboration is not
always possible or appropriate. Awareness of rivals can be very useful in planning. There
are usually several ways to approach key questions and taking an alternative approach
may be very fruitful. For example, rather than looking for genetic associations with the
disease per se, it may be more appropriate to consider specific subgroups within the disease
[Organizational chart. Key information: heritable component (twin studies, family studies, sibling relative risk (λ)); age of onset and incidence; current knowledge; competitors; quality of diagnosis; clinical variables; funding options (preliminary “pump priming” to advanced major funding)]
Figure 3.6: Planning a study: things to consider before proceeding. The organization chart illustrates the key
points that need to be considered before commencing with the detailed planning of a study. Most are common
sense, but it is important to consider each of these. This can prevent misuse of time and resources, and also lead
to new approaches to a particular disease. For example, based on the information on subgroups, we may choose
to investigate one particular group within a disease cohort rather than all cases.
particular diseases. Smaller charities are particularly helpful in funding short, pump-
priming projects as a first step in a research program.
[Flow chart: prevalence/incidence; segregation analysis (estimate of number of risk alleles); association analysis; identify biological factors in disease]
Figure 3.7: Logical sequence for disease gene identification prior to the HGM. This figure illustrates the
normal sequence of events proposed for studies designed to identify disease genes prior to the publication of
the HGM. However, this sequence is more suitable for Mendelian disease than for complex diseases. Segregation
analysis has rarely been performed in complex diseases. Indeed, many complex diseases have never even been
subject to linkage analysis. The sequence that suggests association analysis should only be performed after
linkage analysis is not practical in many complex diseases. This “logical” sequence suits the rare, highly penetrant
near-Mendelian complex diseases, but does not work with diseases which are mostly sporadic or where
transmission is horizontal, e.g. many infectious diseases. Only when there are sufficient family cases can linkage
be performed. Nevertheless, this approach has been used in a number of diseases at least once (e.g. type 1
diabetes). This sequence is especially useful where there are a reasonable number of cases in the collection
(i.e. the disease or trait is common) and when possible it remains a logical sequence.
cloning was mostly applied to Mendelian disease and rarely (if at all) used in complex
disease. Today, in the post-genome period, the precise location of most human genes is
known and the first steps in this method (cloning) have therefore become redundant. In
addition, since the availability of the SNP Map (http://www.ncbi.nlm.nih.gov/SNP/) and
the HapMap (http://hapmap.ncbi.nlm.nih.gov/) the majority of the mutations and poly-
morphisms are also known, so the whole process of testing candidate genes for relationships
with disease susceptibility can be short circuited.
The problem with linkage analysis is that it requires informative families. Unfortunately,
many complex disease cases are sporadic, without a family history, so linkage cannot be
applied to them, and the number of familial cases available is often insufficient to ensure
adequate statistical power in any linkage analysis undertaken.
Even where there are significant numbers of family members, such as in cardiovascular
disease, diabetes, and depression, cases do not cluster in a Mendelian fashion. This is partly
because most common diseases are complex and phenotype is determined by the interac-
tion of several factors, both genetic and environmental. Individual genetic variations often
have relatively small effects on the disease risk and are more likely to be detected in large-
scale case control association studies.
Association analysis in the pre-genome era
Association analysis is perhaps the most important method used to identify risk alleles in
complex disease. Case control association analysis tests for correlation between the inheritance
of one or more genetic markers or alleles and a disease. However, unlike linkage analysis,
association analysis does not establish a physical link. The approach compares two groups
from the same population, usually (though not always) cases (patients) and healthy controls.
The history of association analysis is one of mixed fortunes and, among conventional
geneticists who deal with Mendelian diseases, association analysis has had a very bad
reputation. However, this bad reputation is not entirely deserved, as discussed below.
Some authors refer to the common disease common variant hypothesis (CDCVH) in con-
sidering association studies and even go so far as to suggest it is the foundation for associa-
tion studies, especially GWAS. It may be useful to consider this idea before embarking on
an association study. The hypothesis predicts that common disease-causing alleles will be
found in all human populations with specific diseases.
Studies starting in the 1960s and 1970s began to identify both genetic linkage and genetic
associations in complex disease. One of the major areas for association analysis was the
human major histocompatibility complex (MHC) on chromosome 6p21.3, with the first
association reported in 1967. The clinical need to HLA-match patients requiring either
bone marrow or kidney transplants revealed an abundance of certain HLA phenotypes
in patient groups with specific presenting diagnoses. Many of these genetic associations,
though not all, have stood the test of time. It is important to acknowledge that there have
been significant developments in the methods used for HLA typing, with a major shift from
what was low (slow)-resolution phenotyping to very high (rapid)-resolution genotyping
(see Chapter 6). It is also fair to say that the majority of these genetic associations were
poorly understood at the time and to some extent geneticists are still cautious when
considering the human MHC.
Expectations in the earlier association studies were quite different from those of today. A
study reporting an association with an increased risk [measured as odds ratio (OR) or rela-
tive risk] of less than 3 would be unusual. Almost all studies expected large risks and studies
were often small. Many of the statistical principles considered in Chapter 5 were not con-
sidered in those early studies. Consequently, association analysis developed the bad reputa-
tion referred to above, but some good things have arisen from this bad reputation. The high
number of inconsistencies between studies created a platform from which statisticians were
able to critically review the past, and launch new, better guidelines for research in this area.
There are a few exceptions to the statement above. For example, large OR values above 2 or
3 were reported for PTPN22 in type 1 diabetes and rheumatoid arthritis, and for CARD15
in Crohn’s disease. Many of the early associations in pharmacogenetics also reported high
OR values (see Chapter 8). Other than these, early studies outside the MHC rarely found
strong associations with OR values of 3 or more. As a consequence, these studies were
more susceptible to finding false positives that proved difficult to replicate. Of course with
small studies many weak (but important) associations were also missed and for every false
positive it is almost certain there were several false negatives.
It is important when considering data to remember that OR values are rough calculations
at best and should always be considered within the range identified by the confidence
intervals (CIs). In addition, some of the early associations remain controversial even
today despite multiple studies and meta-analysis.
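To make the point about confidence intervals concrete, the sketch below computes an OR and an approximate 95% CI from a 2 × 2 table. It assumes Woolf's logit method, which is one common choice and not necessarily the one used in the studies discussed here. The function name odds_ratio_ci and all counts are hypothetical.

```python
import math


def odds_ratio_ci(case_pos, case_neg, ctrl_pos, ctrl_neg, z=1.96):
    """Odds ratio with an approximate 95% CI (Woolf's logit method)."""
    counts = (case_pos, case_neg, ctrl_pos, ctrl_neg)
    if min(counts) <= 0:
        raise ValueError("all four cells must be positive for this method")
    odds_ratio = (case_pos * ctrl_neg) / (case_neg * ctrl_pos)
    # Standard error of ln(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(sum(1.0 / n for n in counts))
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: identical carrier proportions, small versus large study.
small = odds_ratio_ci(12, 28, 5, 35)
large = odds_ratio_ci(120, 280, 50, 350)

for label, (or_value, lower, upper) in (("small", small), ("large", large)):
    print(f"{label}: OR = {or_value:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Both hypothetical studies give the same OR, but only the larger one has a CI that excludes 1. This is why a point estimate from a small study is best read alongside its interval.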
use in family-based studies. The test is based on the probability of equal transmission of
an allele or genotype from parents to affected and unaffected offspring. If the marker allele
in the group being tested is equally transmitted, then there is no disequilibrium (that is,
there is equilibrium); if it is not equally transmitted, then there is transmission disequilibrium.
If sufficient families are tested, the statistical significance of the association can be assessed.
Consider a single gene with alleles A and B. Using heterozygous parents in a TDT test, the
genotype distribution among tested offspring should be 1:2:1, i.e. we would expect 25% to
be homozygous genotype AA, 50% to be heterozygous genotype AB, and 25% to be homo-
zygous genotype BB, if the transmission is in equilibrium. Otherwise there is transmission
disequilibrium suggesting an association with one or other allele (A or B) with the disease.
Usually the sample consists of a set of families each with four individuals, i.e. the affected
and unaffected offspring and the two parents, though in some studies trios can be used
(affected cases and parents only) and occasionally families with more than one affected
case can also be used. A variation on the original TDT is the so-called FBAT. Both TDT
and FBAT are family-based association tests and can be used in most complex diseases.
However, for many diseases they are rarely used because obtaining family material is dif-
ficult for the reasons discussed above (Figure 3.8).
[Pedigree diagram: four families, each with two heterozygous (A, B) parents; genotypes of the affected offspring are B, B; A, B; A, B; and B, B]
Figure 3.8: TDT. The TDT test is a family-based association test. Using heterozygous parents, affected and
unaffected siblings are tested to see if the alleles are transmitted equally as expected. Unaffected siblings are not
always required for TDT. This illustration shows four families with two potential alleles A and B. All of the parents
are heterozygous AB. If the alleles are equally transmitted, then the frequency of the genotypes in the affected
offspring should be 1:2:1 in the order AA, AB, and BB, respectively, or alternatively each allele should be found at
a frequency of 50%. In this case, two of the offspring have the BB genotype and two are heterozygous AB positive
(the frequency of the B allele is 75% compared with 25% for the A allele). This suggests disequilibrium with a
preponderance of the B allele. In a larger cohort of cases it would be possible to analyze the data and be more
precise about the statistical significance of this observation.
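The counts in Figure 3.8 can be turned into a test statistic. The sketch below assumes the allele-counting, McNemar-type form of the TDT, which compares how often each allele is transmitted from heterozygous parents. This is a common formulation but differs slightly from the genotype-ratio description given in the text. The function names are hypothetical.

```python
import math


def count_transmissions(offspring_genotypes, allele="B"):
    """Count transmissions of `allele` versus the other allele.

    Assumes every parent is heterozygous (A/B), so each affected child
    contributes two informative transmissions, one from each parent.
    """
    transmitted = sum(g.count(allele) for g in offspring_genotypes)
    other = 2 * len(offspring_genotypes) - transmitted
    return transmitted, other


def tdt_statistic(transmitted, other):
    """McNemar-type TDT chi-square (1 degree of freedom) and its p-value."""
    total = transmitted + other
    if total == 0:
        raise ValueError("no informative transmissions")
    chi2 = (transmitted - other) ** 2 / total
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 d.f.
    return chi2, p_value


affected = ["BB", "AB", "AB", "BB"]  # the four affected offspring in Figure 3.8
b, a = count_transmissions(affected, allele="B")
chi2, p = tdt_statistic(b, a)
print(b, a, chi2, round(p, 3))
```

With only eight transmissions (six of B, two of A), the excess of B gives a chi-square of 2.0, which is not statistically significant. This matches the caption's point that a larger cohort is needed before drawing a conclusion.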
[Diagram: DNA is collected from disease (patient) and control groups and tested for all SNPs; the frequency of a single SNP allele is shown in patients (black) and controls (white)]
Figure 3.9: GWAS. The illustration shows two populations of diseased patients and healthy controls. The DNA
samples for each individual (shown as small dots) in each group are collected. The frequency of a series of
SNPs (around 500,000 to 1 million plus in some studies) across the whole genome is tested in each group on a
commercial chip or platform. The frequency data for one SNP allele in each group is shown in the gray sphere at
the bottom of the figure. Data for each SNP are subject to statistical testing to determine whether the allele
frequencies differ significantly between the two groups. The predominance of black dots in this illustration suggests an
association with this allele.
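The per-SNP comparison described in Figure 3.9 can be sketched as a test on a 2 × 2 table of allele counts. The example below is a minimal illustration in Python, assuming a Pearson chi-square test on allele counts (one of several tests used in practice); the counts are hypothetical and the function name is illustrative.

```python
import math

def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square (1 d.f.), P value and odds ratio for a 2x2 table
    of allele counts in cases and controls."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    numerator = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    denominator = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
                   * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    chi2 = numerator / denominator
    # Upper tail of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio

# Hypothetical counts for one SNP: 1000 patients and 1000 controls,
# i.e. 2000 chromosomes in each group.
chi2, p, odds = allelic_chi2(case_alt=700, case_ref=1300,
                             ctrl_alt=600, ctrl_ref=1400)
print(f"chi2 = {chi2:.2f}, OR = {odds:.2f}, P = {p:.1e}")
```

In a genome-wide study the same calculation is repeated for every SNP on the chip, so a P value that looks convincing for one SNP in isolation has to be judged against a much more stringent genome-wide threshold.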
analysis within the target haplotypes to include more data (see Chapters 5 and 12). Imputation
can be used provided there are sufficient fully genotyped samples with complete haplotypes
in a reference database. Missing genotypes in a set of study samples are inferred from
the expected pattern of linkage disequilibrium in those reference haplotypes. Essentially, the
method replaces missing data with substituted values. This means that any bias generated
by discarding data from samples that are not fully genotyped is avoided.
The level of resolution from a study can be significantly increased using imputation software
to infer SNP genotypes through known linkage disequilibrium, so that two or even four times
as many SNP genotypes can be assigned as were actually tested (Figure 3.10). Imputation
requires specialized computer software such as PLINK and access to good-quality databases
such as that generated by the 1000 Genomes Project. The greater the number of whole-genome
sequences put into public access systems, the greater the quality of the imputed
data will be. Of course, the success of this method depends on the degree of linkage
disequilibrium between tagged SNPs on the target haplotypes. The beauty of genome-wide
studies, at least in the initial stages, is that they do not require a hypothesis. This allows us to
Figure 3.10: Imputation analysis. The figure illustrates six different haplotypes in six individuals numbered 1–6
for a combination of six biallelic genes illustrated as double rings. Each gene is polymorphic; five of the genes
have been genotyped. There are two possibilities for all six genes (i.e. two alleles). The alleles are illustrated
by gray shading or no shading in those that have been genotyped. The gene illustrated in black has not been
genotyped. Imputation analysis can enable the genotype of the black gene on this haplotype to be estimated
based on the known pattern of linkage disequilibrium for the surrounding genes for which these data have been
obtained. Therefore, if we know that individuals with the haplotype which runs (top to bottom) gray, white, gray,
white always carry the gray allele at the fifth position, then we can assign this allele to individual 1 at position 5. If
we were to know that all carriers of white at position 4 and gray at position 6 also carry gray at position 5, we can
also impute the gray allele for these individuals even though they have not been genotyped for this position. This
imputed data can be included in the analysis. As linkage disequilibrium is so strong in certain areas of the human
genome, imputation analysis from initial GWAS data can massively extend the studies. Thus, an initial GWAS
based on 300,000–400,000 SNPs can generate data for up to 1 million or more markers.
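The logic of Figure 3.10 can be expressed as a toy example. The sketch below, in Python, assumes the simplest possible rule: find reference haplotypes that agree with the study haplotype at every genotyped position and copy the most common allele at the untyped position. The reference panel is hypothetical and the function name is illustrative; production imputation software uses probabilistic haplotype models rather than exact matching.

```python
from collections import Counter

def impute(study_haplotype, reference_panel, missing="?"):
    """Fill untyped positions from reference haplotypes that match the
    study haplotype at every typed position.

    Returns the completed haplotype and the lowest fraction of matching
    reference haplotypes that carry the allele chosen at any filled position.
    """
    typed = [i for i, allele in enumerate(study_haplotype) if allele != missing]
    matches = [ref for ref in reference_panel
               if all(ref[i] == study_haplotype[i] for i in typed)]
    if not matches:
        return study_haplotype, 0.0  # nothing to copy from: leave as missing
    imputed = list(study_haplotype)
    support = 1.0
    for i, allele in enumerate(study_haplotype):
        if allele == missing:
            best, count = Counter(ref[i] for ref in matches).most_common(1)[0]
            imputed[i] = best
            support = min(support, count / len(matches))
    return "".join(imputed), support

# Hypothetical reference panel; G = gray allele, W = white allele.
reference_panel = ["GWGWGG", "GWGWGG", "WGWGWW", "WGGWGG", "GGWGWW"]
study = "GWGW?G"  # position 5 has not been genotyped
result, support = impute(study, reference_panel)
print(result, support)
```

Here both matching reference haplotypes carry the gray allele at position 5, so it is assigned to the study haplotype. Where linkage disequilibrium is weaker, the matching haplotypes disagree and the support value falls, which mirrors the dependence of imputation quality on the strength of linkage disequilibrium.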
look more freely at the genome and identify hitherto unthinkable genetic relationships.
This can lead to favored hypotheses being rejected or being accepted, or in extreme cir-
cumstances it can turn some favored hypotheses on their heads. In all cases it is likely to
identify new risk alleles.
Selecting this approach in an association study would depend very heavily on previous
studies and sample size (see below). In the absence of any data at all, provided sufficient
candidates could be recruited, this strategy would be a good first choice. Even when num-
bers are small, GWAS can be performed, but it is essential to understand the limits of
using these studies on small numbers. The reason GWAS are based on large numbers is
that they are required to have sufficient statistical power (see Chapter 5) to detect weak
genetic effects with a high degree of confidence. If we look at small numbers we may only
be able to detect strong associations with a reasonable degree of confidence.
If we apply a non-hypothesis-constrained approach we cannot claim to be testing a specific
hypothesis; we are simply looking for associations (or linkage) with tagged SNPs or
other genetic variants that mark areas of statistical significance and therefore genetic interest.
The great thing about this approach is that it is hypothesis generating. Thus, from the
ashes of hypothesis-free GWAS a mass of data arises that can identify regions around genes
that contribute to biological systems that may be of specific importance in understanding
the pathology of the disease under study. This is very much a first phase and a first study
would require replication to validate any findings.
Investigating pathways
Proteins almost always work in groups or pathways. Systems biology is a new approach to
biological processes that concentrates on the study of interaction of pathways. Those who
study the genetics of complex disease can learn a lot from considering the ways in which
pathways interact to form systems. If we take this approach to studying a complex dis-
ease, we would need to have a reasonable amount of prior data or a plausible hypothesis.
This is likely to be a phase 2 or phase 3 investigation based on prior knowledge. Evidence
from recent studies in Crohn’s disease indicates the value of considering processes rather
than individual genes in complex disease. The identification of both NOD2
(CARD15) and the autophagy gene ATG16L1 indicates links to pathways associated
with immune tolerance to Gram-positive commensal bacteria in the gut. Most studies of
Crohn’s disease now look for links to these and related pathways.
specified loci. Gene loci were selected where there was at least some potential functional
relationship between the product of the gene under investigation and the disease. The
problem with this candidate gene approach is that our knowledge of the biological rel-
evance of most human genes is limited by our lack of understanding of human biology
and disease pathology. If we take a narrow view and study only one or two genes, we will
almost certainly miss a great deal. If we cast our nets wider, perhaps studying genes in
whole pathways and systems, then there is more chance of finding significant associations.
Investigation of a single gene is less likely in the present era than it would have been 20
years ago. However, where prior studies indicate a strong relationship with a particular
gene or region close to a gene, then it may be relevant to perform high-resolution genotyp-
ing of that gene to identify polymorphisms (alleles) associated with susceptibility to the
disease under study. It may occasionally be prudent to investigate a region or haplotype
one gene at a time. This can be useful if resources are limited and it can also be helpful
to produce data for initial funding applications. Often researchers will use small funds
to pump-prime preliminary studies to gain data for larger research funding applications.
Candidate gene studies have rarely been very successful in identifying associations with diseases.
Most associations that were initially positive have failed to be confirmed. The reason
for this lies in the selection of the candidate, which depends on the validity of the biological
hypothesis being tested, which in turn depends upon our knowledge of the disease pathogenesis.
Our knowledge of the early pathogenesis for most diseases is relatively poor; patients
present late in the disease process and identifying biological features of the early disease is
impossible. However, the good news for geneticists is that while pathology changes, the
genome does not. The future of medicine lies in early identification and prevention of dis-
ease. To achieve this goal it is necessary to identify risk alleles and use genetic studies to
inform the debate on disease pathogenesis. In 2015, this almost always means throwing the
net as wide as possible, i.e. genome-wide, at least in phase 1 of the study and coming back to
regions, pathways, haplotypes, and candidate genes later in the study as our knowledge base
progresses and we focus on the real disease-causing polymorphisms (Table 3.3).
Table 3.3 Current organization for genetic studies in complex disease including options for genome-wide linkage and genome-wide association analysis.
Linkage                         Association
Collect families                Collect cases
Genome-wide studies (GWLS)      Genome-wide studies (GWAS)
Replication                     Replication
Chromosomal regions             Chromosomal regions
High resolution                 High resolution
Replication                     Replication
Extended haplotype(s)           Extended haplotype(s)
High resolution                 High resolution
Target gene                     Target gene
OR (a measure of the genetic impact) is high, then a smaller sample size can be used with
a reasonable degree of statistical confidence. When the expected OR is low, the sample size
needs to be greater to achieve the same level of statistical confidence. Power calculations
are particularly important in relation to negative studies that fail to find any association. It
has been suggested that many studies have failed to replicate original results because of
a lack of statistical power in the replication cohort. This is true in many studies,
but not in all. Once someone else has published their work it can be very difficult to publish
results that contradict the original findings. Indeed, there is likely to be a publication
bias in favor of the initial studies. False negatives do occur, but so do false positives.
Looking at the early history of association studies we can see that the most frequently rep-
licated findings are those where the level of association is high, with an OR greater than 3.
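The relationship between the expected OR and the sample size can be illustrated with a rough calculation. The sketch below, in Python, assumes a two-sided comparison of allele frequencies between equal numbers of cases and controls, using the normal approximation for two proportions and treating the two alleles of each individual as independent. The default significance threshold of 10^-7 and the control allele frequency of 0.2 are hypothetical choices for illustration, and the function names are not taken from any package.

```python
import math
from statistics import NormalDist

def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by the control frequency and allelic OR."""
    return odds_ratio * control_freq / (1 + control_freq * (odds_ratio - 1))

def cases_needed(control_freq, odds_ratio, alpha=1e-7, power=0.8):
    """Approximate number of cases (with an equal number of controls)
    needed to detect the given allelic OR."""
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    chromosomes = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                    + z_beta * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
                   / (p1 - p0) ** 2)
    return math.ceil(chromosomes / 2)  # two chromosomes per individual

for odds in (1.2, 1.5, 3.0):
    print(odds, cases_needed(control_freq=0.2, odds_ratio=odds))
```

Under these assumptions the required number of cases falls steeply as the OR rises, which is consistent with the observation that the most frequently replicated early findings were those with a high OR. A formal power calculation for a real study should use dedicated software and the genetic model appropriate to the disease.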
Healthy controls
Most studies refer to their control group as healthy controls. This is at best a vague descrip-
tion and is mostly incorrect. What are healthy controls? In general terms, the authors of
such studies mean individuals who have not been diagnosed as positive for the disease under
study. It is arguable that comparator populations should contain the normal number of cases
of other diseases that are not included in the study. Exclusion of all diseases could create a
genetic bias in the controls. Therefore a good comparator population will not all be healthy.
Matched controls
Matched implies a degree of similarity between the patient group and the control group.
Most studies use racially and ethnically matched controls. This makes sense because there
are racial and ethnic differences in the distribution of different gene polymorphisms.
In addition, there are racial and ethnic differences in environmental factors, such as
smoking, diet, and alcohol consumption, that may impact on disease susceptibility
or severity. This is the definition of matched for most studies and no further matching
applies. However, further matching may be appropriate where there are subpopulations
(population stratification) within the study group, or where there is a preponderance
of males or females within the disease group. For example, many forms of autoimmune
disease are more common in women than in men and we could argue that this differ-
ence should be reflected in the control population. Body mass index may also be a factor.
Similar caveats apply when selecting controls for studies of diseases in children or diseases
in the elderly population. However, though matching controls and taking the age of onset
into account is desirable, it is not always practical, particularly in children.
Some studies, such as the WTCCC1 study, which is referred to throughout this book, used
a common control set. However, even this can be criticized because there is no consideration
of latent onset of disease in the control cohort. Indeed, we may question the whole idea of
a healthy control. It is quite common to find studies that have used a workplace population
as controls (Table 3.4). In these circumstances this population may be very different from
the general population, reflecting bias in terms of the socioeconomic group from which the
controls come. Finally, on some occasions there will be no healthy controls; instead, there
may be patient subgroups where each subgroup has a different disease phenotype.
Number of controls
General advice is that the number of controls should be at least the same as the number of
patients. If possible, the number of controls may exceed the number of patients provided
Health                   Negative for the disease under study at time of recruitment; cannot control for latent onset
Race/ethnicity           Important to reduce effects of population stratification
Age of onset             Age of onset is especially important if the illness is likely to have a high rate of mortality and is common
Sex                      Male/female differences are common in many common diseases
Socioeconomic factors    In some populations this is more important than others and populations are stratified on this basis
the number is not excessive. As a rough guide, a control sample of up to but not
exceeding five times the patient sample size is considered acceptable. The upper limit is
important because the greater this number becomes, the greater the possibility of intro-
ducing unmatched controls and bias into the control sample.
Allele frequency
When selecting arrays it is important to consider the frequency of the alleles under test.
Alleles that are present at low frequencies (a low minor allele frequency) are less informative in association
analysis and it is important to filter out any markers that are unlikely to be informative.
The reason for this is that the statistical power of a study depends partly on the allele fre-
quency and is lower when allele frequencies are low. Previously, the threshold chosen to
exclude SNPs and other markers was 5%, but in 2015 as much larger sample collections
have become available, SNP frequencies of 1% or lower are being included in studies. This
addresses some of the concerns that have been raised about excluding rare polymorphisms,
which suggested some associations may be missed and therefore exclusion of rare SNPs
may be counter-productive.
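The effect of the frequency threshold can be shown with a small filter. The sketch below, in Python, calculates the minor allele frequency from genotype counts at a biallelic SNP and keeps only the SNPs at or above a chosen threshold. The genotype counts and SNP names are hypothetical and the function names are illustrative.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB) at a biallelic SNP."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1 - freq_a)

def filter_by_maf(genotype_counts, threshold=0.01):
    """Keep SNPs whose minor allele frequency is at or above the threshold."""
    return {snp: counts for snp, counts in genotype_counts.items()
            if minor_allele_frequency(*counts) >= threshold}

# Hypothetical genotype counts (AA, AB, BB) in 2000 individuals.
genotype_counts = {
    "snp1": (1280, 640, 80),  # minor allele frequency 0.20
    "snp2": (1882, 116, 2),   # minor allele frequency 0.03
    "snp3": (1980, 20, 0),    # minor allele frequency 0.005
}
kept_1pct = filter_by_maf(genotype_counts, threshold=0.01)
kept_5pct = filter_by_maf(genotype_counts, threshold=0.05)
print(sorted(kept_1pct), sorted(kept_5pct))
```

With the older 5% threshold only the first SNP survives; lowering the threshold to 1% retains the second SNP as well, which illustrates how the change in practice brings rarer polymorphisms into the analysis.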
• Where there are multiple alleles tested, appropriate statistical correction (thresholds)
must be applied and the corrected probabilities (Pc) shown. All of the tests performed
should be included in the consideration of the appropriate correction factor.
Confidence intervals (CIs) should be shown as standard.
• If possible, closely linked loci should be investigated to provide haplotype data and
define regions of linkage disequilibrium. Linkage disequilibrium is a very useful tool
in confirming genotypes and haplotypes for genes with multiple polymorphisms,
such as within the MHC.
Guidance on strategy is not included here as a specific point. The choice we make when
planning these studies is made up of a cocktail of information. The plan (phase 2) is also
based on that information, but guided by some of the considerations in this box.
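The points above about corrected probabilities and confidence intervals can be illustrated as follows. This is a minimal sketch in Python, assuming a Bonferroni correction (uncorrected P multiplied by the number of tests performed) and a Woolf log-based 95% CI for the OR; other corrections and interval methods exist, and all counts shown are hypothetical.

```python
import math

def corrected_p(p_value, n_tests):
    """Bonferroni-corrected probability (Pc), capped at 1."""
    return min(1.0, p_value * n_tests)

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-based) 95% CI for a 2x2 table.

    a, b = cases with and without the allele;
    c, d = controls with and without the allele. All counts must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical: 20 alleles tested at one locus, uncorrected P = 0.004.
pc = corrected_p(0.004, n_tests=20)
odds, lower, upper = odds_ratio_ci(a=60, b=40, c=30, d=70)
print(f"Pc = {pc:.2f}, OR = {odds:.1f} (95% CI {lower:.2f} to {upper:.2f})")
```

In this example an uncorrected P of 0.004 becomes a Pc of 0.08 once all 20 tests are counted, so the result would no longer be significant at the 5% level. The CI shows how precisely the OR has been estimated, which a P value alone does not convey.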
been considerable discussion of publication bias and some effort to encourage journals
to be more accepting of good quality studies that carry negative data. In addition, online
publication of such studies is becoming increasingly popular, creating a pathway through
which this bias can be reduced.
[Figure 3.11 diagram: disease and control cohorts enter association analysis round 1 (P < 10^-7 or P < 10^-5), followed by replication round 1 (alternatively at P < 10^-3 to 10^-5); if round 1 used P < 10^-5, then P < 10^-7 would be preferred for replication in round 2; the final stage is testing in other populations.]
Figure 3.11: Schematic illustration of the study design of a GWAS and the follow-up studies, including
fine mapping and functional investigations. The study employs a multistage approach for replication. First, the
entire genome is scanned for potential associations using approximately 500,000 SNPs and selected significance
thresholds (P < 10^-5 or P < 10^-7). Then replication can be applied either at a significance threshold of P < 10^-7 or sometimes
at a threshold between P < 10^-3 and P < 10^-5. Re-testing in a second case-control cohort, focusing on the alleles
identified in the first analysis, confirms whether or not the associations are valid. In most cases, third-step
validation is then applied focusing on other populations if possible. Follow-up studies include fine mapping,
focusing on chromosomal regions, haplotypes, and genes with potential causal links with the disease. The final step
once genes and regions have been identified is to perform functional studies linking phenotype to genotype, and
to determine the relationships between phenotypes and specific systems that may explain the disease.
New developments will come from the ENCODE project, and will also
involve more epigenetics and imputation analysis
The main results of the ENCODE project were published in 2012 (http://www.genome.gov/10005107).
ENCODE, the Encyclopedia of DNA Elements, was started in 2003. It was
designed to identify all functional elements in the human genome. Most of the human
genome has been labeled as junk DNA, but this is not quite correct. ENCODE set out
to look at the functional elements only, thereby restricting the workload to manageable
proportions.
In today’s world, genotyping for all but the most complicated genetic systems is mostly
being outsourced. Few laboratories now perform all of their genotyping in-house and the
commercial cost of genotyping is falling. The 1000 Genomes Project is coming to a close
(with much of the preliminary data available now) and many centers will use whole-
genome sequencing in the future rather than SNP genotyping by GWAS, especially as
the cost is reduced. In addition, epigenetics offers a different look at complex disease
and imputation analysis offers an easier way to look at the genome at high resolution.
Currently, few studies have looked at systems, but the indication from some of the more
studied diseases is that systems analysis offers a way forward.
It is important to keep our feet on the ground in the face of these massive developments.
Studies still depend on good quality collections of patients and appropriate controls.
However, everything else has changed or is changing. Increasingly, studies are collabora-
tive rather than being based on a single center. This has obvious benefits in terms of qual-
ity, but it can introduce problems too.
The real debate about the future of complex disease research lies not
in the genetics itself, but downstream from the genetics
Genetics only provides one piece of the jigsaw. The whole picture can only be achieved
by considering the genetics, the systems that risk alleles map to, and the environment.
Systems biology provides the basic science for the biological approach. The environment
is more difficult to tackle. Genes and the environment interact through systems. Consider,
for example, the impact of lipids in the diet and the way in which lipid genes may play a
role in obesity through regulating lipid metabolism. However, it is not so simple. Genes
interact, often in different directions. There is also redundancy in biology to allow for
biological failure or breakdown and also for the impact of the environment. Redundancy
exists in the environment too, for example there is more than one source of most essential
food elements.
A reductionist approach will not be sufficient to understand human disease, because
human biology is complex. There is interaction and redundancy within systems (as stated
above). Thus, deconstructing a genetic model can be interesting, but it does not necessar-
ily provide a clear indication of the relationship between genotype and phenotype.
Epistasis or gene–gene interaction is essential in biology and is increasingly spoken of in
the genetics of complex disease. Redundancy is one example of interaction (albeit nega-
tive) that illustrates this point. Redundancy can be seen in many systems through gene
duplication where multiple copies of genes exist or through more subtle dual coverage,
whereby there are two expressed isoforms of a gene with overlapping functions. It is pos-
sible that this redundancy or genetic compensation allows for some, but not all situations.
For example, the functionally expressed complement gene C4B may be able to compen-
sate for the deleted C4A gene except when the system is stressed, at which point the
absence of C4A may create a dysfunctional environment, permissive for the accumulation
and deposition of immune complexes in the small blood vessels, kidneys, and joints lead-
ing to intense outbreaks of inflammation as seen in SLE.
Conclusions
Following the completion of the Human Genome Mapping Project, HapMap, and SNP
Map, designing studies in complex disease has become relatively easy. However, despite all
of the technological advances, the current options in study design in complex disease are
still linkage and association analysis. The real difference between the past and the present
is one of scale. At one end, it is possible to concentrate on a single gene; at the other, it is
possible to search the whole genome. There are pros and cons for both of these options,
and a number of factors influence the choice of which to apply.
There are fewer restrictions now than there were in the past. In the post-genome era there
is a wealth of new information and technology to latch on to. Genome-wide association
analysis has been the most favored form of study for the past few years. Unrestricted
by prior knowledge and more-or-less hypothesis free, it has given researchers the freedom
to roam across the genome screening for polymorphisms with relatively small effects.
Multiple GWAS have led to the accumulation of huge DNA banks, to collaboration on a
much greater, global scale, and to meta-analyses. This has produced some very significant
and reproducible, if small, associations. With the right information at the beginning, a good
study plan, and selection of the most appropriate method, which can be a low-technology
candidate gene association study or a full-blown GWAS, the results can be very promis-
ing. However, caution needs to be applied when designing studies. Good planning will
produce good quality results; poor planning leads to irreproducible results and wastes
resources. The lessons of the past are well worth consideration. The literature from the
1960s and 1970s onwards is full of unconfirmed associations. However, we should not
throw the baby out with the bathwater, as the saying goes. Some of the most important
genetic associations in complex disease also date back to the 1960s and 1970s. Some would
even go so far as to argue that interest in the genetics of complex diseases, especially those
where there are no multiplex families, would never have been considered without these
early studies.
Careful consideration at the early stages of any research project will help to identify poten-
tial pitfalls and also identify potential new avenues for research. Resources, both financial
and material, are limited. It is therefore essential to make the most out of genetic studies.
The aim should be for high-quality, reproducible results. Choice of options depends on
indicators of a genetic component in the disease (such as concordance in twins and sibling
relative risk), the available materials (mostly DNA banks), access to patients and their collaboration,
prior knowledge of the disease pathology, and any results of previous studies.
Further Reading
Books
Frank L (2011) My Beautiful Genome: Exposing our Genetic Future, One Quirk at a Time. Oneworld.
Spector T (2012) Identically Different: Why You Can Change Your Genes. Weidenfeld & Nicholson.
Strachan T & Read AP (2011) Human Molecular Genetics, 4th ed. Garland. Chapter 16 deals with identifying human genes and susceptibility factors and provides excellent additional data for students wishing to delve deeper into medical genetics. There is also additional relevant information in the other chapters.
Articles
Brewerton DA, Hart FD, Nicholls A et al. (1973) Ankylosing spondylitis and HL-A27. Lancet 301:904–907. One of the earliest reported genetic associations with HLA, and an association that has stood the test of time despite changes in HLA phenotyping and genotyping methods and nomenclature (note HL-A27 is now correctly referred to as HLA-B*27). This illustrates the point that not all genetic associations are recent and not all early reports failed to be replicated.
Bhugra D (2005) The global prevalence of schizophrenia. PLoS Med 2:e151.
Cardon LR & Bell JI (2001) Association study designs for complex diseases. Nat Rev Genet 2:91–99. This is an excellent review of association studies of complex disease. The timing of this review reflects the immediate post-genome era and pre-GWAS era, providing an interesting historical insight into the genetics of complex disease.
Colhoun HM, McKeigue PM & Smith GD (2003) Problems of reporting genetic associations with complex outcomes. Lancet 361:865–872. This work provides an excellent guide and critique of the common problems in studies on the genetics of complex disease. It is an informative and well-scripted paper, and recommended for students wishing to develop their critical skills.
Collins FS & McKusick VA (2001) Implications of the human genome project for medical science. JAMA 285:540–544. This is a very important review that provides the reader with a guide to and an illustration of the potential for understanding complex disease in the immediate post-genome period.
Davies JL, Kawaguchi Y, Bennett ST et al. (1994) A genome-wide search for human type 1 diabetes susceptibility genes. Nature 371:130–136. This is possibly the first genome-wide search in complex disease and one which used very different techniques compared with the large-scale GWAS of today.
Hallmayer J, Cleveland S, Torres A et al. (2011) Genetic heritability and shared environmental factors among twin pairs with autism. Arch Gen Psychiatry 68:1095–1102.
Hirschhorn JN & Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6:95–108. This review puts the idea of GWAS into context, and provides useful insight into how such studies are planned and executed.
Johnson GCL & Todd JA (2000) Strategies in complex disease mapping. Curr Opin Genet Dev 10:330–334. This is an alternative review of the issues in planning studies of genetic associations in complex disease. The paper provides interesting insight into the future development of GWAS, unachievable in 2002, but since realized.
Kieseppa T, Partonen T, Haukka J et al. (2004) High concordance of bipolar I disorder in a nationwide sample of twins. Am J Psychiatry 161:1814–1821.
Satsangi J, Parkes M, Louis E et al. (1996) Two-stage genome-wide search for inflammatory bowel disease provides evidence of susceptibility loci on chromosomes 3, 7 and 12. Nat Genet 14:199–202. This paper describes one of the first genome-wide studies in complex disease.
The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.
The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65.
The Human Genome (2001) Nature 409:813–958. The complete issue of Nature mostly dedicated to the HGM with multiple papers, letters, and editorial commentary from multiple authors. It is full of useful insight and critical discussion. Any student of human genetics should read this issue.
The International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437:1299–1320.
The International HapMap 3 Consortium (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467:52–58.
The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678. This is another landmark paper highlighting the development and application of new technologies (GWAS) in complex disease research, and it provides a mass of useful information for those studying complex disease. This study provides some of the basic information for students studying Crohn’s disease as well as a further six diseases not discussed in this chapter.
Wang WY, Barratt BJ, Clayton DG & Todd JA (2005) Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6:109–118. This early discussion of the practicalities of GWAS indicates the way in which studies of genetics of complex diseases were about to develop and provides some critical insight into the thinking behind this now widely adopted strategy.
Willer CJ, Dyment DA, Risch NJ et al. (2003) Twin concordance and sibling recurrence rates in multiple sclerosis. Proc Natl Acad Sci USA 100:12877–12882.
Online sources
http://hapmap.ncbi.nlm.nih.gov
This site is a useful source for the results of the International HapMap Project.
http://www.ncbi.nlm.nih.gov/omim
This is an essential site for updates and information on genetic disease of all types.
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi
This site provides information about reference SNP clusters in the genome.
http://www.ncbi.nlm.nih.gov/SNP
This is an essential site for updates on SNPs.
http://www.genome.gov/10005107
The ENCODE project: a useful site from which information is freely available.
We have already seen that complex diseases do not lend themselves to simple analysis, and
by definition we are dealing with mutations and polymorphisms that increase or reduce
the risk of a trait. It is also clear that there may be complex interactions between groups
of risk alleles (epistasis), and between risk alleles and the environment. All of the above
suggest that identifying genetic components in complex disease is and will be a difficult
task, and yet it is a task that has attracted a considerable amount of attention in biomedical
research. The potential rewards from this research are great and the technological advances
made over the past 20 years have been astounding.
In this chapter, we will consider the question of why we invest so much time and so much
of our resources to identify risk alleles in complex disease. We will use five key examples
of complex disease to illustrate these: ankylosing spondylitis, cardiovascular disease, hep-
atitis C virus (HCV), rheumatoid arthritis, and bipolar syndrome or bipolar disorder.
These five examples illustrate a host of key points, but they are selected here specifically
to illustrate the potential to use genetics in the differential diagnosis in disease (ankylos-
ing spondylitis), in understanding the response to infection and potentially planning for
patient management (HCV), and also to increase our knowledge of disease pathogenesis
(cardiovascular disease, rheumatoid arthritis, and bipolar disorder). These are complex
issues as the text will show, but the central principles are simple.
can proceed to answer the basic question. Why investigate the genetics of complex disease?
This is not pessimism, it is realism and that "realism" is essential in science.
Part of understanding the reasons for the mass of research in complex disease is born
out of an appreciation of the diseases themselves. Complex diseases are common: if we
include cancer in the count they may account for as much as 95% of all forms of human
disease. If we exclude cancer then the figure is likely to be closer to 75%. Not all cancers
are complex; some are Mendelian and some are mosaics that develop during the life of an
individual, that is, they are not inherited from the parents. However, some mosaics can be
stimulated into production by inherited genetic variations. Cancer, as we shall see later in
the book, is a difficult area in complex disease. Overall, however, complex diseases, whether
75% or 95% of all forms of human disease, still have a major social and economic impact.
Understanding genetics helps us to understand disease. For example, we currently know
very little about the pathogenesis of many common diseases. This is because patients pres-
ent relatively late in the disease process, sometimes long after the disease-initiating pro-
cesses have done their work. Trying to understand disease pathogenesis by looking at late
stages of a disease can be like looking for fingerprints in a room where the criminal wore
gloves. However, genetics can help. As individuals, our whole genome does not change,
while the clinical symptoms of diseases do. There are some changes in specific cells due to
mutations during aging and in some cancers, but these are not present in all cells.
Figure 4.2: Clinical outcomes of the HGP. The figure illustrates the potential clinical
outcomes (promises) of the HGP in medical sciences and the way in which the three main
promises are interrelated. The outcomes are: to aid diagnosis, to help with patient
management and care (including development of new therapies), and to improve the
understanding of disease pathology. All are achievable to some extent, though some will take
longer than others. [Figure labels: Human Genome Project; Diagnosis; Pathogenesis;
Treatment and care.]
genetics in complex disease (below). The idea of individualized therapy and identifying
key pathways in disease pathogenesis are those that have been most achievable so far, but
ultimately all of these promises are linked. It is important to consider not only the data
that studies produce, but also the application of the data in terms of meeting these expec-
tations or promises.
The problem with HLA genes is that they reside within the major histocompatibility
complex (MHC), which is so-called because of the role many of the genes encoded
there play in determining compatibility of tissues. A huge interest in HLA grew out of
the practical need to find matched organs for transplantation, particularly kidney trans-
plantation. In studying complex disease, the HLA genes are the most problematic. There
are two reasons for this: they are the most variable genes in the human genome and they
have the highest levels of linkage disequilibrium in the human genome. HLA antigens
are not expressed on red blood cells. HLA typing was initially performed using serum to
detect antigens on human white cells (leukocytes—hence the name “human leukocyte
antigens” for the genes). The HLA phenotypes, as they were correctly called, identified
a range of different antigens on four different HLA loci: HLA-A, HLA-B, and HLA-C
(all class I loci), and HLA-DR (a class II HLA gene).
The quality of early studies was based on the availability of banks of sera to identify a range
of antigens. Rare antigens were often missed or not tested and many remained undiscov-
ered until the 1990s. Later studies were able to capitalize on the introduction of molecular
genetics, first restriction fragment length polymorphism (RFLP) and then polymerase
chain reaction (PCR), to genotype the HLA genes. The introduction of molecular genet-
ics led to the discovery of other HLA genes, in particular the DQ and DP family.
As methods changed, so the names changed. In the 1970s and 1980s, the genes were
called HLA A, B, Cw, and DR, with the antigens identified in numbers without a space
(e.g. A1) and roman script used. In the 1990s, the names were changed again, e.g. HLA-
A, HLA-B, and HLA-C (the “w” was dropped from the name for all HLA-C antigens and
genes). For each allele, the gene name is followed by an asterisk (*) and then a number
starting with 01, e.g. HLA-A*01. If a variant of this allele family is known, then this is
labeled with an additional colon (:) and two more numbers, e.g. HLA-A*01:01. The
alleles are all given in italics.
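The naming convention described above can be made concrete with a short parsing sketch. This is a minimal illustration only: the function name `parse_hla`, the `HlaAllele` type, and the regular expression are our own assumptions, and the pattern covers just the two fields described in the text (allele family and variant), not every form used in full HLA nomenclature.

```python
import re
from typing import NamedTuple, Optional


class HlaAllele(NamedTuple):
    gene: str               # e.g. "A", "B", "C", "DRB1"
    family: str             # digits after the asterisk, e.g. "01"
    variant: Optional[str]  # digits after the colon, e.g. "01"; None if absent


# Assumed pattern: "HLA-" + gene name + "*" + allele family + optional ":" + variant.
_ALLELE_RE = re.compile(r"^HLA-([A-Z]+[0-9]*)\*(\d{2,})(?::(\d{2,}))?$")


def parse_hla(name: str) -> HlaAllele:
    """Split a name such as 'HLA-A*01:01' into gene, family, and variant."""
    match = _ALLELE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised HLA allele name: {name!r}")
    gene, family, variant = match.groups()
    return HlaAllele(gene, family, variant)


if __name__ == "__main__":
    for example in ("HLA-A*01", "HLA-A*01:01", "HLA-DRB1*04:01"):
        print(example, "->", parse_hla(example))
```

Older serological names such as A1 are deliberately rejected by this sketch, since they follow the earlier convention and do not contain an asterisk.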
Two additional essential facts are that unlike HLA-A, -B, and -C, which encode only a
single α-polypeptide, the HLA class II antigens are the product of two expressed HLA
genes: an A gene and a B gene encoding an α-peptide and a β-peptide, respectively. Both
are found close together on chromosome 6p21.3. In all cases, these genes are polymor-
phic, but this polymorphism is limited in the case of the DRA gene.
One of the most difficult issues relating to the MHC is the extreme linkage disequilib-
rium between the alleles encoded within the region. This can make dissecting the details
about which candidate gene may be responsible for increasing the risk difficult. It is
also possible that the conserved combinations indicate multiple risk alleles on the same
haplotype. The MHC plays a central role in complex disease and cannot be dismissed.
More details about the MHC are given in Chapter 6 and associations with disease appear
throughout this book.
for ankylosing spondylitis to occur, then the disease would be found in a much greater
proportion of the population and all ankylosing spondylitis patients would be expected
to have HLA-B*27. In practice, only a small proportion of HLA-B*27-positives develop
ankylosing spondylitis; this is an example of incomplete penetrance (see Chapter 2). This
departure from the expected Mendelian norm is exactly what we expect to find in genetically
complex disease, because complex diseases do not conform to Mendelian patterns of
inheritance. However, this does not mean that it is impossible to achieve the goal of using
risk alleles as a diagnostic tool. Associated alleles can be used provided the limitations are
correctly understood. In ankylosing spondylitis, HLA-B*27 testing can be helpful in making
a differential diagnosis as the simple illustration in Figure 4.3 shows.
Figure 4.3: Use of genetic tests in differential diagnosis in complex disease. The patient illustrated has two
potential diagnoses. For each diagnosis, genetic tests can help to determine which diagnosis is more likely to be
correct. This is called differential diagnosis and is widely used, and genetics is likely to play an increased role in
clinical practice. However, the genetic test is not being used here to diagnose the disease per se, but to add weight
to the likelihood of one or other diagnoses. Note this example is not a Mendelian disease, but a complex disease.
[Figure labels: Confirms diagnosis A; Diagnosis B; Refutes diagnosis B.]
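The idea that a risk allele adds weight to one diagnosis without being diagnostic in itself can be sketched with Bayes' theorem. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any study, and the function name is our own.

```python
def posterior_probability(prior: float,
                          carrier_freq_if_disease: float,
                          carrier_freq_otherwise: float) -> float:
    """Probability of the disease given that the patient carries the allele."""
    numerator = prior * carrier_freq_if_disease
    denominator = numerator + (1.0 - prior) * carrier_freq_otherwise
    return numerator / denominator


# Hypothetical values for illustration only.
CARRIER_FREQ_IN_DIAGNOSIS_A = 0.90  # assumed allele frequency among true cases
CARRIER_FREQ_OTHERWISE = 0.10       # assumed allele frequency among everyone else

# Differential diagnosis: the clinician already suspects diagnosis A (prior 50%).
in_clinic = posterior_probability(0.50, CARRIER_FREQ_IN_DIAGNOSIS_A,
                                  CARRIER_FREQ_OTHERWISE)

# Population screening: the disease is rare (prior 1%), so most carriers are healthy.
in_population = posterior_probability(0.01, CARRIER_FREQ_IN_DIAGNOSIS_A,
                                      CARRIER_FREQ_OTHERWISE)

if __name__ == "__main__":
    print(f"suspected case:     {in_clinic:.2f}")
    print(f"general population: {in_population:.2f}")
```

Under these assumed figures the same test result is informative when the prior suspicion is already high, but tells us little in an unselected population, which is the practical consequence of incomplete penetrance.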
Figure 4.4: Risk portfolios in complex disease. The figure illustrates risk alleles numbered 1–8 in the circles
on the left and four cases in square boxes labeled A–D. Not all cases may have the same risk alleles. This figure
illustrates the idea that individual cases may have different risk portfolios, i.e. groups of risk alleles (shown in the
black boxes). Here, case A has 1–3–5–8, case B has 2–4–7, case C has 1–3–4–7, and case D has 1–2–5–6–8. Note
that each case has a different set of risk alleles and some cases have more risk alleles than others, e.g. B has three
alleles, D has five alleles, while A and C each have four alleles. Note that whatever the risk portfolio, the sum of
the risk is unlikely to be 1. These are complex diseases, thus being a carrier of a group of risk alleles will be neither
necessary nor sufficient for the disease to occur.
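The risk portfolios listed in the caption to Figure 4.4 can be written down as sets, which makes the comparisons in the caption easy to check. This is only a sketch of the bookkeeping; the variable and function names are our own.

```python
# Risk portfolios for cases A-D, copied from the caption to Figure 4.4.
portfolios = {
    "A": {1, 3, 5, 8},
    "B": {2, 4, 7},
    "C": {1, 3, 4, 7},
    "D": {1, 2, 5, 6, 8},
}

# Number of risk alleles carried by each case.
sizes = {case: len(alleles) for case, alleles in portfolios.items()}

# Risk alleles carried by every case (empty here: no allele is necessary).
shared_by_all = set.intersection(*portfolios.values())

# Every risk allele seen in at least one case.
seen_in_any = set.union(*portfolios.values())


def shared(case_x: str, case_y: str) -> set:
    """Risk alleles that two cases have in common."""
    return portfolios[case_x] & portfolios[case_y]


if __name__ == "__main__":
    print("portfolio sizes:", sizes)
    print("shared by all cases:", shared_by_all)
    print("A and C share:", shared("A", "C"))
```

No single allele appears in all four portfolios, which is the point of the figure: carrying any one risk allele is neither necessary nor sufficient for disease.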
It is very easy in a research laboratory to forget that patients are individuals. In a labora-
tory, patients can become numbers within a population. We tend to lose sight of the fact
that each individual patient is different and we tend to assume that all cases with the same
disease are exactly alike. This is not true. A disease group is a collection of patients with
the same signs and symptoms. However, there is usually a great deal of variability in terms
of age of onset, severity, response to treatment, and prognosis.
outcome following HIV-1 infection and are discussed in Chapter 7. Response to HCV
has also been found to be strongly associated with polymorphism in the HLA‑DQB1
gene. This association is thought to be primarily with the DQB1*03:01 allele. The
DQB1*03:01 allele is in linkage disequilibrium with both DRB1*04 and DRB1*11. In
the UK DQB1*03:01 is most often found with DRB1*04 and in Southern France it is
most often found with DRB1*11. The DQB1*03:01 allele has been associated with a
higher incidence of self-limiting (acute) HCV infection compared to chronic HCV infec-
tion (Figure 4.5). Further studies have also shown that this polymorphism is associated
with a stronger host T cell response to synthetic HCV peptides. The clinical significance
of this observation is that HCV infection is very common and the majority of cases do
not spontaneously clear the virus. Most cases (50–80%) will develop a chronic infection
that over a long period of time can progress to liver fibrosis, liver failure, liver cancer, and
death. Interestingly, a significant proportion of cases do not respond to treatment.
Patients with progressive disease may eventually require liver transplantation.
From a diagnostic perspective this association is of limited value. DQB1*03:01 is com-
mon (about 30% of the UK population have a DQB1*03:01 allele), but from a biological
perspective it is interesting. What is it about DQB1*03:01 that enables the majority of
those who carry this allele to clear the virus? In other words, what is DQB1*03:01 doing or
not doing? Information about this allele and how it works may be used in clinical practice
to develop a vaccine or to manipulate the host response to the virus in such a way that it is
spontaneously eliminated. At the time of writing, new treatments in HCV are having very
significant effects so the potential use of host genetics may change. More recently, there
have been quite a number of studies identifying genetic variations that are associated with
the outcome following HCV infection and response to treatment. These have identified a
number of interferon (IFN) genes, among others, that may prove important in develop-
ing new personalized therapy plans for patients with HCV and other infectious diseases.
Figure 4.5: HLA class II determines outcome following infection with HCV. The figure shows real data from
two different studies. Both studies reported genetic associations with HLA in two different patient groups: those
who developed chronic infection and those who had an acute (self-limiting) infection following HCV infection.
A group from the UK and one from France each reported a lower frequency of the DQB1*03:01 allele in patients
with chronic HCV infection. In the UK population, this allele is most often carried on the DRB1*04 haplotype
(data for DRB1*04 shown); in France, this allele is found more frequently with DRB1*11:01 (data not shown). The
important observation is that acute infection is associated with DQB1*03:01 in both populations.
[Bar chart: y-axis, Frequency (axis ticks 10 to 70); bars for Acute and Chronic infection.]
Figure 4.6: Estimated donor organ availability versus demand. The figure illustrates the difference between
donor organ availability in the UK and need for four major organs in 2013. Note that the available figures are
constantly updated. These figures illustrate a major clinical issue, which is that in most years only the donor
numbers for livers seem to match their target. In some years there is even a deficit in liver donors. A large number
of the transplanted kidneys are from matched, related donors. Approximate donor figures are based on the
previous year’s transplant figures as there is no way to know the actual number of donors for the year ahead. On
a global scale, the donor deficits are much more pronounced and there is a serious debate over the advantages
of opt-in versus opt-out systems.
[Chart column labels: UK waiting list; Approximate number of donors.]
Figure 4.7: Disease progression and risk portfolios. This figure shows the potential to apply genetic testing to
identify patients at risk of rapid progression versus those with slow or moderate progression. Such information
could be used to determine the best course of treatment and management for individual cases. The use of
genetics in this way may also maximize the use of key resources, such as donor organs for transplantation.
[Figure labels: Patient B; Rapid progression; Slow progression.]
to use genetic profiling to determine the risk portfolio for the two patients. This genetic
risk portfolio could then be used to estimate the likely rate of progression for each patient,
and this information could inform the decision about which patient (A or B) to list for
transplantation. This would lead to a more rational use of a precious resource (transplant
donors) and would benefit both patients, as the unlisted patient could be listed for
transplantation later.
with a significant effect. This means that even markers for risk alleles associated with
small effects are of value. Consequently, it is in this arena that the majority of studies
have been focused, and there have been some notable successes. Although there is still
much more to come, we do not need to look back to the pre-genome era to find good
examples.
The Wellcome Trust Case Control Consortium (WTCCC1) study published in 2007
included seven diseases: Crohn’s disease (considered in Chapter 2), type 1 diabetes (T1D)
and type 2 diabetes (T2D) (both included in Chapter 10), rheumatoid arthritis, bipolar
disorder, coronary artery disease, and hypertension. There had been considerable work
prior to 2007 in each of these diseases but in each case the WTCCC1 study represented a
milestone in genetic investigation, though not necessarily the first genome-wide associa-
tion study (GWAS) in each case.
The WTCCC1 study confirmed strong associations (i.e. those meeting a significance threshold
of P < 10⁻⁷) with the MHC region and PTPN22 in rheumatoid arthritis, and identified novel
significant associations with rs420259 on chromosome 16p12 in bipolar disorder and
with rs1333049 on chromosome 9p21 in cardiovascular disease. In addition, the study
identified nine moderate associations (i.e. those meeting a significance threshold of
P < 10⁻⁵) with rheumatoid arthritis, 13 with bipolar disorder, and six with hypertension.
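The two significance thresholds quoted for the WTCCC1 study can be expressed as a simple classification rule. This is a sketch under the assumption that "moderate" means a P value below 10⁻⁵ that does not also meet the strong threshold; the function name and the label used for everything else are our own.

```python
STRONG_THRESHOLD = 1e-7    # threshold quoted in the text for strong associations
MODERATE_THRESHOLD = 1e-5  # threshold quoted in the text for moderate associations


def classify_association(p_value: float) -> str:
    """Label an association by the thresholds quoted for the WTCCC1 study."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P value must lie between 0 and 1")
    if p_value < STRONG_THRESHOLD:
        return "strong"
    if p_value < MODERATE_THRESHOLD:
        return "moderate"
    return "below threshold"


if __name__ == "__main__":
    # Hypothetical P values, for illustration only.
    for p in (3e-9, 4e-6, 2e-3):
        print(f"P = {p:g}: {classify_association(p)}")
```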
In the remainder of the present chapter we will consider four diseases: ankylosing spon-
dylitis, rheumatoid arthritis, bipolar disorder, and cardiovascular disease. Three of these
are from this list of seven examples in the WTCCC1 study above, and will be considered
in detail to illustrate the concept that genetics can help us understand disease pathology.
Ankylosing spondylitis was not considered in the WTCCC1 study, but has been consid-
ered more recently in the WTCCC2 study. We will look at early pre-GWAS, early GWAS
(2007), and more recent studies for each disease. Regarding the other diseases in the
WTCCC1 study, Crohn’s disease is discussed in Chapter 2, T1D and T2D are discussed
in Chapter 10, and hypertension is not discussed, though many other diseases and traits
are discussed throughout the book.
Figure 4.8: Molecular mimicry in ankylosing spondylitis. The figure illustrates one of the major theories
behind the genetic association of ankylosing spondylitis with HLA-B*27. The pathogenic peptides (represented by
the dark gray smiley face with a hat) have sequences that are very similar to the host (represented by the smiley
white face without a hat).
Later GWAS have offered even further insight into the biology of
ankylosing spondylitis
The WTCCC2 study and others, including the International Genetics of Ankylosing
Spondylitis Consortium (IGAS), have identified at least 25 risk alleles for ankylosing spon-
dylitis, all of which are linked to immune function. The IGAS study alone found 25 risk
loci at a significance threshold of P < 5 × 10⁻⁷. Twelve of these had been previously
identified between 2007 and 2011 (including ERAP1, IL12B, and IL23R). ERAP1 (also known
as ARTS) stands for endoplasmic reticulum aminopeptidase 1. In addition, in the Han
Chinese population, two other loci have been reported: HAPLN1–EDIL3 and ANO6.
A large number of the 25 identified genes produce proteins involved in immune inflam-
matory activity, e.g. IL6R and IL23R. The impact of the cytokines and their receptors has
received considerable attention with interest in interleukin (IL)-23 and IL-17 interaction
in pathogenesis, and the development of potential anti-IL-17 therapy. It is possible that in
ankylosing spondylitis we are now quite close to reaching the stage where understanding
the genetic basis of disease pathology will lead to the development of individualized treat-
ment and better diagnosis based on knowledge of each individual’s genetic risk portfolio.
Figure 4.9: Prevalence of rheumatoid arthritis in the UK. The figure shows the prevalence (total number of
cases) in each age category for male and female patients from Symmons et al. (2002). Note that the prevalence
rises in each age group compared with the previous group and the greater prevalence in females compared with
males for all three age groups. (Data from Symmons D, Turner G, Webb R et al. [2002] Rheumatology 41:793–800.)
[Bar chart: y-axis, Prevalence (axis ticks 50,000 to 300,000); x-axis, Male and Female; age groups
16–44, 44–64, and 64+.]
anti-inflammatory mediators are present and activated they fail to adequately downregu-
late the immune response. Genetic studies in seropositive European cases have identified a
large number of risk alleles for rheumatoid arthritis and these alleles are mostly associated
with immune or immune-related function. In the sections below we will briefly consider
some of the early studies of rheumatoid arthritis and then concentrate on three recent
studies, all based on GWAS and meta-analysis of GWAS data.
The early history of rheumatoid arthritis, the MHC, and PTPN22
Evidence of a genetic component in rheumatoid arthritis is based on familial occurrence
and elevated sibling relative risks (λs) of approximately 10. Early studies identified strong
reproducible associations with the HLA genes on chromosome 6p21.3 and also with the
PTPN22 gene on chromosome 1p13, which encodes the lymphoid tyrosine phosphatase (Lyp),
also known as protein tyrosine phosphatase non-receptor type 22.
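The sibling relative risk (λs) is the risk of disease in the siblings of affected individuals divided by the prevalence of the disease in the general population. A minimal sketch of the calculation follows; the input values are hypothetical and were chosen simply so that the result is close to the value of 10 quoted above.

```python
def sibling_relative_risk(risk_in_siblings: float,
                          population_prevalence: float) -> float:
    """Lambda-s: risk to siblings of cases relative to the population prevalence."""
    if population_prevalence <= 0.0:
        raise ValueError("population prevalence must be greater than zero")
    return risk_in_siblings / population_prevalence


if __name__ == "__main__":
    # Hypothetical inputs: 1% of the population affected, 10% of siblings affected.
    lambda_s = sibling_relative_risk(risk_in_siblings=0.10,
                                     population_prevalence=0.01)
    print(f"sibling relative risk is approximately {lambda_s:.1f}")
```

A value well above 1 indicates familial clustering, although shared environment as well as shared genes can contribute to it.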
The MHC in rheumatoid arthritis
The WTCCC1 study identified HLA as the major risk region for rheumatoid arthritis. This
finding was not a surprise as the association had been described long before this. The genetic
association between rheumatoid arthritis and HLA is very neatly summarized by Gerald
Nepom in his chapter in the book HLA in Health and Disease (Lechler and Warrens, 2000)
(summarized here in Table 4.1). The primary associations in each case are with alleles from
the DRB1*04 family, with the exception of Native Americans, where the primary association
is with the DRB1*14:02 allele. All of these susceptibility alleles encode either the amino
acid sequence QKRAA or QRRAA (Q = glutamine; K = lysine; R = arginine; A = alanine)
at positions 70–74 of the DRβ polypeptide of the HLA-DR molecule. This shared epitope
is within the HLA-binding groove, at a site where antigenic side chains engage with the
HLA molecule. At the molecular level, the crucial consideration is the electrostatic charge
in and around the binding pocket. The electrostatic charge at each pocket and within the
groove will determine which peptide antigens are preferentially bound and presented to the
T cell receptor (see Chapter 6). Allelic variation in the molecular structure of the expressed
HLA molecule encoded by different HLA alleles may explain the genetic association in
rheumatoid arthritis, providing a platform for further work to identify arthritogenic
peptides and potentially develop novel therapies for the disease.
Table 4.1 HLA and shared epitopes associated with rheumatoid arthritis.
Lechler R & Warrens A (eds) (2000) HLA in Health and Disease, 2nd ed. Elsevier. With permission from Elsevier.
In addition, further work in rheumatoid arthritis suggests that there may be a gene dose effect.
Not all DRB1*04 family members carry the QKRAA or QRRAA sequence. Those that do carry
this sequence at positions 70–74 of DRB1, especially DRB1*04:01, DRB1*04:04, and
DRB1*04:05, all confer a greater risk of rheumatoid arthritis. The risk is higher in those
who carry two copies of this sequence (that is, they are homozygous) than in those who carry
only one copy (that is, they are heterozygous). Despite the strong association between HLA
and rheumatoid arthritis, the risk ratios listed indicate that HLA-DR genotyping would be of
very little value in the diagnosis of rheumatoid arthritis. However, it may be of some use as
a prognostic indicator. The DRB1*04 alleles are associated with more progressive severe
erosive disease.
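The shared epitope and the gene dose effect can be illustrated with a small sketch that counts how many of an individual's two DRB1 alleles encode QKRAA or QRRAA at positions 70–74. The function names are our own, the motif for each allele is supplied by the caller, and the placeholder string XXXXX stands for any other sequence; it is not a real motif.

```python
from typing import Tuple

# The two shared epitope sequences named in the text (DR-beta positions 70-74).
SHARED_EPITOPE_MOTIFS = {"QKRAA", "QRRAA"}


def carries_shared_epitope(residues_70_74: str) -> bool:
    """True if the five residues at positions 70-74 match a shared epitope."""
    return residues_70_74.upper() in SHARED_EPITOPE_MOTIFS


def shared_epitope_dose(genotype: Tuple[str, str]) -> int:
    """Number of copies (0, 1, or 2) of the shared epitope in a genotype."""
    return sum(carries_shared_epitope(motif) for motif in genotype)


if __name__ == "__main__":
    # Each tuple holds the 70-74 motif of the two DRB1 alleles of one person.
    # "XXXXX" is a placeholder for any non-shared-epitope sequence.
    examples = {
        "two copies": ("QKRAA", "QRRAA"),
        "one copy": ("QKRAA", "XXXXX"),
        "no copies": ("XXXXX", "XXXXX"),
    }
    for label, genotype in examples.items():
        print(label, "->", shared_epitope_dose(genotype))
```

A dose of 2 corresponds to the group described above as having the highest risk, and a dose of 1 to the group carrying a single copy.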
The story of rheumatoid arthritis is more complex than first thought
It is always nice to see a clear explanation for HLA associations with disease, such as the
shared epitope hypothesis above for rheumatoid arthritis. However, recent GWAS indicate
a more complex situation in rheumatoid arthritis with at least three different HLA genes
involved. DRB1 remains the strongest, but independent associations with HLA‑B*08 and
DPB1 have also been identified. In this analysis some elements of the shared epitope hypoth-
esis for DRB1 still stand, but the amino acids at positions 11, 71, and 74 are highlighted
as the most important, and potentially the only DRB1 amino acids with causal effects in
rheumatoid arthritis. This idea is supported by observations of significant, if weaker associa-
tions with amino acids at position 9 in the binding grooves of both HLA‑B*08 and DPB1.
This is a departure from the idea of a string of amino acids and focuses more on specific
isolated amino acids only. These studies involved a complex analysis with a mix of GWAS
using 3117 single nucleotide polymorphisms (SNPs) across the MHC in over 5000 cases
with seropositive rheumatoid arthritis and close to 15,000 controls. Imputation analysis
was used to identify the HLA alleles and haplotypes. Rheumatoid arthritis is not a rare ill-
ness and therefore these findings are likely to be either confirmed or challenged by follow-
up studies. It will also be interesting to see how direct sequencing impacts on these findings
and how imputation analysis compares with direct sequencing.
in the PTPN22 gene leads to lower Lyp expression, and thus to T, B, and dendritic cell
hyper-responsiveness, such as that seen in rheumatoid arthritis and T1D.
The genetic associations with the MHC and PTPN22 in rheumatoid arthritis are not
surprising as both are associated with a wide variety of other autoimmune diseases, includ-
ing T1D, systemic lupus erythematosus, and multiple sclerosis. In rheumatoid arthritis,
estimates suggest that together the MHC and PTPN22 account for as much as 50% of the
familial genetic risk of the disease.
Table 4.2 A short list of selected major risk genes for rheumatoid arthritis.
Where two sets of figures are presented, the top figures are from the WTCCC1 (2007) study (not italic) and the
lower figures are from Stahl et al. (2010) (italic). Where only one set is represented for a gene, not italic or italic
indicates the source as WTCCC1 or Stahl et al. respectively.
disease diagnosis and patient management, but they do point at the disease pathology. As
almost all of the risk alleles point to the immune system, this suggests that using genetic
associations to target and unpick specific immune pathways and/or correct defects in spe-
cific pathways may be the way forward. There are two elements to this: (1) identifying
pathways helps us understand the pathogenesis of this disease, and (2) by improving our
understanding of the disease, we may be able to produce better treatment and management
plans for patients. For example, we could consider the IL4–STAT6 pathway. These two genes
interact: IL-4 activates STAT6, a signal transducer and activator of transcription (type 6).
This pathway has been found to be activated in patients with both short- and long-term
rheumatoid arthritis. If those carrying risk alleles could be given a dietary supplement to
enhance or reduce the activity in this pathway, then maybe it would be possible to reduce or
even eradicate rheumatoid arthritis. This may seem overly optimistic considering the
statement that H. L. Mencken applied to seeking solutions to complex problems, and maybe we
are over-inclined to seek simple solutions, but equally the best solutions often turn out to
be simple.
Table 4.3 Weak (suggestive) genetic associations with rheumatoid arthritis from Stahl et al. (2010) and
listed on OMIM.
Figure 4.10: Overlap between risk alleles of specific genes and different diseases focusing on rheumatoid
arthritis. This diagram illustrates some of the overlap between associated genetic markers in rheumatoid
arthritis and other diseases. Overlap is common among complex diseases and it is helpful in identifying common
pathways in pathogenesis. Each of the major risk alleles or genetic regions identified in the circles is associated
with rheumatoid arthritis. In each case, other diseases that share some degree of overlap are illustrated, e.g.
IRF5 and systemic lupus erythematosus (SLE), and CD6 and multiple sclerosis (MS).
[Figure labels: genes and regions shown are MHC, PTPN22, CD6, IL2RA, IRF5, and IL2/IL21; overlapping
diseases shown are T1D, MS, SLE, Crohn’s disease, and celiac disease.]
Figure 4.11: ORs for six of the most strongly associated genetic risk markers for rheumatoid arthritis. The
figure shows the small size of the effect for most risk alleles as determined by the OR. Figures vary from study
to study, and indeed even within studies, and therefore it is advisable to consider these as approximations or
estimates rather than absolute values, see Box 2.2.
[Bar chart: y-axis, OR (axis ticks 0.5 to 3.0).]
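As a reminder of how an odds ratio (OR) is obtained and why it should be read as an estimate, the sketch below computes an OR and an approximate 95% confidence interval from a 2 × 2 table of carriers and non-carriers. The counts are hypothetical, and the interval uses the standard log-based (Woolf) approximation; individual studies may use other methods.

```python
import math
from typing import Tuple


def odds_ratio(case_carriers: int, case_noncarriers: int,
               control_carriers: int, control_noncarriers: int) -> float:
    """Odds of carrying the allele in cases divided by the odds in controls."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


def confidence_interval_95(case_carriers: int, case_noncarriers: int,
                           control_carriers: int,
                           control_noncarriers: int) -> Tuple[float, float]:
    """Approximate 95% interval for the OR, calculated on the log scale."""
    log_or = math.log(odds_ratio(case_carriers, case_noncarriers,
                                 control_carriers, control_noncarriers))
    standard_error = math.sqrt(1 / case_carriers + 1 / case_noncarriers
                               + 1 / control_carriers + 1 / control_noncarriers)
    return (math.exp(log_or - 1.96 * standard_error),
            math.exp(log_or + 1.96 * standard_error))


if __name__ == "__main__":
    # Hypothetical counts: 300 of 1000 cases and 200 of 1000 controls carry the allele.
    table = (300, 700, 200, 800)
    low, high = confidence_interval_95(*table)
    print(f"OR = {odds_ratio(*table):.2f} (95% CI {low:.2f} to {high:.2f})")
```

The width of the interval is one reason that ORs for the same marker differ from study to study, particularly when cohorts are small.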
The story of rheumatoid arthritis so far tells us that there are difficulties in reaching our
goals, but we should be optimistic about the possibilities of achieving at least one of the
aims of the HGP with regard to this disease, that of understanding pathways that lead to
pathogenesis. The problem is that we have more data than we can handle at present. In
one respect, it is reassuring that there are so many identified alleles because this will enable
us to explore multiple pathways and interactive networks. However, this takes time and
resources, and pathways and networks are complex. We need time to work on the data,
but more genetic data is piling up all around us as we work and it can be overwhelming.
The answer is currently out of our reach, but as we shall see in some of the other chapters
in this book, it is getting nearer by the day. It will take time to complete this exercise in
rheumatoid arthritis, but hopefully the final outcomes will be well worth the wait and the
investment in time and money.
Bipolar disease is a disease for which there are many weak genetic
associations, but few strong consistent associations
Bipolar disease is a manic depressive illness characterized by recurrent episodes of
disturbance in mood, ranging from extreme elation or mania to severe depression. The
pathogenesis of this disorder is poorly understood. There is clear evidence of a heritable
component, with a sibling relative risk of 7–10 and heritability estimated at 80–90%. Twin
studies have shown variable rates of concordance, with values of 33–75% for monozygotic and
0–13% for dizygotic twins. Differences in the diagnostic criteria applied, ascertainment
bias, and selection of controls all help to account for this variation. However, heritability
is not just about genetics; it involves many other factors, and thus high heritability does
not mean strong genetic associations.
Regions identified in early linkage and the 2007 WTCCC1 study
Several genome regions have been implicated in linkage studies (Table 4.4), and there is
clear evidence of overlap with genetic susceptibility to schizophrenia and bipolar disorder.
Gene/allele                                          Location
MAFD1                                                18p
MAFD5                                                2q22–q24
MHW1                                                 4p
MHW2                                                 4q
MAFD4 (region includes PALB2, NDUFAB1, and DCTN5)    16p12
MHW indicates mental health and welfare. MHW1 and MHW2 are based on linkage studies in the Amish. MAFD
is the name for bipolar disorder, based on manic affective disorder, then a number. Interestingly, OMIM reports
linkage on chromosomes 5, 11, and 18, which have not been identified in association studies.
Associations with DAOA (d-amino acid oxidase activator), DISC1 (disrupted in
schizophrenia 1), NRG1 (neuregulin 1), and DTNBP1 (dystrobrevin binding protein 1) have all
been reported. In the 2007 WTCCC1 study, the strongest signal was on chromosome
16p12 (Table 4.5). Genes close to this region with potential clinical relevance include:
• PALB2 (partner and localizer of BRCA2), which is involved in key structures within
cells including chromatin.
• NDUFAB1 [NADH dehydrogenase (ubiquinone) 1, α/β complex 1], which encodes a
mitochondrial respiratory chain subunit.
• DCTN5 (dynactin 5), which encodes a protein involved in intracellular transport that
is known to interact with the DISC1 protein, the gene for which is already implicated
in susceptibility to both bipolar disorder and schizophrenia.
Expanded analysis in the WTCCC1 study found strong associations with four regions
of the genome, one of which is close to the KCNC2 gene that encodes the Shaw-related
voltage-gated potassium channel. Changes in the dynamics of ion channel activity are
known to cause episodes of central nervous system disorders, including seizures, ataxia,
and paralysis. This may include the extreme mood changes seen in bipolar disorder.
Other highly ranked SNP signals indicated the importance of the γ-aminobutyric acid
(GABA) neurotransmission pathway, identifying three signals in particular: rs7680321,
rs1485171, and rs11089599, which are close to or within the genes for GABRB1 (encoding
the GABA A receptor β1), GRM7 (glutamate receptor, metabotropic 7), and SYN3
(synapsin III).
More recent GWAS
Following on from the 2007 WTCCC1 GWAS, additional studies of bipolar disorder
have identified novel/different SNP associations that include:
• DGKH (diacylglycerol kinase).
• MYO5B (myosin 5B).
• NCAN (neurocan), which is an extracellular matrix glycoprotein involved in cell adhe-
sion and migration.
• MAD1L1 (mitotic arrest deficient like 1), which is important in cell division.
OMIM reports linkage on chromosomes 5, 11, and 18, which have not been identified in association studies. The
WTCCC1 study data can be read in several ways; there are major associations, moderate associations, and
associations only reported in the supplementary texts. In this chapter, the major associations are listed together
with some of those from the supplementary text that could be identified by gene name. This leaves 13 listed as
moderate unaccounted for in T4.4 (these are on chromosomes 1p31, 2p25, 2q12, 2q14, 2q31, 2q37, 3q37,
6p21, 6q22, 9q32, 14q22, 14q32, and 20p13). It is assumed that the tagged SNPs for 3p23, 8p12, and 16p12 are
markers for GRM3, NRG1, and PALB2, respectively. DAOA, DTNBP1, and DISC1 are also not listed in T4.3 as
they fail to reach the minimum value for moderate association (P < 10⁻⁵).
The DGKH and MYO5B genes are not currently listed as potential candidates in bipolar
disorder, and interestingly these post-WTCCC1 studies did not immediately replicate the
findings of the WTCCC1 study itself. Altogether, it is quite challenging trying to find a
consistent message when reading through all of the papers on genetic association studies
in bipolar disorder. On first review of the data there appears to be no consistency between
them. However, further analysis, including meta-analysis, has revealed some important
new and consistent associations. These include novel associations with the CACNA1C
(calcium channel, voltage dependent, L-type α 1C subunit) and ANK3 [ankyrin 3, node
of Ranvier (ankyrin G)] genes, and confirmation of the NCAN association.
There are many reasons for these reported differences between initial studies and the
meta-analysis studies. Bipolar disorder is not easily diagnosed, the genetic associations
with the disease are quite weak, and some of the studies have been based on quite small
case cohorts. Meta-analysis can iron out some of these wrinkles. One particular study
that stands out and illustrates this well is that of Ferreira et al. (2008). In a GWAS of
1098 cases, Ferreira et al. first reported no major associations with bipolar disorder. The
authors suggested that this lack of agreement with earlier studies may have been due to the
small number in their own study cohort or the use of different tagged SNPs for their
analysis. However, the
authors were not discouraged. They decided to perform further analysis focusing on areas
of known association from other studies, especially the recently identified weak associa-
tion with the CACNA1C gene that had been identified by combining two data sets in a
meta-analysis. When the authors fed their own data into a meta-analysis with the two
previous studies, there was a weak association with the CACNA1C gene. In addition, by
referencing the existing data to HapMap they were able to impute the genotypes for each
case using PLINK software. This software uses the expected pattern of linkage disequilibrium
to assign haplotypes and increases the genome coverage, in this case from approximately
350,000 to over 1.5 million SNPs (see Chapter 12). Imputation is increasingly
used in GWAS, provided there is an adequate database of fully genotyped individuals
on which to make the haplotype assumptions for the test sample. In this case, assigning
non-genotyped alleles to the test samples on the basis of known linkage disequilibrium
is a very profitable exercise, reducing cost and increasing efficiency. However, the quality
of the study relies on the control data bank. This analysis identified the ANK3 gene on
chromosome 10q21 as the major susceptibility gene in bipolar disorder, with CACNA1C
as the second strongest.
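The idea of using linkage disequilibrium to fill in non-genotyped alleles can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the algorithm used by PLINK or by Ferreira et al.: the reference haplotypes, the three-SNP window, and the function name impute_allele are all hypothetical. The sketch picks the allele most often seen in reference haplotypes that match the sample at the typed positions.

```python
from collections import Counter

# Hypothetical reference panel (invented for illustration): each string is one
# fully genotyped haplotype across three neighbouring SNPs.
REFERENCE_HAPLOTYPES = [
    "ACG", "ACG", "ACG", "ACG",
    "GTA", "GTA", "GTA",
    "ATG",
]


def impute_allele(observed, panel):
    """Guess the allele at the single untyped position, marked '?'.

    observed: haplotype string such as "A?G" (exactly one '?').
    Returns (allele, support), where support is the fraction of matching
    reference haplotypes that carry the chosen allele, or (None, 0.0) when
    no reference haplotype matches the typed positions.
    """
    missing = observed.index("?")
    # Keep only reference haplotypes that agree with the sample at typed SNPs.
    matches = [
        hap for hap in panel
        if all(o == h for i, (o, h) in enumerate(zip(observed, hap)) if i != missing)
    ]
    if not matches:
        return None, 0.0
    counts = Counter(hap[missing] for hap in matches)
    allele, n = counts.most_common(1)[0]
    return allele, n / len(matches)


if __name__ == "__main__":
    print(impute_allele("A?G", REFERENCE_HAPLOTYPES))
    print(impute_allele("G?A", REFERENCE_HAPLOTYPES))
```

The support value makes the dependence on the reference panel explicit: a sparse or poorly matched panel gives low support or no call at all, which is the sense in which the quality of the study relies on the control data bank.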
It must be stressed that many of these associations with bipolar disorder did not reach
the significance threshold of P < 10⁻⁷ in any of the initial analyses or for any individual
group. Only the combination of the data provided P values of the required acceptable
magnitude.
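How combining data can turn several sub-threshold results into one that passes the threshold can be illustrated with a fixed-effect, inverse-variance meta-analysis. This is a minimal sketch under stated assumptions: the three effect sizes (log odds ratios) and standard errors below are invented for illustration and are not taken from any of the bipolar disorder studies, and the published analyses may have used a different combination method.

```python
import math

THRESHOLD = 1e-7  # significance threshold quoted in the text


def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def fixed_effect_meta(studies):
    """Inverse-variance fixed-effect meta-analysis.

    studies: list of (beta, se) pairs, where beta is a log odds ratio and
    se its standard error. Returns (pooled_beta, pooled_se, p_value).
    """
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, studies)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se, two_sided_p(pooled_beta / pooled_se)


# Hypothetical per-study estimates for one SNP (illustrative values only).
STUDIES = [(0.14, 0.040), (0.16, 0.045), (0.15, 0.038)]

if __name__ == "__main__":
    for beta, se in STUDIES:
        p = two_sided_p(beta / se)
        print(f"single study: P = {p:.2e}, passes threshold: {p < THRESHOLD}")
    beta, se, p = fixed_effect_meta(STUDIES)
    print(f"combined:     P = {p:.2e}, passes threshold: {p < THRESHOLD}")
```

With these invented inputs no single study reaches the threshold, but the pooled estimate does, because pooling shrinks the standard error while the effect size stays roughly the same.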
The future of genetic studies in bipolar disorder
Clearly, much more needs to be done in bipolar disorder, but taken together the observa-
tions identify a number of potentially important pathogenic pathways in bipolar disor-
der, particularly the GABA neurotransmission pathway. These studies also indicate the
potential for polymorphisms in the genes involved in calcium channel activity essential
for neurotransmitter release and genes from the solute carrier family (CACNA1C and
SLC39A3, respectively) to be functionally associated with bipolar disorder. The SLC39A3
gene encodes a solute carrier family 39 (zinc transporter) member 3. The association with
GRM7 illustrates the possibilities of unraveling the complexities of genetic links in bipolar
disorder. GRM7 encodes a member of a family of glutamate receptors. L-Glutamate is a
major excitatory neurotransmitter in the central nervous system. L-Glutamate activates
glutamate receptors such as GRM7 and is active during most normal brain activities.
GRM7 is one of a family of metabotropic G-protein-coupled receptors that has been
linked to the inhibition of the cyclic AMP cascade. It is quite easy to consider how poly-
morphisms that regulate this receptor could have functional effects in terms of neurotrans-
mission and how the downstream consequences of such interpersonal variation may lead
to a trait such as bipolar disorder (Table 4.6).
Table 4.6 The function of selected genes associated with increased risk of bipolar disorder.
Gene/allele | Function
PALB2 | Partner and localizer for stabilization of BRCA2. Acts as a molecular scaffold attracting BRCA1 and RAD51, ensuring stabilization of the BRCA1–PALB2–BRCA2 complex, which is required for homologous recombination.
KCNC2 | Potassium voltage-gated channel, Shaw-related family member 2. As the name implies, this mediates potassium ion permeability across the cell membrane.
NRG1 | Glial cell growth factor interacting with tyrosine kinase receptors to recruit ERBB1 and ERBB2 co-receptors, resulting in ligand-stimulated tyrosine phosphorylation and activation of ERBB receptors. Involved in induction of growth of glial and neuronal cells, among other cells; hence its early identification as a glial cell growth factor.
GABRB1 | γ-Aminobutyric acid A receptor β1. Encodes a subunit of a chloride channel that mediates rapid inhibition of synaptic transmission in the central nervous system. This gene is strongly associated with schizophrenia.
GRM7 | Metabotropic G-protein-coupled receptor for glutamate associated with inhibition of the cyclic AMP cascade.
SYN3 | May be involved in the regulation of neurotransmitter release and genesis of synapse.
CACNA1C | Calcium channel, voltage-dependent, L-type α1C subunit gene. Mediates influx of calcium into cells and is involved in neurotransmission.
ANK3 | Ankyrin G is specifically found at a neuronal junction or axonal segment known as the node of Ranvier in the central and peripheral nervous systems. It is one of a group of proteins thought to be associated with various functions in the cytoskeleton.
JAM3 | Junctional adhesion molecule 3. Promotes cell–cell adhesion.
SLC39A3 | Solute carrier family 39 member 3. Zinc transporter. Another member of the SLC family, SLC6A4, may be important in serotonin release.
NCAN | Neurocan. A chondroitin sulfate proteoglycan that is thought to be involved in control of cell migration and adhesion. Studies show that it is expressed in the hippocampus of mice and humans.
MAD1L1 | Mitotic arrest deficient-like 1. Important in cell division and regulation of processes in mitosis.
PBRM1 | Polybromo-1. Involved in transcriptional activation and repression of select genes by chromatin remodeling. It is thought to act as a negative regulator of cell proliferation.
ODZ4 | The ODZ4 gene encodes a human homolog of the Drosophila pair-rule gene Tenm (odz). The gene product may function as a signal transducer.
Functional studies on human tissue samples have shown that the NCAN and MAD1L1
genes are both expressed in the hippocampus, and studies on mice show NCAN to be
present in the cortical and hippocampal areas of the brain that are involved in cognition
and regulation of emotions.
The study examples above also indicate that similar diseases can share risk alleles, but they
also illustrate some of the difficulties of working with disorders that are not easily defined,
such as bipolar disorder, compared with working with a disease with more physical manifestations,
such as rheumatoid arthritis. Altogether, these studies serve as reminders about
study planning, design, and the limitations of association studies (discussed in Chapters
3 and 5). Finally, these findings also illustrate the importance of taking a holistic view
and not dismissing non-genetic factors. Many factors are known to contribute to bipolar
disorder, and not all of them are genetic (Figure 4.12).
Figure 4.12: Different potential causes of bipolar disorder. The figure illustrates probable contributing
factors to bipolar disorder. They may act individually or in combination. There is no restriction on the number
of elements active. For example, stress in a genetically susceptible individual may trigger bipolar disorder, but
nutritional status may also be a factor. However, it must be noted these are only possible relationships and not
definite relationships.
Figure 4.13: Impact of coronary artery disease based on figures from the British Heart Foundation. These
figures for the UK population in 2008 show mortality for coronary heart disease, cardiovascular disease, stroke,
and cancer against other diseases and pathologies in men and women. It is quite clear that coronary artery
disease-related conditions [coronary heart disease (CHD), cardiovascular disease (CVD), and stroke] account for
a significant portion of deaths considering that coronary heart disease, cardiovascular disease, and stroke may
all be the consequences of coronary artery disease. Compare the figures given for these three problems with
diabetes, which is considered a major scourge and accounted for approximately 1% of deaths in 2008 in the UK.
In comparison, the figures for coronary heart disease, cardiovascular disease, and stroke were 18, 8, and 7% in
men (total 33%) and 13, 12, and 8% in women (total also 33%) that same year.
Table 4.7 Risk loci for coronary artery disease identified in the 2007 WTCCC1 GWAS.
The lower case “rs” denotes the tagged SNP for the region. Not all potential regions identified were associated
with specific genes in the WTCCC1 study. Many potential candidates were suggested following further analysis.
Note that the associations with ALOX5AP and LTA4H, identified as candidates prior to 2007, were not confirmed in the
WTCCC1 study. The values quoted depend on the reading of the paper and which table they were extracted from,
as a range of possibilities are given in the different analyses. Note that not all associations were replicated in later
studies.
The WTCCC1 study was also unable to replicate previously reported associations with the
APOE gene. This problem may have been due to the fact that the tagged SNPs on the
Affymetrix gene chip that was used in the study were not good markers for identifying
APOE compared with direct genotyping. Interestingly, this explanation appears to be sound
because studies in 2011 confirmed the association with APOE in cardiovascular disease.
Considering the function of APOE, which mediates the binding, internalization, and catab-
olism of lipoprotein particles, this gene makes a very likely candidate. The APOE protein
can serve as a ligand for the low-density lipoprotein receptor as well as interacting with the
APOE receptor.
The 2007 WTCCC1 study and cardiovascular disease
The most important single observation on cardiovascular disease in the WTCCC1
study was a new and strong (P < 10⁻¹⁴) association with a SNP on chromosome 9p21.3.
The region contains genes for two cyclin-dependent kinase inhibitors (CDKN2A and
CDKN2B). Both genes are widely expressed and are involved in the regulation of the cell
cycle. CDKN2B is expressed in macrophages but not smooth muscle cells with fibrous
lipid lesions. CDKN2B expression is induced by transforming growth factor-β, a signaling
system that has been associated with cardiovascular disease. In addition to the two
CDKN genes, the SNP on 9p21.3 maps close to the MTAP gene, coding for a methylthioadenosine
phosphorylase enzyme that contributes to processes involved in adenine
and methionine salvage. This is a ubiquitously expressed gene present in the cardiovascular
system and it is also listed as a candidate for cardiovascular disease. Genetic variation in
any one or all of these genes could contribute to the risk of cardiovascular disease. The
association with CDKN2A–CDKN2B remains the strongest reported and most consistent
association for cardiovascular disease.
The other associations with cardiovascular disease reported in the 2007 WTCCC1 study
include modest associations with a SNP (rs6922269) within the MTHFD1L gene [meth-
ylene tetrahydrofolate dehydrogenase (NADP+-dependent) 1-like]. This gene encodes the
mitochondrial isozyme of C1 tetrahydrofolate (THF) synthase, which converts the single
carbon units carried in folic acid. C1 THF synthase is used in several biological processes,
including purine synthesis. Cardiovascular disease has also been associated with mutations
in another THF enzyme, i.e. THF reductase (encoded by MTHFR). Interestingly, both
MTHFD1L and MTHFR activity can influence plasma levels of homocysteine, indicating
a common pathway for cardiovascular disease. One other gene identified in the expanded
analysis as having a major risk for cardiovascular disease is the ADAMTS17 (a disintegrin-
like and metalloproteinase with thrombospondin type 1 motif 17). This gene encodes a
protein involved in vascular extracellular matrix degradation and remodeling—a process
central to atherosclerosis.
GWAS in cardiovascular disease beyond 2007
Since the publication of the WTCCC1 GWAS there have been many other studies of car-
diovascular disease using GWAS. Two studies published in 2011 stand out. The first study
was a meta-analysis of data from 14 GWAS comprising 22,233 cases and 64,762 controls
with a replication cohort of 56,682 cases performed by the CARDIoGRAM Consortium.
In this analysis, 13 loci were identified for the first time as risk loci for cardiovascular dis-
ease and 10 of 12 previously reported risk loci were confirmed. The data from this study
are summarized in Table 4.8. The authors of this study found that only three of the new
loci were associated with traditional markers for cardiovascular disease, with the remain-
der being in gene regions not previously implicated in the pathogenesis of cardiovascular
disease. The report also highlights the overlap between risk markers as five of the new loci
have strong associations with other diseases or traits.
The second study in 2011 (also summarized in Table 4.8) used the data from the study
above and a second data set produced by the C4D Consortium based on approximately
40,000 cases. This combined report identified 35 common gene variants at 34 loci as risk
factors for cardiovascular disease and suggested that these common variants account for
only 13% of the overall genetic risk of cardiovascular disease (Table 4.9).
The reports themselves are too detailed to reiterate every fact in this chapter, but they raise
a number of important issues for anyone looking at published data or about to set out on
a GWAS.
Many of the associations were not replicated
A review of the report of the 2007 WTCCC1 study in the light of the 2011 studies
shows that only one of the associations for cardiovascular disease reported in 2007 has
been replicated and all of the others are not confirmed (ADAMTS7, MTHFD1L, and
MTHFR in particular) at a significance threshold of P < 5 × 10⁻⁸. One explanation for
these failures in replication is the application of different gene chips or the closer asso-
ciation with tagged SNPs in a different location which may be identified in subsequent
studies, but not identified (or perhaps even tested) in the initial study. However, all is not
lost as some of this discordance may be explained through linkage disequilibrium whereby
SNPs associated with CAD in original studies are in close linkage with other loci that have
themselves been claimed as risk genes for cardiovascular disease. For example: MIA3 (mel-
anoma inhibitory activity family 3) on 1q41, APOE on 19q13, and ADAMTS17 (ADAM
The CARDIoGRAM study data presented here indicate confirmation of 12 locations and 13 novel genes with
P < 10⁻⁸, with one exception (SH2B3 at 6.35 × 10⁻⁶).
Table 4.9 Risk loci identified in other GWAS (excluding WTCCC and CARDIoGRAM reports shown
in T4.7 and 4.8 respectively), but included in the list of 35 loci reported by Peden and Farrall
(2011).
The data presented here represent those loci not listed in T4.7 and 4.8. Interestingly, the list includes APOE,
which was not reported by the WTCCC1 2007 study, and IL5. Both of these gene loci have been associated with
susceptibility to other diseases.
new loci at the 5 × 10⁻⁸ threshold of significance and confirmed the associations with
CDKN2A–CDKN2B reported in the European population, as well as the associations
with PHACTR1 (phosphatase and actin regulator 1), TCF21 (transcription factor 21),
and C12orf51 (chromosome 12 open reading frame 51).
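The 5 × 10⁻⁸ threshold is conventionally explained as a Bonferroni correction of a 0.05 significance level for roughly one million independent tests of common variants. That rationale is an assumption brought in here for illustration and is not stated in this chapter. The arithmetic can be sketched as follows, with function names that are purely illustrative.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def genome_wide_significant(p_value, alpha=0.05, n_tests=1_000_000):
    """True when p_value passes the Bonferroni-corrected threshold.

    The default of one million independent tests is an assumed, conventional
    figure, not a property of any particular genotyping chip.
    """
    return p_value < bonferroni_threshold(alpha, n_tests)


if __name__ == "__main__":
    print(bonferroni_threshold(0.05, 1_000_000))  # close to 5e-8
    # P value quoted for SH2B3 in the note to Table 4.8: does not pass.
    print(genome_wide_significant(6.35e-6))
    # A value of the order reported for the 9p21.3 SNP: passes.
    print(genome_wide_significant(1e-14))
```

This makes clear why an association such as SH2B3 at 6.35 × 10⁻⁶ is flagged as an exception: it would count as significant in a single test, but not after allowing for the number of SNPs examined.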
Table 4.10 Risk loci for coronary artery disease reported in the Han Chinese population.
This table illustrates the importance of studying several different populations. The listed risk loci for
coronary artery disease in the Han Chinese population include some loci also identified as risk loci in
Europeans and some only found in the Han Chinese. Not all the loci identified in Europeans are confirmed in the
Han Chinese.
On close reading, the Chinese study illustrates a number of important issues in this area
of research. Most important among these is the observation that not all of the associations
reported in Europeans are replicated in the Chinese population (Table 4.10). This population
variation is well demonstrated by the SNPs that identify common genetic variations
in the 12q24 region. The variations in this region that are common in Europeans are not
found in the Chinese (i.e. the Han population appears to be monomorphic with respect
to these particular variants), whereas all of the variants at 12q24 reported in the Han
Chinese appear to be monomorphic in the European population. Population variation is
a major factor in genetic association studies. Even where there is general agreement about
associations, there are often subtle differences between populations. This can be seen with
respect to the CDKN2A–CDKN2B loci, where the pattern of linkage disequilibrium varies
between the Han Chinese and the European populations even though both populations
carry genetic associations with the CDKN2A–CDKN2B haplotype. However, not all
associations are restricted to single populations and there is considerable overlap here. In
addition to the four confirmed associations reported, the Chinese study investigated 29
different cardiovascular disease-associated loci identified from European studies and found
a further seven with consistent nominal associations (1p32.2, 1q41, 10q23.31, 10q24.32,
11q22.3, 15q25.1, and 17p13.3). Furthermore, 11 of the SNPs (at 10 loci) associated
with cardiovascular disease in Europeans were monomorphic in the Han Chinese population,
and use of proxy SNPs as substitutes for the monomorphic SNPs identified three
further associations at 3q22.3, 6p26, and 17p11.2. In their conclusion, the authors stated
that both shared and unique genetic associations occur in this complex disease. This idea
of a mixture of shared and unique susceptibility alleles in different populations is fundamental
in complex disease. Comparison of populations may be very fruitful in helping us
to understand disease pathogenesis; comparing and contrasting the data may pull out
the real plums.
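Two quantities underlie the population comparison above: whether a SNP is monomorphic in a population, and how strongly two SNPs are in linkage disequilibrium (which determines whether one can serve as a proxy for the other). The following sketch computes both from simple counts. All counts are invented for illustration and do not come from the Han Chinese or European data sets; the r² statistic is one standard measure of linkage disequilibrium, used here as an assumption about what a proxy-SNP analysis would rely on.

```python
def minor_allele_frequency(genotype_counts):
    """Minor allele frequency from genotype counts (n_AA, n_Aa, n_aa)."""
    n_AA, n_Aa, n_aa = genotype_counts
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    freq_a = (n_Aa + 2 * n_aa) / total_alleles
    return min(freq_a, 1 - freq_a)


def is_monomorphic(genotype_counts):
    """True when only one allele is observed in the sample."""
    return minor_allele_frequency(genotype_counts) == 0.0


def ld_r_squared(hap_counts):
    """r^2 between two biallelic SNPs from two-SNP haplotype counts.

    hap_counts: dict with keys 'AB', 'Ab', 'aB', 'ab'.
    Returns None when either SNP is monomorphic, because r^2 is undefined
    and the SNP cannot be tested or used as a proxy in that population.
    """
    n = sum(hap_counts.values())
    p_AB = hap_counts["AB"] / n
    p_A = (hap_counts["AB"] + hap_counts["Ab"]) / n
    p_B = (hap_counts["AB"] + hap_counts["aB"]) / n
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        return None
    d = p_AB - p_A * p_B  # linkage disequilibrium coefficient D
    return d * d / denominator


if __name__ == "__main__":
    # Hypothetical SNP: common in population 1, monomorphic in population 2.
    print(minor_allele_frequency((360, 480, 160)), is_monomorphic((500, 0, 0)))
    # Hypothetical SNP pair: perfect LD in one population, none in another.
    print(ld_r_squared({"AB": 50, "Ab": 0, "aB": 0, "ab": 50}))
    print(ld_r_squared({"AB": 25, "Ab": 25, "aB": 25, "ab": 25}))
```

A SNP that is monomorphic in a population carries no information there, which is why a proxy in high r² with the original variant has to be chosen separately for each population.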
[Figure 4.14 diagram. Genes shown: APOE, IL5, MIA3, APOA5-A1, CXCL12, ADAMTS7, PSK, ABO, SH2B3, COL4A1, LIPA, PDGFD, COL4A2, and LDLR. Pathway labels: cholesterol and lipid, immune regulation, coagulation, fibrosis, cells, and T cells, all converging on CAD.]
Figure 4.14: Systems and pathways in coronary artery disease (CAD). This simple picture illustrates the
broad concept that identifying risk genes may help us understand disease pathology. Some of the candidate
alleles identified (e.g. cholesterol and lipid) are quite obvious candidates for coronary artery disease. It is not
possible to list all of the susceptibility alleles nor is it yet possible to make a clear link between some of the
identified associations and disease pathology. Genes that are associated with cell growth (e.g. CDKN2A and
CDKN2B) are not included here nor are genes that are associated with other activities, including formation of
cellular particles (WDR12), different kinase pathways (MRAS), phosphatase activity (PHACTR1), mitochondria
(MRPS6), cell differentiation (TCF21), a number of zinc finger genes (ZC3HC1 and ZNF259), a number of
cytochromes (CYP17A1), ATP synthetase (ATP5G1), and nitric oxide signaling (GUCY1A3). Finally, this illustration
does not take account of the strong environmental impact in coronary artery disease.
confirmed are the best place to start when we begin to construct a picture of the disease
pathology. These genetic associations are after all the most reliable and most frequently
confirmed, and they also tend to be those which are seen in multiple populations. It is
more difficult to link function to other genes (e.g. the zinc finger genes) and cardiovascular
disease, but these too are associated with an increased risk of cardiovascular
disease and therefore a better understanding of genotype–phenotype relationships is
required.
Shared risk alleles between cardiovascular disease and other diseases
We should also not overlook the genetic similarities between diseases. Autoimmune dis-
eases share risk alleles. Genetic variants associated with the immune response play a major
role in many diseases and often the same allele is listed in two or more. With reference to
cardiovascular disease, it is tempting to look for links with T2D (Chapter 10). However,
despite there being considerable overlap between cardiovascular disease and T2D, studies
so far have identified very few shared risk alleles. The two genes that have been identified
in both T2D and cardiovascular disease are CDKN2A, which encodes a cyclin-dependent
kinase inhibitor on chromosome 9p21.3, and SH2B3 on chromosome 12q13. CDKN2A
is one of two cyclin-dependent kinase inhibitors associated with cardiovascular
disease. SH2B3 maps close to a tyrosine kinase precursor gene ERBB3. A functional link
is yet to be established between these genes and either of the two diseases, but associa-
tions shared between diseases are always interesting because they may indicate a common
functional link in the disease pathology. Looking at cardiovascular disease and T2D, it is
surprising to see so few shared associations. However, it is possible that there are shared
associations out there, and with larger studies and higher-resolution genotyping or even
direct sequence analysis they will be cataloged eventually.
Figure 4.15: Similarities and differences in risk alleles in different populations with the same diseases.
The figure illustrates the genetic risk portfolio for a disease in two populations: population 1 and population 2.
The smaller oval shape illustrates the disease population for both populations 1 and 2. The oval is subdivided into
three regions. Those in areas A and B share no risk alleles between the two populations, while those in area C
share risk alleles.
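The three regions of Figure 4.15 correspond to simple set operations on the lists of risk loci reported in each population. The sketch below assumes that area A holds loci seen only in population 1, area B loci seen only in population 2, and area C the shared loci; the locus names are placeholders, not real findings.

```python
def partition_risk_loci(population_1, population_2):
    """Split risk loci into the three regions of Figure 4.15.

    Assumed mapping: 'A' = population 1 only, 'B' = population 2 only,
    'C' = shared by both populations.
    """
    loci_1, loci_2 = set(population_1), set(population_2)
    return {
        "A": loci_1 - loci_2,
        "C": loci_1 & loci_2,
        "B": loci_2 - loci_1,
    }


if __name__ == "__main__":
    # Placeholder locus names, for illustration only.
    regions = partition_risk_loci(
        ["locus_1", "locus_2", "locus_3"],
        ["locus_3", "locus_4"],
    )
    for area in ("A", "C", "B"):
        print(area, sorted(regions[area]))
```

In practice the comparison is less clean than this, because a locus can appear absent from one population simply because the tagged SNP is monomorphic or in a different pattern of linkage disequilibrium there.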
Conclusions
Considerable scientific resources, both financial and intellectual, have been ploughed into
studying the genetics of complex disease. One may ask why. After all, complex
diseases do not lend themselves to simple analysis, and they do
not conform to Mendelian patterns of inheritance. The answer lies with the promises
of the HGP. Under this banner, three promises can be considered: (1) identifying risk
alleles will aid disease diagnosis, (2) identifying risk alleles will aid patient treatment/
management and care, and (3) identifying risk alleles will inform the debate on disease
pathogenesis and may ultimately lead to the development of new therapies. Clearly these
outcomes are all linked and though at present there is less potential to use risk alleles in
disease diagnosis, there is great potential to use the new genetics in patient management,
including the potential for developing personalized therapy. However, so far, the greatest
success has been with the third promise.
This is illustrated through the selected disease examples discussed above: ankylosing
spondylitis, HCV, rheumatoid arthritis, bipolar disorder, and cardiovascular disease. In
ankylosing spondylitis, the idea of molecular mimicry based on the shared sequence of
self-peptides and those of pathogens was proposed long before the publication of the
map of the human genome, but recent GWAS have led to the identification of mul-
tiple immunoregulatory genes, all of which helps to support the concept of this being an
immune-mediated or autoimmune disease and will enable us to dissect the pathogenesis
of this disease. In fact, significant progress is being made in ankylosing spondylitis following
recent GWAS. As with ankylosing spondylitis, in HCV the association with HLA-DQB1*03:01
has been linked to T cell function, though this was also identified prior
to the publication of the first draft of the map of the human genome. However, once
again, subsequent genetic studies have identified several other immune risk alleles for
HCV, some of which may interact with DQB1*03:01 and others that may influence viral
clearance independently. These include IFNL3 (previously IL28B) and IFNL4. In rheumatoid
arthritis, multiple risk alleles mostly associated with immunity have been identified,
including some cytokines and some chemokine receptors (e.g. IL6 and CCR6). In bipolar
disorder, the two major genes for focus are ANK3 and CACNA1C. It is easy to see the
potential for polymorphisms in the genes involved in calcium channel activity (such as
CACNA1C) that is essential for neurotransmitter release being important in bipolar disor-
der. In cardiovascular disease, it is easy to see the potential for genes in lipid metabolism,
cell growth, fibrosis, and regeneration as potential risk genes.
However, this will all mean nothing unless we can link genotype to phenotype—the next
piece in the jigsaw. It is acceptable to be selective in deciding which genes to look at as
long as all of the potential candidates are considered at some point. Otherwise, the genotyping
will have been a waste of time and the phenotyping work will fall into the same
trap as early genetic association studies, i.e. rejection of positive associations,
this time not on the basis of numbers or study design (as in the past) but on the basis of absence
of knowledge or absence of an obvious (simple) link.
Together, these examples illustrate the potential of this work. There is clear progress and
there is a great deal to be gained from this work. However, science is a slow process and
not all studies are easily replicated. This is most strongly illustrated in studies of bipolar
disorder. Overall, we must be sensible and proceed with caution because there is still a
great deal of work to be done.
Further Reading
Books
Lechler R & Warrens A (eds) (2000) HLA in Health and Disease, 2nd ed. Elsevier.
Strachan T & Read AP (2011) Human Molecular Genetics, 4th ed. Garland Science. Chapter 16 deals with identifying human genes and susceptibility factors, and provides excellent additional data for students wishing to delve deeper into medical genetics. There is also additional relevant information in the other chapters.
Tiwari JL & Terasaki PI (1985) HLA and Disease Associations. Springer. Chapter 8 on Neurology covers early studies of multiple sclerosis and narcolepsy plus other neurological conditions. One of the authors of the 1985 book, Paul Terasaki, is considered as one of the founding fathers of HLA typing and immunogenetics.
Articles
Barrett JC (2012) From HLA association to function. Nat Genet 44:235–236. Review of the paper by Raychaudhuri et al. (2012).
Baum AE, Akula N, Cabanero M et al. (2008) A genome-wide association study implicates diacylglycerol kinase eta (DGKH) and several other genes in the etiology of bipolar disorder. Mol Psychiatry 13:197–207.
Brewerton DA, Hart FD, Nicholls A et al. (1973) Ankylosing spondylitis and HL-A 27. Lancet 301:904–907. This is the original identification of association between ankylosing spondylitis and HLA-B*27; see also Schlosstein et al. (1973).
Brown MA, Pile KD, Kennedy LG et al. (1996) HLA class I associations of ankylosing spondylitis in the white population in the United Kingdom. Ann Rheum Dis 55:268–270. This paper is the gold standard paper for ankylosing spondylitis and HLA-B*27.
Cichon S, Muhleisen TW, Degenhardt FA et al. (2011) Genome-wide association study identifies genetic variation in neurocan as a susceptibility factor for bipolar disorder. Am J Hum Genet 88:372–381. This article is a very good overview of the genetics of bipolar disorder with original data and some good examples on how to plan GWAS.
Collins FS & McKusick VA (2001) Implications of the human genome project for medical science. JAMA 285:540–544. This is a very important review that provides the reader with a guide to and an illustration of the potential for understanding complex disease in the immediate post-genome period.
Ferreira MA, O’Donovan MC, Meng YA et al. (2008) Collaborative genome-wide association analysis supports a role for ANK3 and CACNA1C in bipolar disorder. Nat Genet 40:1056–1058.
Helgadottir A, Manolescu A, Thorleifsson G et al. (2004) The gene encoding 5-lipoxygenase activating protein confers risk of myocardial infarction and stroke. Nat Genet 36:233–239.
Helgadottir A, Manolescu A, Helgason A et al. (2006) A variant of the gene encoding leukotriene A4 hydrolase confers ethnicity-specific risk of myocardial infarction. Nat Genet 38:68–74.
Hirschhorn JN, Lohmueller K, Byrne E & Hirschhorn K (2002) A comprehensive review of genetic association studies. Genet Med 4:45–61. This paper provides an interesting pre-GWAS view of association studies.
Lees CW, Barrett JC, Parkes M & Satsangi J (2011) New IBD genetics: common pathways with other diseases. Gut 60:1739–1753. This paper provides a current summary of the genetics of Crohn’s disease and also highlights the possibility of shared pathways.
Lu X, Wang L, Chen S et al. (2012) Genome-wide association study in Han Chinese identifies four new susceptibility loci for coronary artery disease. Nat Genet 44:890–894. This study not only identifies association with cardiovascular disease specific to the Han Chinese population, but also confirms some shared associations with the European population. Data from this study are summarized in Table 4.10.
Peden JF & Farrall M (2011) Thirty-five common variants for coronary artery disease: the fruits of much collaborative labour. Hum Mol Genet 20:R198–R205. This review paper summarizes the findings of two major groups (the CARDIoGRAM Consortium and the C4D Consortium) and identifies a total of 35 associated common variants. The data from this paper are summarized in Tables 4.8 and 4.9. The paper provides a useful discussion of the differences between the two studies.
Raychaudhuri S, Sandor C, Stahl EA et al. (2012) Five amino acids in three HLA proteins explain most of the association between MHC and seropositive rheumatoid arthritis. Nat Genet 44:291–296. This original paper is reviewed by Barrett (2012); reading both is useful to give a full understanding of the paper and the meaning of the results.
Schlosstein L, Terasaki PI, Bluestone R & Pearson CM (1973) High association of an HL-A antigen W27 with ankylosing spondylitis. N Engl J Med 288:704–706. One of two original reports (see also Brewerton et al. 1973) of the genetic association of HLA-B*27 with ankylosing spondylitis. This example illustrates how not all findings from pre-genome and pre-GWAS studies failed to be replicated. Notice in both cases the HLA nomenclature differs from present. Standard nomenclature was only just coming into use at this time and in retrospect these differences can cause some confusion when looking at early texts. However, this allele family is now labelled HLA-B*27. It is important to note the quality of phenotype for B antigens in 1973 was reasonable for common antigens such as B27, but poor for rare antigens and variants.
Schunkert H, Konig IR, Kathiresan S et al. (2011) Large scale association analysis identifies 13 new susceptibility loci for coronary heart disease. Nat Genet 43:333–340. This paper describes a large-scale study on cardiovascular disease based on a meta-analysis of 14 GWAS. The data are presented in Table 4.8.
Sklar P, Smoller JW, Fan J et al. (2008) Whole genome association study of bipolar disorder. Mol Psychiatry 13:558–569.
Stahl EA, Raychaudhuri S, Remmers EF et al. (2010) Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat Genet 42:508–515. This is an excellent review of genetics of rheumatoid arthritis and provides a good starting point for any student looking at this disease.
Stranger BE, Stahl EA & Raj T (2011) Progress and promise of genome-wide association studies for human complex trait genetics. Genetics 187:367–383. This is an excellent review of the planning and how to do GWAS with data on rheumatoid arthritis to illustrate the ideas. There is fresh data in this paper to add to the paper by Stahl et al. (2010).
Symmons D, Turner G, Webb R et al. (2002) The prevalence of rheumatoid arthritis in the United Kingdom: new estimates for a new century. Rheumatology 41:793–800.
The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.
The Human Genome (2001) Nature 409:813–958. The complete issue of Nature mostly dedicated to the HGM with multiple papers, letters, and editorial commentary from multiple authors. It is full of useful insight and critical discussion. Any student of human genetics should read this issue.
The International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437:1299–1320.
The International HapMap Consortium (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467:52–58.
The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678. This is another landmark paper highlighting the development and application of new technologies (GWAS) in complex disease research, and it provides a mass of useful information for those studying complex disease.
Vassos E, Steinberg S, Cichon S et al. (2012) Replication study and meta-analysis in European samples supports association of the 3p21.1 locus with bipolar disorder. Biol Psychiatry 72:645–650. This paper describes a recent GWAS and meta-analysis study of bipolar disorder that illustrates some of the problems and uses of this type of analysis.
Zhang J, Zahir N, Jiang Q et al. (2011) The autoimmune disease-associated PTPN22 variant promotes calpain-mediated Lyp/Pep degradation associated with lymphocyte and dendritic cell hyperresponsiveness. Nat Genet 43:902–907.
Online sources
http://www.wtccc.org.uk/ccc2
http://www.ncbi.nlm.nih.gov/SNP This is an essential site for updates on SNPs.
http://www.ncbi.nlm.nih.gov/omim This is an essential site for updates and information on genetic disease of all types.
Now that we have considered why we are interested in the genetics of complex disease and
how to design studies, it is important to consider how to handle the data from these studies.
This is done through the application of complex statistical analyses. Statistics is no longer
a minor consideration in genetics; statistical genetics has become a science in itself and
statistical geneticists have become major players in complex disease research. The reasons
behind this are clearly illustrated in Chapter 3. In the past, without good statistical advice,
studies were often badly designed and this led to numerous false-positive and false-negative
genetic associations. This is not a statistics textbook, but given the importance of the
subject it is essential to consider the basic principles that are currently applied in complex
disease. Therefore, this chapter will consider some of the different methods available to test
statistical significance in both linkage and association analysis. It will start with some
basic methods applied to early studies, but will concentrate mostly on statistics in
genome-wide association analysis.
As stated in Chapter 3, there are two types of study in complex disease: linkage and asso-
ciation. These have been discussed at some length in previous chapters and they will only
be considered here from a statistical viewpoint.
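[Figure 5.1 pedigree, as labelled in the figure. Haplotypes are written as ABO allele followed
by nail–patella allele, with the two chromosomes separated by "/".]
Blood group key: A = IAIA or IAi; B = IBIB or IBi; O = ii.
Generation I: I-1, A (IAn / iN); I-2, B (IBn / in).
Generation II: II-1, B (IBn / iN); II-2, A (IAn / in); II-3, O (in / iN); II-4, B (IBn / iN);
II-5, A (IAn / in); II-6, A (IAn / in).
Generation III: III-1, A (IAn / iN); III-2, A (IAn / in); III-3, B (IBn / in); III-4, B (IBn / in);
III-5, O (in / in).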
Figure 5.1: Linkage between ABO blood group and nail–patella syndrome. The figure illustrates three
generations of a family, and shows the blood group and phenotype for nail–patella syndrome (a Mendelian
autosomal dominant disorder). The ABO blood type is indicated in each circle or square (IAIA, IBIB, and ii). The
genotype inferred from the phenotype is given below each circle or square. There are two possibilities for the
nail–patella genes: n for wild-type and N for mutant. Circles indicate female family members and squares indicate
male family members. In this figure the grandfather, generation I, person 1 (I-1), is blood group A heterozygous IAi
and is heterozygous for the nail–patella trait (nN), and will display the trait (indicated here by black symbols). The
grandmother, generation I, person 2 (I-2), is blood group B heterozygous and carries two copies of the wild-type
nail–patella gene, and is therefore unaffected (indicated in white). The grandfather’s haplotypes are IAn and iN.
Those who inherit the paternal i allele in this family also inherit the N nail–patella gene mutation (all shown as black
squares or circles). Not all O blood group individuals have nail–patella syndrome (it is quite a rare trait); however, this
example indicates extreme linkage disequilibrium and suggests that the two genes are not independently assorted.
$$I^{A}i\;Nn \times I^{B}i\;nn$$
In the second generation we find that only those offspring who inherit the paternal i allele
have the nail–patella trait and offspring with blood group A do not. This suggests that the
ABO and nail–patella loci are not independently assorted. If these two loci were independently
assorted, then some of the children with the paternal A blood group could inherit the
nail–patella syndrome from their father. This outcome is really important as we need this
information to determine whether the nail–patella mutation in the father is on the same
chromosome as the marker allele (IA) or not. This is called determining the phase of these
two loci. We can establish the phase of the above loci in the parents as:
$$\frac{I^{A}\,n}{i\,N} \times \frac{I^{B}\,n}{i\,n}$$
Further examination of the pedigree in Figure 5.1 shows that while there is no recombi-
nation in the offspring of generation I, there are two instances of recombination in the
offspring of generation II. If we focus on the couple II-1 and II-2 with the following
genotypes:
$$\frac{I^{B}\,n}{i\,N} \times \frac{I^{A}\,n}{i\,n}$$
we can see that one of the children (male III-2) is unaffected and has blood group A with
the following genotype:
$$\frac{I^{A}\,n}{i\,n}$$
This child must have inherited the i and n alleles together from his father, rather than the
i and N combination that was expected. Because i and n are on different chromosomes in the
father, a crossover must have taken place. Similarly, crossover must have occurred
in child III-3.
Individuals III-2 and III-3 are examples of recombination between the two loci for blood
type (in this case the marker locus) and disease, and from this information we can calculate
the recombination fraction (θ) for this family. The distance between two loci is measured in
units of centimorgans (denoted cM and named after the geneticist Thomas Hunt Morgan). If two
loci are 1 cM apart, then there is a 1% chance of recombination between the two loci as the
chromosome is passed from one generation to the next.
It is essential to have informative families in linkage analysis, preferably with a large number
of offspring and several generations with both affected and unaffected members in each.
When small numbers are included there is a greater risk of false-positive and false-negative
results. Recombination is more frequent in cases where the trait and marker locus are not
close together, and the evidence for linkage can be quite low. Linkage analysis needs to
take into account these possibilities. The LOD score (logarithm of odds) is a measure of
the significance of linkage.
The LOD score is the logarithm (to base 10) of the odds of linkage and it is calculated using
the recombination fraction between two loci on the same chromosome, which is denoted by the
Greek letter θ. θ is the probability of a recombination event between two loci and it is a
function of the distance between the two loci. Two unlinked loci have θ = 0.5; the closer they
are, the lower the recombination fraction becomes. We use the likelihood ratio or OR, the ratio
between the likelihood (L) of the data at the estimated recombination fraction $\hat{\theta}$
and the likelihood of the data if the loci are unlinked, to assess whether two loci are linked:
$$\mathrm{OR} = \frac{L(\theta = \hat{\theta})}{L(\theta = 0.5)}$$
If loci were unlinked, the most likely recombination frequency would be 1/2 or 0.5, and
in this case the numerator and the denominator of the ratio would be the same, and thus
the OR would be 1. The easiest way to work with the above likelihood ratio is to take its
logarithm to base 10 (the logarithm of odds or LOD score):
$$\text{LOD score} = Z(\theta) = \log_{10}\frac{L(\theta = \hat{\theta})}{L(\theta = 0.5)}$$
where $\hat{\theta}$ is the calculated recombination fraction, which reflects the distance
between the two loci. A likelihood ratio of 1:1 equals a LOD score of 0, a ratio of 10:1 equals
a LOD score of 1, a ratio of 100:1 equals a LOD score of 2, etc. A way of calculating LOD scores
using the above equation is to consider the number of births in a family (offspring) and
the number of recombinants as:
$$\text{LOD score} = Z(\theta) = \log_{10}\frac{(1-\theta)^{\,n-r}\,\theta^{\,r}}{(0.5)^{n}}$$
where n is the total number of births and r is the number of recombinant types. If we
assume a test family with 18 non-recombinant offspring and two recombinant offspring
between two loci A and B, we need first to know the distance between the two loci.
Distance is measured in centimorgans (cM) and in this example is 10 cM, which corresponds to a
recombination frequency of 10% and gives the value of θ. We will have:
$$n = 18 + 2 = 20, \qquad r = 2, \qquad \theta = \frac{10}{100} = 0.1$$
$$Z(0.1) = \log_{10}\frac{(1-0.1)^{20-2}\,(0.1)^{2}}{(0.5)^{20}} = 3.2$$
The resulting LOD score of 3.2 indicates that the two tested loci A and B show a strong
linkage.
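The worked example above can be checked numerically. The following is a minimal sketch in Python (the text itself mentions only R as a package); the function name `lod_score` and its argument names are illustrative assumptions, and the function simply evaluates the formula for Z(θ) given above on the log scale.

```python
import math


def lod_score(n_offspring: int, n_recombinants: int, theta: float) -> float:
    """Z(theta) = log10[(1 - theta)^(n - r) * theta^r / 0.5^n].

    n_offspring (n) and n_recombinants (r) follow the symbols in the text;
    the function name itself is a hypothetical choice for this sketch.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in the interval (0, 0.5]")
    n, r = n_offspring, n_recombinants
    # log10 likelihood of the family data at the given recombination fraction
    log_linked = (n - r) * math.log10(1.0 - theta) + r * math.log10(theta)
    # log10 likelihood of the same data if the loci are unlinked (theta = 0.5)
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked


# 18 non-recombinant and 2 recombinant offspring, theta = 0.1
z = lod_score(n_offspring=20, n_recombinants=2, theta=0.1)
print(round(z, 1))  # 3.2
```

Working with sums of logarithms rather than the raw products avoids very small intermediate numbers when families are large.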
By convention a Z(θ) greater than or equal to 3 (likelihood ratio = 1000:1, in favour of
linkage) is considered to be proof of linkage between two loci. A Z(θ) of −2 (likelihood
ratio = 0.01:1) or less is instead taken as evidence against linkage. A Z(θ) between −2.0 and
3.0 is considered inconclusive and warrants further study. Some authors, however,
consider a Z(θ) of 2 (likelihood ratio = 100:1, in favour of linkage) as strong evidence
of linkage between two loci.
Figure 5.2: Type I and type II statistical errors. The figure shows a number of different possibilities arising from
statistical errors. The null hypothesis (H0) states that there is no statistical difference between two groups.
The probability of a type I error is denoted α and is the same α as the significance level of a
test; we reject the null hypothesis if the inferred P value is less than the significance level
(or threshold) α. We need to establish the value of α before we run the analysis. A conventional
α of 0.05 is commonly assigned, though we may choose a more restrictive value such as 0.01,
0.001, or even $5 \times 10^{-7}$ depending on the nature of the study and the number of tests
being run. For a given null hypothesis $H_0$, the type I error rate is shown as:
$$P(\text{reject } H_0 \mid H_0 \text{ is true}) \le \alpha$$
where P is the probability of rejecting the null hypothesis when it should instead be
accepted, i.e. a false positive.
Type II statistical errors arise when we accept the null hypothesis when it is in fact false,
giving rise to false-negative associations. The type II error rate is denoted by β and this
corresponds to the probability of not rejecting the null hypothesis when it is false; 1 − β
represents the power of the test. For a given null hypothesis $H_0$, the type II error rate is
shown as:
$$P(\text{do not reject } H_0 \mid H_0 \text{ is false}) = \beta$$
The way to avoid these problems is outlined further in this chapter, and includes good
study planning, and large-scale studies with clear statistical goals and significance
thresholds. In many clinical situations the type I error is potentially serious and a procedure
able to minimize the probability of a type I error is essential. The significance level α is
crucial in controlling the likelihood of a type I error. If you set α at 0.05, it means that you
want the probability of a type I error to be no more than 0.05, i.e. no more than approximately
1 in 20 tests of a true null hypothesis should give a false positive.
Probability (P) values are simply statements of the probability that the
observed differences between two groups could have arisen by chance
As the P value falls, the evidence against the null hypothesis in favor of the alternative
hypothesis becomes stronger. This also increases the likelihood of obtaining the same or
similar results in a second (validation) cohort. The lower the P value, the greater the con-
fidence one can have in the results obtained.
Bonferroni’s correction
Bonferroni’s correction sets the new significance cut-off at α/n where α is the significance
level (usually α = 0.05) and n represents the number of SNPs assayed (or the number of
tests carried out). For example, in the above investigation of 500,000 SNPs, Bonferroni’s
correction sets the new threshold at 0.05/500,000 = $1 \times 10^{-7}$.
Under Bonferroni’s correction many authors prefer to adjust each single P value instead
of the α significance level. Each P value is multiplied by the number of SNPs tested and
called significant if the corrected P value is still under 0.05.
Corrected P value = P value × n (number of SNPs in test) < 0.05
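Both forms of the correction described above (lowering the threshold, or inflating each P value) can be sketched in a few lines of Python. The function names, the hypothetical study of 1,000,000 tests, and the three P values are assumptions made for illustration only; they are not data from the text.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance cut-off alpha / n."""
    return alpha / n_tests


def bonferroni_adjust(p_values, n_tests):
    """Corrected P value = P value x n, capped at 1 because it is a probability."""
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical study: 1,000,000 SNPs tested at alpha = 0.05
n_tests = 1_000_000
threshold = bonferroni_threshold(0.05, n_tests)

# Three hypothetical uncorrected P values
raw_p = [1e-9, 3e-8, 2e-4]
corrected = bonferroni_adjust(raw_p, n_tests)
significant = [p < 0.05 for p in corrected]
print(threshold, corrected, significant)
```

The two forms are equivalent: a raw P value is below α/n exactly when its corrected value is below α.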
Bonferroni’s correction assumes there is independence between markers, and thus that the test
of each SNP (or other genetic variant) against the trait is independent of the others. In the
context of genetic association studies, as many of the SNPs under investigation are located on
the same chromosome and linked through varying degrees of linkage disequilibrium, this
correction has been criticized as excessively conservative, which may lead, especially for
tightly linked SNPs, to loss of power and thus to an increase in the likelihood of a type
II error. For this reason many GWAS are performed in stages, with preliminary
studies applying less stringent P value thresholds in the first round aimed at identifying
potential loci of interest that can then be confirmed in subsequent rounds. This type of
multistage planning is discussed in Chapter 3.
Alternative methods have been proposed in order to avoid the high penalty applied by
using Bonferroni’s correction and to correct for multiple testing. A practical alternative,
for example, is to approximate the type I error rate using a permutation procedure. This
procedure aims to calculate the approximate false-positive rate that can then be used as
the threshold for the P value in the data analysis. The method is conceptually simple, but
is computationally demanding and is beyond the scope of this book, and therefore will not
be discussed further. Other approaches have been proposed, including an estimation of
the false discovery rate and a Bayesian approach. The false discovery rate is the proportion
of positive tests that are false positive and often leads to less stringent thresholds than
Bonferroni’s correction, while the Bayesian approach incorporates the prior probability of
association.
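As an illustration of a false discovery rate approach, the sketch below uses the Benjamini–Hochberg step-up procedure. This particular procedure is not named in the text and is chosen here as an assumption, because it is one widely used way of controlling the false discovery rate; the function name and the ten P values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is declared significant.

    Step-up rule: sort the m P values, find the largest rank k such that
    p_(k) <= (k / m) * q, and reject the k smallest P values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Ten hypothetical P values from ten tests
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
fdr_hits = benjamini_hochberg(p_vals, q=0.05)
bonferroni_hits = [p < 0.05 / len(p_vals) for p in p_vals]
print(sum(fdr_hits), sum(bonferroni_hits))
```

In this made-up example the false discovery rate rule declares two tests significant, whereas a Bonferroni cut-off of 0.05/10 keeps only one, which illustrates the lower penalty.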
Sample relatedness
It is assumed in all case control association studies that cases and controls are unrelated.
Alleles can be identical by state (IBS), i.e. they are the same allele having the same pheno-
type. Alternatively, they can be identical by descent (IBD), i.e. they are the same and share
a common ancestor, and thus have a shared haplotype from one of the parents (Figure 5.3).
In order to evaluate the degree of relatedness of a sample, the pair-wise probability of IBD
between every pair of individuals in the study is calculated. As GWAS uses dense SNP arrays
it is easy to compute pair-wise kinship estimates. In practice, there is no need to
perform analysis of the whole genome. A GWAS SNP data set of only 100,000 markers
will yield stable estimates of kinship coefficients. On average, siblings share zero, one,
and two alleles IBD with probabilities of 25, 50, and 25% at each locus, respectively.
Unrelated individuals do not share alleles IBD. The most commonly used approach to incorporate
locus-specific IBD sharing probabilities into analysis is the pi-hat approach, which is
implemented in the PLINK whole-genome association analysis software. PLINK (pronounced as a
single syllable) is a free, open-source whole-genome association analysis toolset, designed to
perform a range of basic, large-scale analyses in a computationally efficient manner. PLINK can
be used for a range of tests and analyses. Using the --genome option in PLINK, it is possible
to estimate pair-wise IBD to detect pairs of individuals who look more or less different
from each other than we would expect in a random sample. As a rule of thumb a pi-hat
value of 0.95 or greater resulting from pair-wise IBD comparisons is taken as evidence of
relatedness between two individuals. Detailed instructions on producing this data can be found
on the PLINK web page (http://pngu.mgh.harvard.edu/~purcell/plink/).

[Figure 5.3 diagram. Family A: parents A1/A2 and A1/A3; children A1/A3 and A1/A2.
Family B: parents A1/A2 and A2/A3; children A1/A3 and A1/A2.]
Figure 5.3: Identity by state (IBS) versus identity by descent (IBD). The figure illustrates two families: A and B.
Consider a single gene (A) with three possible alleles A1, A2, and A3. The father’s alleles are shown in gray boxes.
In family A, the paternal genotype is A1/A2 and the maternal genotype is A1/A3. The children both inherit the A1
allele. However, this is inherited from the father in the male child and the mother in the female child. The male
child receives the maternal A3 allele and the daughter receives the paternal A2 allele. Therefore, in family A the
male child has the genotype A1/A3 and the female child has the genotype A1/A2. Though both children carry
the A1 allele, they are only IBS for this allele and as a consequence may carry different alleles in linkage
disequilibrium with the A1 allele. In family B, the pattern of inheritance is subtly different. The father has the
same genotype as in family A, but the mother carries the A2 and A3 alleles only. Both children inherit A1 from their
father. Thus, with respect to the A1 allele they are IBD. The two children do not, however, have the same genotype
as they inherit different maternal alleles, creating A1/A2 in the male child and A1/A3 in the female child.
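To make the pi-hat idea concrete, the sketch below computes the expected proportion of alleles shared IBD from the probabilities of sharing zero, one, or two alleles IBD, using pi-hat = P(IBD = 2) + 0.5 × P(IBD = 1), which is how the PLINK --genome output defines its PI_HAT column. The function name and the duplicate-sample example are illustrative assumptions; the sibling and unrelated values come from the text above.

```python
def pi_hat(z0: float, z1: float, z2: float) -> float:
    """Expected proportion of alleles shared IBD by a pair of individuals.

    z0, z1, z2 are the probabilities of sharing 0, 1, and 2 alleles IBD.
    """
    if abs(z0 + z1 + z2 - 1.0) > 1e-9:
        raise ValueError("IBD sharing probabilities must sum to 1")
    return z2 + 0.5 * z1


siblings = pi_hat(0.25, 0.50, 0.25)   # sharing probabilities quoted in the text
unrelated = pi_hat(1.0, 0.0, 0.0)     # unrelated individuals share no alleles IBD
duplicate = pi_hat(0.0, 0.0, 1.0)     # assumed example: the same sample typed twice
print(siblings, unrelated, duplicate)
```

On this scale a full sibling pair is expected to give about 0.5, so a threshold of 0.95 flags only pairs that are nearly identical across the genome.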
Data set
              AA      AC      CC      Total
Observed      50      30      20      100
Expected      42.2    45.5    12.2

p = (2 × 50 + 30)/200 = 0.65        q = (2 × 20 + 30)/200 = 0.35
Pearson’s χ² = 11.6
P value = 0.0008

a, b, and c represent numbers of AA, AC, and CC genotypes, respectively. There is no “b” allele in this illustration;
this represents the AC heterozygotes in the general section of this table only.
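The numbers in the table above can be reproduced with a short calculation. The sketch below assumes that the expected counts are Hardy–Weinberg expectations (n × p², n × 2pq, and n × q²) derived from the observed allele frequencies, which is what the p and q calculations in the table suggest; the function name is a hypothetical choice.

```python
def hwe_chi_square(n_aa: int, n_ac: int, n_cc: int):
    """Allele frequencies, expected genotype counts, and Pearson's chi-square."""
    n = n_aa + n_ac + n_cc
    p = (2 * n_aa + n_ac) / (2 * n)   # frequency of allele A
    q = (2 * n_cc + n_ac) / (2 * n)   # frequency of allele C
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ac, n_cc)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, q, expected, chi2


p, q, expected, chi2 = hwe_chi_square(50, 30, 20)
print(p, q, [round(e, 2) for e in expected], round(chi2, 1))
```

For the observed counts 50, 30, and 20 this gives p = 0.65, q = 0.35, and a χ² of about 11.6, in line with the table.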
Pearson’s χ2 and Fisher’s exact test are used to assess the departure
from the null hypothesis
To consider Pearson’s χ2 test, let us use the data illustrated in the 2 × 3 table in Table 5.2,
which represents data from a single SNP for a standard case control association study. The
total number of genotyped individuals n is represented as ncases and ncontrols for cases and
controls, respectively. The table represents a single SNP with alleles A and C, where nAA
is the total number of AA genotypes observed, nAC is the total number of AC genotypes
observed, and nCC is the total number of CC genotypes observed.
Pearson’s χ² test is used to assess departure from the null hypothesis, which states that cases
and controls have the same distribution of alleles and of genotypes. In this example the test
statistic is compared with the χ² distribution with 2 degrees of freedom (d.f.) for the
2 × 3 contingency table. In this instance there are 2 d.f. because a table with two rows and
three columns has (2 − 1) × (3 − 1) = 2 d.f. The null hypothesis being tested is that there is
no significant difference in genotype distribution between cases and controls.

Table 5.2
              AA      AC      CC      Total
Cases         a       b       c       ncases
Controls      d       e       f       ncontrols
Total         nAA     nAC     nCC     n

In order to understand this test, consider Table 5.2 and focus on the observed value for the AA
genotype in cases (OAA = a). When we apply Pearson’s χ² test, this observed value is compared
with its expected value EAA (the first of the six cells, E1), calculated from an equation
involving the total number of cases (ncases), the total number of AA genotypes (nAA), and the
total number of individuals (n):
$$E_1 = \frac{n_{AA} \times n_{\text{cases}}}{n}$$
The above equation is used to calculate the expected value for each cell i as:
$$E_i = \frac{n_i \times n_{cc}}{n}$$
where $n_i = n_{i,\text{cases}} + n_{i,\text{controls}}$ is the observed number of the genotype
in cell i (the sum of cases and controls with that genotype), $n_{cc}$ is the total number of
cases ($n_{\text{cases}}$) or controls ($n_{\text{controls}}$) according to the row of cell i,
and n is the total number of individuals genotyped ($n_{\text{cases}} + n_{\text{controls}}$).
Each observed value is then compared with its expected value and the full statistical test is
applied:
$$\chi^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i}$$
where the summation is over all six cells in the table and $O_i$ are the observed values (a, b,
c, d, e, and f in Table 5.2) in each cell, while $E_i$ is the expected value calculated according
to the equation above for each genotype in cases and controls.
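The calculation described above can be sketched as follows. The genotype counts are hypothetical and the function names are assumptions for this illustration. For 2 d.f. the upper tail probability of the χ² distribution has the simple closed form exp(−χ²/2), which is used here so that no statistics package is needed.

```python
import math


def pearson_chi_square(table):
    """Pearson's chi-square for a rows x columns table of counts.

    Expected value of each cell = row total x column total / grand total.
    Returns the statistic and its degrees of freedom (rows - 1) x (columns - 1).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def p_value_2df(chi2: float) -> float:
    """Upper tail probability of the chi-square distribution with 2 d.f."""
    return math.exp(-chi2 / 2.0)


# Hypothetical genotype counts (AA, AC, CC) for 100 cases and 100 controls
genotype_table = [[20, 30, 50],
                  [40, 40, 20]]
chi2, df = pearson_chi_square(genotype_table)
print(round(chi2, 2), df, p_value_2df(chi2))
```

With these made-up counts every expected value is at least 30, so the rule of thumb of at least five expected observations per cell is comfortably met.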
              Allele A    Allele C    Total
Cases         a           b           a + b
Controls      c           d           c + d
Total         a + c       b + d       n
Figure 5.4: A simplified form of the χ2 test. This simplified version of the χ2 test is often applied in association
studies. χ2 is calculated from a 2 × 2 table above using values for alleles in boxes for cases and controls allele A
and allele C above (a, b, c, and d here) and the data from sum boxes (containing a + b, c + d, a + c, b + d, and n
here). The 2 × 2 table can also be used to calculate the OR using the values in the boxes containing a, b, c, and d.
Pearson’s χ² test compares the observed and expected genotypes in cases and controls
assuming both cases and controls have the same frequency of genotypes. The test is asymptotic,
implying that the analysis becomes more accurate with larger data sets. A low count
in any of the cells of the table can make this large-sample approximation unreliable. In
practice, an expected value of at least five observations in each cell is regarded as the
minimum needed in order to apply Pearson’s χ² test. Fisher’s exact test is recommended if any
of the cells have values below five.
When using χ² it is important to use raw counts and not percentages or mean values as
some authors have done in error. It should also be noted that some calculations use a 2 × 2
table in a simplified version of Pearson’s χ² test (Figure 5.4). The principles are the same in
this analysis; however, alleles are counted rather than genotypes. In complex systems where
genes have multiple alleles it is often easier to count individuals as either positive
or negative for an allele of interest rather than count genotypes. Homozygosity is ignored
in this model and no assumptions are made about homozygous or heterozygous advantage.
Data from the 2 × 2 table illustrated in Figure 5.4 can be used to calculate χ² using the simple
formula:
$$\chi^2 = \frac{n\,(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}$$
The value of this calculation is that it makes no assumptions about the impact of heterozygotes
versus homozygotes; the calculation is simply based on the number of alleles in
the population. Homozygotes are counted once only. The same 2 × 2 table can be used to
calculate the odds ratio (OR) as:
$$\mathrm{OR} = \frac{a \times d}{c \times b}$$
These two calculations (simplified χ2 and OR) are among the most well-known and fre-
quently used in association studies, especially in the early phases of data analysis.
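Both calculations can be sketched directly from the four cell values a, b, c, and d. The counts below are hypothetical and the function names are assumptions. For 1 d.f. the upper tail probability of the χ² distribution equals erfc(√(χ²/2)), which is available in the Python standard library.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Simplified chi-square: n(ad - bc)^2 / [(a + c)(b + d)(a + b)(c + d)]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = (a x d) / (c x b)."""
    return (a * d) / (c * b)


def p_value_1df(chi2: float) -> float:
    """Upper tail probability of the chi-square distribution with 1 d.f."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts: cases 30 positive / 70 negative, controls 15 / 85
a, b, c, d = 30, 70, 15, 85
chi2_small = chi_square_2x2(a, b, c, d)
or_small = odds_ratio(a, b, c, d)
print(round(chi2_small, 2), round(or_small, 2), p_value_1df(chi2_small))
```

An OR above 1 with a small P value, as in this made-up example, would suggest that the allele is more common in cases than in controls.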
Fisher’s exact test calculates the exact probability (P) of observing the
distribution seen in the contingency table
Fisher’s exact test is computationally more demanding than Pearson’s χ² test and requires
factorial calculations. The problem of working with factorials is that they generate very
large values quickly; consequently, where it is necessary to use them most students will use
computer statistical packages such as R for this type of analysis. However, it is important
to understand how this test works. Therefore, for the sake of clarity, Fisher’s exact
probability for a standard 2 × 2 contingency table is calculated as:
$$P = \frac{r_1!\; r_2!\; c_1!\; c_2!}{n!\; a!\; b!\; c!\; d!}$$
where P is Fisher’s exact probability. The symbol “!” refers to the factorial for the cells and
the marginal values that are identified in the 2 × 2 contingency table in Figure 5.4. In this
table, a, b, c, and d are the observed values and r1, r2, c1, and c2 are the marginal values
(computed as the sum of each column or row depending on position). The letter n denotes
the total number.
An exhaustive explanation of this test is beyond the scope of this book and we advise the
reader to refer to a good statistics book.
Figure 5.4 shows a 2 × 2 contingency table with two alleles A and C in two groups: cases
and controls. Allele counts for the two groups are represented by the values a, b, c, and
d. There is no assumption about the status of homozygous genotypes in this calculation.
The values a + c and b + d are the sum values for the two columns (c1 and c2) and a + b
and c + d are the sum values for the two rows (r1 and r2). The value n is the total for the
whole cohort. The marginal values used in Fisher’s test are c1, c2, r1, and r2 corresponding
to a + c, b + d, a + b, and c + d above, respectively.
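The factorial formula can be evaluated exactly with Python's arbitrary-precision integers, as sketched below. The formula gives the probability of one particular table. The second function then sums the probabilities of all tables with the same marginal values that are no more probable than the observed one, which is one common convention for a two-sided P value and is an assumption here, since the text does not describe this step. The function names and the small example table are hypothetical.

```python
from math import factorial


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """P = r1! r2! c1! c2! / (n! a! b! c! d!) for one 2 x 2 table."""
    r1, r2 = a + b, c + d
    c1, c2 = a + c, b + d
    n = r1 + r2
    numerator = factorial(r1) * factorial(r2) * factorial(c1) * factorial(c2)
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator  # exact integers, converted to float at the end


def fisher_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Sum the probabilities of all tables at most as probable as the observed one."""
    r1, r2, c1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    total = 0.0
    # x runs over every value the top-left cell can take with fixed margins
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        p = table_probability(x, r1 - x, c1 - x, r2 - (c1 - x))
        if p <= p_observed * (1.0 + 1e-9):  # small tolerance for rounding
            total += p
    return min(1.0, total)


# Small hypothetical table: cases 3 / 1, controls 1 / 3
p_table = table_probability(3, 1, 1, 3)
p_two_sided = fisher_two_sided(3, 1, 1, 3)
print(p_table, p_two_sided)
```

For routine work a statistics package remains the practical choice; the sketch is only meant to show what the factorial formula computes.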
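                    Genotype                  Totals
              AA        Aa        aa
Cases         n11       n12       n13         n1·
Controls      n21       n22       n23         n2·
Totals        n·1       n·2       n·3         n

$$T^2 = \frac{\left[\sum_{i=1}^{3} w_i\,(n_{1i}\,n_{2\cdot} - n_{2i}\,n_{1\cdot})\right]^2}{\dfrac{n_{1\cdot}\,n_{2\cdot}}{n}\left[\sum_{i=1}^{3} w_i^2\,n_{\cdot i}\,(n - n_{\cdot i}) - 2\sum_{i=1}^{2}\sum_{j=i+1}^{3} w_i\,w_j\,n_{\cdot i}\,n_{\cdot j}\right]}$$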
[Figure 5.5 panels]
(a) Genotype counts:
              AA      AC      CC
Cases         a       b       c
Controls      d       e       f
(b) Dominant model: allele C increases the risk. Genotypes compared: AA versus AC + CC.
(c) Recessive model: two copies of allele C are necessary to increase the risk. Genotypes
compared: AA + AC versus CC.
(d) Multiplicative model: analyzed by allele.
              A           C
Cases         2a + b      b + 2c
Controls      2d + e      e + 2f
(e) Additive model: the risk of developing the disease is r-fold for the AC genotypes and
2r-fold for the CC genotypes. Genotypes compared: AA, AC, and CC.
Figure 5.5: Cochran–Armitage test for trend. The figure illustrates several different options for analysis of data
from single SNP association studies. Here, allele C is assumed to increase the risk for the disease. The models
applied to the genotype counts in (a) are: dominant (b), recessive (c), multiplicative (d), and additive (e). In the
additive model, the risk in homozygotes is 2r, twice the risk r in heterozygotes, whereas in the multiplicative model
the risk in homozygotes is r². (Adapted from Lewis CM & Knight J
[2012] Cold Spring Harbor Protoc 2012:297–306. With permission from Cold Spring Harbor Laboratory Press.)
where w = (w1, w2, w3) are weights set to detect particular types of association. In genetic
association studies and GWAS, the Cochran–Armitage test is mostly used to examine the
potential additive effect of the alleles. To do this the weights are set as w = (0, 1, 2), which
corresponds to Figure 5.5e, for the additive effect of the allele C on the risk of developing
the disease.
In addition, the Cochran–Armitage test can also be used to test for a potential dominant
(w = 0, 1, 1) or recessive (w = 0, 0, 1) effect of the C allele in Figure 5.5.
[Figure 5.6 plot: x axis, genotype score (0, 1, 2); y axis, cases/(cases + controls), from 0.56 to 0.64.]
Figure 5.6: Cochran–Armitage test for trend in a single SNP case control association study. In this case
we are looking at a single gene with two alleles (alleles 1 and 2). The genotypes are scored as 0, 1, and 2 for
homozygotes allele 1, heterozygotes alleles 1 and 2, and homozygotes allele 2, respectively. The circles indicate
the genotype score for the cases and controls combined together with their least-squares line. Here, the line fits
the data reasonably well as the heterozygote risk estimate is intermediate between the two homozygote risk
estimates, corresponding to an additive genotype risk. (Adapted from Balding DJ [2006] Nat Rev Genet 7:781–791.
With permission from Macmillan Publishers Ltd.)
T² has a χ² distribution with 1 d.f. under the null hypothesis of no association. The
Cochran–Armitage test for trend performs with robust power in the case of an additive
risk of disease developing. Conversely, the power is reduced by deviations from an additive
model. Cochran–Armitage is a more conservative test, minimizing the probability of
incorrectly rejecting the null hypothesis and thus reducing the risk of a type I statistical
error. Cochran–Armitage is also more robust when it comes to departures from HWE
than either Pearson’s χ² or Fisher’s exact probability tests.
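The trend statistic can be sketched as follows, using the usual form of the Cochran–Armitage statistic: the squared weighted sum of (n1i × n2· − n2i × n1·) divided by its variance under the null hypothesis. The genotype counts are hypothetical, and the function names are assumptions for this illustration. Because T² has 1 d.f., its P value is obtained with erfc(√(T²/2)).

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)) -> float:
    """Cochran-Armitage T^2 for genotype counts ordered AA, AC, CC.

    weights (0, 1, 2) test an additive effect, (0, 1, 1) a dominant effect,
    and (0, 0, 1) a recessive effect of the C allele.
    """
    n1, n2 = sum(cases), sum(controls)                      # row totals
    n = n1 + n2
    col = [x + y for x, y in zip(cases, controls)]          # column totals
    k = len(weights)
    t = sum(w * (x * n2 - y * n1) for w, x, y in zip(weights, cases, controls))
    s1 = sum(w * w * c * (n - c) for w, c in zip(weights, col))
    s2 = sum(weights[i] * weights[j] * col[i] * col[j]
             for i in range(k - 1) for j in range(i + 1, k))
    variance = n1 * n2 / n * (s1 - 2 * s2)
    return t * t / variance


def p_value_1df(chi2: float) -> float:
    """Upper tail probability of the chi-square distribution with 1 d.f."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical genotype counts (AA, AC, CC)
case_counts = (20, 30, 50)
control_counts = (40, 40, 20)
t2_additive = cochran_armitage_trend(case_counts, control_counts, (0, 1, 2))
t2_dominant = cochran_armitage_trend(case_counts, control_counts, (0, 1, 1))
print(round(t2_additive, 2), round(t2_dominant, 2), p_value_1df(t2_additive))
```

With the dominant weights (0, 1, 1) the statistic reduces to the simplified 2 × 2 χ² for AA versus AC + CC, which gives a convenient check on the implementation.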
C is rare, with few CC observations in cases and controls. Alternatively, under a recessive
model for allele C, genotypes AA and AC are pooled together (Figure 5.5c). This model
assumes that two copies of allele C are required for increased risk. However, in some cases
risk is additive, meaning that the risk in homozygotes is higher than that for heterozy-
gotes. In an additive model if we assume allele C is the risk allele, the risk of developing
the disease is r for the AC genotypes and 2r for the CC genotypes (Figure 5.5e).
Though the described models and tests are valid methods for analysis of association stud-
ies, these methods should be applied according to a predefined study plan because a ran-
dom application of such tests increases the probability of a false-positive result. These
models are rarely used in the primary analysis of complex genetic disorders, but they may
be employed as a secondary tool to explore the potential mode of inheritance of an associ-
ated SNP or to test a predefined hypothesis.
Multiplicative risk
Under this model, if C is the risk allele, the risk of developing a disease increases by a
factor r for each copy of C, from r for heterozygotes (AC) to r² for homozygotes (CC). Under
the multiplicative model the allele count is then the most powerful method of testing.
The total numbers of A and C alleles in cases and controls are compared, regardless of the
genotypes from which these alleles are constructed (Figure 5.5d).
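The allele count used under the multiplicative model follows Figure 5.5d: each AA individual contributes two A alleles, each CC individual two C alleles, and each AC individual one of each. A minimal sketch with hypothetical genotype counts (the function name is an assumption):

```python
def allele_counts(n_aa: int, n_ac: int, n_cc: int):
    """Return (number of A alleles, number of C alleles) from genotype counts."""
    return 2 * n_aa + n_ac, n_ac + 2 * n_cc


# Hypothetical genotype counts (AA, AC, CC)
case_a, case_c = allele_counts(20, 30, 50)
control_a, control_c = allele_counts(40, 40, 20)

# Allelic odds ratio for the risk allele C relative to allele A
allelic_or = (case_c * control_a) / (case_a * control_c)
print((case_a, case_c), (control_a, control_c), round(allelic_or, 2))
```

The resulting 2 × 2 table of allele counts contains twice as many observations as individuals, and it can be analyzed with the simplified χ² formula given earlier.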
Finally, we must remember that in complex diseases we anticipate the involvement of sev-
eral genetic risk factors that may interact in complex models. We must be cautious about
simple patterns. In fact, most single-gene studies count alleles and make no assumptions
about homozygotes or models. This is especially true for human leukocyte antigen (HLA)
studies where the number of alleles at each locus is extreme.
Logistic regression
Regression is a statistical method employed to determine the dependence between a
response variable (y), known as the dependent variable, and one or several predictors (x),
called the independent variables. In the simplest case, y (the dependent variable) is predicted
by only one x (independent variable) by a linear equation or the equation of a straight line:
y = α + βx + ε
where α is the intercept, β is the regression coefficient (i.e. the slope of the line), and ε
represents the error (i.e. the differences between predicted and observed y values). The
above equation illustrates a simple linear regression model where a line is fitted to the
values of x and y. As infinitely many lines could be drawn through the data, we need to find the
linear equation that best fits all observed values. To achieve this aim we use the least-squares
criterion, which states that the line of best fit for the data is the line for which the sum of
the squares of the vertical distances from the observed points to the line is as small as
possible. This criterion is illustrated in Figure 5.7 with an example taken from population
ecology regarding the distribution of a predator population (e.g. wolves) and their prey (e.g.
hares), where each dot represents an observed value and x is the number of prey observed within
an area and y the corresponding number of predators. In this figure, d represents the difference
between the y coordinate of the data point and the corresponding y coordinate on the line.
The line of best fit is the line which minimizes the sum of the squared differences
($S_{\min}$):
$$S_{\min} = \sum_i d_i^2$$
[Figure 5.7 plot: x axis, prey (1000 to 4000); y axis, predators (100 to 250); d marks the vertical distance of
each point from the fitted line.]
Figure 5.7: Linear regression using a population ecology model. An example from population ecology
that represents the relationship between prey and predator numbers within a studied area, where each dot
represents an observed value; x is the number of prey observed within an area and y is the number of predators,
d represents the difference between the y coordinate of the data point and the corresponding y coordinate
on the line. In this case, an increase in the prey population corresponds with an increase in the number of
predators.
The distance of each point in the scatter from the regression line is known as the residual
or error, denoted d. When all of these residuals are squared and then added together the
total is written ∑d². The “best” straight line is the one with the lowest value of ∑d²,
hence the name “ordinary least squares.” Another example is shown in Figure 5.8, where
a scatter plot illustrates the relationship between body mass index and hip circumference
for a sample of 1500 women in a diet and health cohort study.
[Figure 5.8 plot: BMI (y-axis, 10–40) against Hip circumference (cm) (x-axis, 100–150).]
Figure 5.8: A scatter plot of body mass index (BMI) versus hip circumference. The figure is a scatter plot
based on data from a sample of 1500 women in a diet and health cohort study. The scatter illustrates the
relationship between BMI and hip circumference. The scatter of values appears to be distributed around a
straight line, i.e. the relationship between these two variables appears to be almost linear.
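The least-squares criterion can be sketched in a few lines of Python. This is a minimal
illustration rather than a recommended analysis pipeline: the prey and predator counts are
hypothetical values invented for the example (they are not the data behind Figure 5.7), and
the function names fit_line and sum_squared_residuals are our own. The slope and intercept
come from the standard closed-form least-squares solution for a single predictor.

```python
def fit_line(xs, ys):
    """Return (alpha, beta) of the least-squares line y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx                 # slope (regression coefficient)
    alpha = mean_y - beta * mean_x   # intercept
    return alpha, beta


def sum_squared_residuals(xs, ys, alpha, beta):
    """Sum of the squared vertical distances d between points and the line."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical counts, invented for illustration only
prey = [1000, 1500, 2000, 2500, 3000, 3500, 4000]
predators = [110, 120, 150, 160, 190, 200, 240]

alpha, beta = fit_line(prey, predators)
s_min = sum_squared_residuals(prey, predators, alpha, beta)
print(f"intercept = {alpha:.2f}, slope = {beta:.4f}, sum of d squared = {s_min:.1f}")
```

Any other line, for instance one with a slightly different slope or intercept, gives a larger
sum of squared distances for the same data, which is what makes the fitted line the line of
best fit.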
A simple linear regression model is one with only one independent variable. When we
have more than one independent variable the regression model is called a multiple linear
regression model. This model is an extension of simple regression where the dependent
variable y is predicted by several independent variables (xi) simultaneously:
y = α + β1x1 + β2x2 + … + ε
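This multiple linear regression allows us to include more predictors or covariates in the
analysis, which can reduce the residual variance of y.
In genetic association studies, however, the outcome is usually binary (e.g. case or control),
so a logistic regression model is preferred to a linear regression model. Logistic regression
models the logit of P, where P is the probability of the event (e.g. being a case):
\[ \mathrm{logit}(P) = \log_e\!\left(\frac{P}{1-P}\right) \]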
In this formula, loge is the natural logarithm and the ratio P/(1 − P) is the odds of the
event occurring (not to be confused with the odds ratio, which is introduced below). The
odds can range from 0 to infinity and tell you how much more likely it is that an observation
is a member of the target group rather than a member of the other group. For example, if
the probability P is 0.80, the odds are 0.80/(1 − 0.80) = 0.8/0.2 = 4, i.e. 4 to 1. The
probability P can only range from 0 to 1, but logit(P) can range from negative infinity to
positive infinity. We can therefore express the above equation as a linear regression equation
in which the logit of P, or y (dependent variable), is determined by a linear function of the
independent variables xi (where P is a function of x corresponding to the dependent
variable y):
\[ \mathrm{logit}(y) = \log_e\!\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon \]
The terms y and x represent dependent and independent variables, respectively. Each βi
coefficient measures each xi partial contribution to variation in y. Logistic regression uses
a maximum likelihood method, which maximizes the probability of getting the observed
results given the fitted regression coefficients.
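The link between probability, odds, and the logit can be checked numerically. The sketch
below reproduces the P = 0.80 example from the text and then shows how a linear predictor
on the logit scale maps back to a probability. The coefficients beta0 and beta1 are
hypothetical values chosen for illustration (they are not estimates from any data set), and
the function names are our own.

```python
import math


def odds(p):
    """Odds of an event that has probability p (0 < p < 1)."""
    return p / (1 - p)


def logit(p):
    """Natural logarithm of the odds."""
    return math.log(odds(p))


def inverse_logit(z):
    """Convert a value on the logit scale back into a probability."""
    return 1 / (1 + math.exp(-z))


# Worked example from the text: P = 0.80 corresponds to odds of 4 to 1
p = 0.80
print(f"odds = {odds(p):.2f}, logit = {logit(p):.3f}")

# Hypothetical coefficients for a model with a single predictor x
# (e.g. the number of copies of a risk allele); illustrative values only
beta0, beta1 = -2.0, 0.7
for x in (0, 1, 2):
    z = beta0 + beta1 * x   # linear predictor on the logit scale
    print(f"x = {x}: logit = {z:.1f}, P = {inverse_logit(z):.3f}")
```

Whatever value the linear predictor takes, the inverse logit always returns a probability
between 0 and 1, which is why the logit scale is convenient for a binary outcome.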
We can use a more complicated logistic regression model when additional covariates may
affect the onset of a disease. Examples of such covariates are epidemiological risk factors
(e.g. smoking and gender), clinical variables (e.g. disease severity and age at onset),
population stratification, and other marker loci that have interactive effects on disease risk
(gene–gene interaction or epistasis).
Fortunately, we do not have to calculate any of these mathematical equations by hand.
Computer software is easily available, but, once again, it is important to understand the
principles and how we get from a regression formula to a line to the logistic analysis.
Odds Ratio (OR)
In order to interpret the logit coefficients, however, we need to introduce the concept of
the OR. The OR is a measure of the strength of an association between two variables and
is widely used in case control studies. In a genetic association study, OR values give us use-
ful information for understanding the relationship between alleles and disease risk, and we
can define the OR as the ratio of the odds of bearing an allele in the group with the disease
(cases) to the odds of carrying the same allele in the group without the disease (controls).
In other words, OR estimates the risk of a disease or trait for a specific allele.
In a standard 2 × 2 table, the OR can be expressed as the cross-products ratio (ad/bc), as
seen in Table 5.3 (also discussed in Chapter 3). In this calculation a, b, c, and d are the
frequencies of the alleles in cases and controls. OR values range from 0 to infinity. OR
values close to 1 indicate no relationship between the allele and the trait, while OR values
of less than 1 suggest a protective effect and values greater than 1 suggest an increased risk.
Table 5.3 A standard 2 × 2 table with allele frequencies shown as a, b, c, and d (this table would be
used for calculation of a simple χ² or OR).
               Cases   Controls
Risk allele    a       b
Other allele   c       d
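The cross-products ratio can be worked through with a small example. The allele counts
below are hypothetical and chosen only to make the arithmetic easy to follow; the layout of
a, b, c, and d follows Table 5.3, and the function name odds_ratio is our own.

```python
def odds_ratio(a, b, c, d):
    """Cross-products ratio (ad/bc) for a 2 x 2 table laid out as in Table 5.3."""
    return (a * d) / (b * c)


# Hypothetical allele counts, invented for illustration only
a, b = 240, 160   # risk allele: cases, controls
c, d = 360, 440   # other allele: cases, controls

odds_cases = a / c       # odds of carrying the risk allele among cases
odds_controls = b / d    # odds of carrying the risk allele among controls

print(f"odds in cases = {odds_cases:.3f}, odds in controls = {odds_controls:.3f}")
print(f"OR = {odds_ratio(a, b, c, d):.3f}")
```

The ratio of the two odds and the cross-products ratio are the same quantity written in two
ways. In this invented example the OR is greater than 1, which would suggest an increased
risk associated with the allele.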
\[ \mathrm{OR} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon} \]
or
mutations, and natural selection (less common) act on human populations, and thus vari-
ations in SNP frequencies are seen between the major population groups (this is discussed
in Chapter 1 in some detail). Association tests are statistical assays that detect differences
in SNP or other genetic marker frequencies between cases and controls. Variation within
and between populations may lead to false-positive associations. An example is a SNP in
the complement factor H gene (CFH) which alters the susceptibility to age-related macu-
lar degeneration (AMD). The frequency of the AMD protective A allele in the Yoruba
population of sub-Saharan Africa is four-fold higher than in Europeans. Similarly, another
SNP in the complement system genes, the factor B gene (CFB), shows a higher frequency
in Africans than Europeans. These differences may account for differences in susceptibility
to AMD in these populations.
Recent studies have reported an increased risk of Alzheimer’s disease associated with
genetic variation in two SNPs (rs11136000 and rs3818361) of the complement pathway
genes for clusterin (CLU) and for complement receptor 1 (CR1). The CR1 risk allele is
frequent in the European population, but rare in African and Asian populations, while
the CLU risk allele is much more frequent within Asian than African populations. It is
not known whether this variation is a consequence of genetic drift or natural selection. A
meta-analysis including white, African-American, Israeli–Arab, and Caribbean Hispanic
individuals found an association of polymorphisms in CR1 and CLU with Alzheimer’s
disease only in populations of European ancestry. In addition, some mutations may be
completely absent from some population groups, e.g. the CARD15 mutations that are
present in more than 30% of European Crohn’s disease patients are more or less absent in
Asian populations.
Meta-analysis
Meta-analysis is a means of combining data from previously published studies to address
a specific question. Recently, it has been applied to studies of the genetics of common
disease and will be used more often as the number of GWAS grows. Meta-analysis has the
advantage of being able to identify and confirm weak associations because, with the large
numbers of samples included, the statistical power of these studies is considerably higher
than that of any individual center or single collaborative study. Meta-analysis is frequently
used in clinical drug trials and in the analysis of psychological disorders.
Genomic control
In the presence of population stratification, regular methods for testing association, such
as Pearson’s χ² test and the Cochran–Armitage test for trend, could produce excess
false positives compared with the nominal significance levels. A popular method used in
the presence of population stratification is genomic control. This method aims to control
for population stratification by first estimating an inflation factor and then adjusting all of
the test statistics. The logic behind genomic control is that the test statistic for association
\[ \lambda = \frac{\mathrm{median}(\chi^2)}{0.456} \]
An inflation factor λ of 1 indicates that no population stratification is detected within
the samples. We can use this value of λ to rescale all of the test statistics in a downward
direction. In practice, the test statistic of each SNP is divided by the λ value and the
corresponding adjusted P value can then be re-estimated:
\[ \chi^2_{\mathrm{adjusted}} = \frac{\chi^2}{\lambda} \]
It is important to note that the λ denoted above is not the same as the λ value used in
calculating disease incidence and prevalence (sibling relative risk).
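The two steps of genomic control, estimating λ and rescaling the test statistics, can be
sketched as follows. The χ² values are hypothetical and far fewer than the many thousands
used in a real GWAS, and the function names are our own. The P value for a χ² statistic with
1 d.f. is computed with the complementary error function, which is an exact identity for
1 d.f. but is our choice of method rather than something specified in the text.

```python
import math
from statistics import median

MEDIAN_CHI2_1DF = 0.456   # denominator used in the formula in the text


def inflation_factor(chi2_values):
    """Estimate lambda as the median observed statistic divided by 0.456."""
    return median(chi2_values) / MEDIAN_CHI2_1DF


def p_value_1df(chi2):
    """Upper-tail P value of a chi-square statistic with 1 d.f."""
    return math.erfc(math.sqrt(chi2 / 2))


def genomic_control(chi2_values):
    """Return lambda, the rescaled statistics, and their adjusted P values."""
    lam = inflation_factor(chi2_values)
    adjusted = [x / lam for x in chi2_values]
    return lam, adjusted, [p_value_1df(x) for x in adjusted]


# Hypothetical test statistics, one per SNP (illustrative values only)
chi2_values = [0.02, 0.15, 0.40, 0.57, 0.90, 1.60, 12.5]

lam, adjusted, p_values = genomic_control(chi2_values)
print(f"lambda = {lam:.2f}")
for raw, adj, p in zip(chi2_values, adjusted, p_values):
    print(f"chi2 = {raw:5.2f}  adjusted = {adj:5.2f}  adjusted P = {p:.4f}")
```

Because every statistic is divided by the same λ, the ranking of the SNPs is unchanged; only
the apparent significance of each result is reduced.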
Structured association
SNP genotypes that are not associated with disease can also be used in structured
association analysis. This method has been implemented in the STRUCTURE software
(http://pritchardlab.stanford.edu/structure.html) and employs genotype data from
the study to determine population structure in a manner similar to that used in the
genomic control analysis above. Predictive SNPs are those known to be associated with
a specific ethnic population or other population subgroup. The software then performs
tests for association within each inferred subpopulation. STRUCTURE may also be
used to identify individuals who do not cluster with the majority of the samples; these
samples can then be eliminated from the analysis.
PCA
Another analytical method often incorporated into GWAS to scan for population structure
is principal components analysis (PCA). The basic idea behind PCA is to describe the data
in terms of principal components rather than on normal x–y axes. Principal components
are the directions in which the data are most spread out, i.e. the directions of greatest
variance. If we are able to find those directions, we can recognize structures in the data.
Although in practice the principal components are found mathematically through
eigenvectors and eigenvalues, we can imagine plotting our data as a scatter plot and using
PCA to find the direction in which there is most variance. PCA will first determine the
straight line through the points that captures the most variance among the data, in a
manner similar to linear regression. This inferred line is used as the first principal axis or
principal component, and a second principal component is then generated at right angles
to it. This process occurs in multidimensional space with one dimension for each of the
variables (i.e. SNPs) included. At the end of the process PCA will have condensed the data
from a mass of variables into a set of compound axes. The first axis explains the most
variation, then the second, and so on. Once the principal components are inferred,
individuals can be plotted according to their values on the principal components and any
structure in the dataset can then be recognized.
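The idea of finding the direction of greatest variance can be illustrated with a toy example.
The sketch below uses power iteration to find the first principal component of a small
genotype matrix; this is a simple technique chosen for the illustration and is not
necessarily how GWAS software performs PCA. The genotypes (coded 0, 1, or 2 copies of an
allele at three SNPs in six individuals) are hypothetical, and all function names are our own.

```python
import math


def center(rows):
    """Subtract the mean of each column (SNP) from every row (individual)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance between every pair of columns."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and rescale to unit length.

    Assumes the starting vector is not perpendicular to the leading direction.
    """
    dims = len(cov)
    vec = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        nxt = mat_vec(cov, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    variance = sum(v * w for v, w in zip(vec, mat_vec(cov, vec)))
    return vec, variance


# Hypothetical genotypes: rows are individuals, columns are SNPs
genotypes = [
    [0, 0, 1], [0, 1, 0], [1, 0, 2],   # individuals 1-3
    [2, 2, 1], [2, 1, 0], [1, 2, 2],   # individuals 4-6
]

centered = center(genotypes)
cov = covariance_matrix(centered)
pc1, variance_pc1 = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))

# Position of each individual along the first principal component
scores = [sum(v * w for v, w in zip(row, pc1)) for row in centered]
print(f"share of variance on PC1 = {variance_pc1 / total_variance:.2f}")
print("scores:", [round(s, 2) for s in scores])
```

In this invented data set the first three individuals fall on one side of the first principal
component and the last three on the other, which is the kind of grouping that would be
examined as possible population structure.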
PCA can also be used in a principal components adjustment analysis to control for type
I errors. Association tests performed on data affected by population stratification can be
adjusted by performing PCA on different subsets of variants (or SNPs). We can carry out
a PCA using only common variants (minor allele frequencies greater than 5%), only
low-frequency variants (minor allele frequencies between 1 and 5%), only rare variants
(minor allele frequencies of less than 1%), or a combination of the above. Then we can test
whether population stratification is present and then perform association inferences on
the subset. PCA based on either common variants or low-frequency variants seems to
provide effective control for population stratification, while PCA based on rare variants
performs less well. PCA based on low-frequency variants seems to adjust better than PCA
based on common variants, but its use could result in over-adjustment, producing a
substantial loss of power, especially in the absence of population stratification.
There are several diagnostic plots that can be used for the
visualization of genome-wide association results
Quantile–quantile plots
In statistics, a quantile–quantile (Q–Q) plot is a plot for comparing probability distri-
butions by plotting their quantile values. It is an effective graphical method to determine
if the two distributions come from the same or two different statistical populations. This
statistical tool is widely used in GWAS to assess the significance of observed associations
(y-axis) compared with the expectations under no association (x-axis).
[Figure 5.9 plots: panels (a) and (b), each showing Observed χ² (y-axis, 0–50) against Expected χ² (x-axis, 0–50).]
Figure 5.9: Hypothetical Q–Q plots in GWAS. The figure illustrates two different Q–Q plots. The plot compares
probability distributions using their quantile values. This statistical tool is widely used in GWAS to assess the
significance of observed associations comparing observed values (on the y-axis) and expected values (on the x-axis).
Each dot represents a hypothetical SNP, while the line is the expected null distribution. Plot (a) illustrates
strong deviation from the expected values and therefore either strong genetic association of the trait under study
with SNPs in a heavily genotyped region or spurious associations. Plot (b) indicates very little deviation from the
expected values and may suggest cryptic errors in the study population or in the genotyping, or true association.
In Figure 5.9(a and b), the observed association statistics such as χ² or the calculated
−log10P values for each SNP are ranked from smallest to largest and plotted against the
distribution that would be expected under the null hypothesis of no association. If the
two compared distributions are similar, the points in the Q–Q plot will lie approximately
on the null or identity line (denoted y = x), indicating that no association is detected for
any SNP. These plots help to indicate whether the study has generated more significant
results than expected by chance. Deviations from the identity line indicate either some
bias in the observed (assumed) distribution or strong true associations.
As the underlying assumption in GWAS is that the vast majority of assayed SNPs are
not associated with the trait, strong deviations from the null hypothesis suggest either a
strongly associated locus, a heavily genotyped locus (Figure 5.9a) (i.e. an associated gene
with many genotyped SNPs), significant undetected differences in the population struc-
ture (Figure 5.9b), cryptic relatedness in the study population, or errors in genotyping. A
clean Q–Q plot (Figure 5.9a and b) should show a solid distribution matching the y = x
line until it sharply curves at the end, which represents the small number of true associa-
tions among the many thousands of genotyped SNPs.
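The construction of a Q–Q plot can be sketched by computing the pairs of expected and
observed values that would be plotted. In the sketch below the expected P value for the SNP
of a given rank is taken as (rank − 0.5)/n, which is one common plotting convention and an
assumption on our part; the P values are hypothetical and the function name qq_points is
our own.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs, smallest to largest."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Expected values under the null hypothesis of no association,
    # using the (rank - 0.5) / n plotting convention (an assumption)
    expected = sorted(-math.log10((rank - 0.5) / n) for rank in range(1, n + 1))
    return list(zip(expected, observed))


# Hypothetical P values spread evenly between 0 and 1 (no association)
null_like = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

# The same set with one value replaced by a strongly associated SNP
with_signal = null_like[:-1] + [1e-8]

for label, values in (("null", null_like), ("signal", with_signal)):
    print(label)
    for expected, observed in qq_points(values):
        print(f"  expected = {expected:.3f}  observed = {observed:.3f}")
```

For the first set every point lies on the identity line, whereas in the second set the final
point rises well above it, which is the sharp curve at the end of the plot described in the text.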
Manhattan plots
Named after the Manhattan skyline, these plots are widely used to visualize data from
GWAS. This type of scatter plot is particularly useful for displaying a large number of
data points. A GWAS Manhattan plot shows the −log10P value of each SNP on the y-axis
against its genomic coordinate on the x-axis (Figure 5.10). Each dot in the plot represents
a different SNP, with the alternating bands of shading representing the different
chromosomes plotted along the x-axis. The y-axis indicates the strength of the association
between a SNP and the trait under study. As the strongest association carries the smallest
P value, it will be the highest ranked and will appear at the top of the plot. A good
Manhattan plot from a robust association study should show each of the highest ranked
SNPs rising on a column of other associated SNPs (dots) from the same chromosomal
location and haplotype. This is the result of linkage disequilibrium between nearby SNPs,
all of which will have the same or similar association signals. A highly ranked SNP that
appears as a single isolated dot, without such a column beneath it, shows a pattern typical
of genotyping errors and this may bring into question any reported association.
[Figure 5.10 plot: −log10P (y-axis, 0–15) against Chromosome (x-axis, 1–22 and X).]
Figure 5.10: Manhattan plot displaying findings from a GWAS. Each chromosome can be identified
sequentially (1–22, then X) in alternating light and dark gray coloring. The y-axis shows the −log10P. The
significance threshold is set at 10⁻⁵ for moderate associations. Each tagged SNP is represented by a dot
distributed according to the −log10P along the 22 autosomes and the X. P is for 1 d.f. because with each SNP
tested individually there are two possibilities. Points above the dotted line indicate SNPs with P < 10⁻⁵. Values
of P < 10⁻⁵ or P < 10⁻⁷ are significant; the larger the negative exponent of 10, the more significant the
observation. (From The Wellcome Trust Case Control Consortium [2007] Nature 447:661–678.
With permission from Macmillan Publishers Ltd.)
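The coordinates used in a Manhattan plot can be sketched without any plotting library. The
example below places each SNP at a cumulative genomic position on the x-axis, converts its
P value to −log10P for the y-axis, and lists the SNPs that pass the 10⁻⁵ threshold mentioned
for Figure 5.10. The chromosome lengths, positions, and P values are toy values invented
for the illustration, and the function name manhattan_coordinates is our own.

```python
import math

THRESHOLD = 1e-5   # threshold quoted in the Figure 5.10 caption


def manhattan_coordinates(results, chrom_lengths):
    """Convert (chromosome, position, P) records into (x, y) plot coordinates.

    chrom_lengths is an ordered list of (chromosome, length) pairs; each
    chromosome is placed directly after the previous one on the x-axis.
    """
    offsets = {}
    running_total = 0
    for chrom, length in chrom_lengths:
        offsets[chrom] = running_total
        running_total += length
    return [(offsets[chrom] + pos, -math.log10(p)) for chrom, pos, p in results]


# Toy chromosome lengths and association results (illustrative values only)
chrom_lengths = [("1", 1000), ("2", 800), ("3", 600)]
results = [
    ("1", 100, 0.2),
    ("2", 50, 3e-7),
    ("2", 60, 8e-6),
    ("3", 10, 0.04),
]

points = manhattan_coordinates(results, chrom_lengths)
hits = [record for record in results if record[2] < THRESHOLD]

for (x, y), record in zip(points, results):
    print(f"chromosome {record[0]}: x = {x}, -log10P = {y:.2f}")
print("SNPs passing the threshold:", hits)
```

In this toy example the two SNPs that pass the threshold are neighbors on the same
chromosome, which is the column pattern expected for a genuine association signal.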
                     Locus 2
Locus 1              B      b      Totals for rows
A                    fAB    fAb    fA
a                    faB    fab    fa
Totals for columns   fB     fb     1
Calculating D
D is a measure of linkage disequilibrium, defined as the difference between the observed
frequency of a two-locus haplotype and the frequency that we would expect if the alleles
segregated randomly. Suppose that fAB is the observed frequency of the haplotype AB
formed by the two alleles A and B. The expected frequency of haplotype AB is the product
of the frequencies of the two alleles:
fAB = fA × fB
If the alleles are in linkage equilibrium, the observed frequency of AB in the test population
will be the same as that expected. However, if they are in linkage disequilibrium, the
observed haplotype frequencies will depart from the expected frequencies by an amount D:
fAB = fA × fB + D
faB = fa × fB – D
fAb = fA × fb – D
fab = fa × fb + D
The value of D can be obtained from any one of the above equations as all will result in the
same value of D. D can also be calculated by combining the above equations:
D = fAB × fab – fAb × faB
When D = 0 there is no linkage disequilibrium. The method can be further illustrated
by considering a population where the two loci above show the following observed
haplotype frequencies:
fAB = 0.85
fAb = 0.05
faB = 0.05
fab = 0.05
The frequencies of each allele are first calculated from the observed data as:
fA = fAB + fAb = 0.85 + 0.05 = 0.9
fa = faB + fab = 0.05 + 0.05 = 0.1
fB = fAB + faB = 0.85 + 0.05 = 0.9
fb = fAb + fab = 0.05 + 0.05 = 0.1
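Using the haplotype and allele frequencies above, D can be calculated in two of the ways
described: from the difference between the observed and expected frequency of haplotype
AB, and from the cross-product of the haplotype frequencies. The short sketch below simply
carries out this arithmetic; the variable names are our own.

```python
# Observed haplotype frequencies from the worked example in the text
f_AB, f_Ab, f_aB, f_ab = 0.85, 0.05, 0.05, 0.05

# Allele frequencies are the row and column totals
f_A = f_AB + f_Ab
f_a = f_aB + f_ab
f_B = f_AB + f_aB
f_b = f_Ab + f_ab

# D as observed minus expected frequency of haplotype AB
d_from_definition = f_AB - f_A * f_B

# D from the cross-product of the haplotype frequencies
d_from_cross_product = f_AB * f_ab - f_Ab * f_aB

print(f"fA = {f_A:.2f}, fa = {f_a:.2f}, fB = {f_B:.2f}, fb = {f_b:.2f}")
print(f"D (observed - expected) = {d_from_definition:.4f}")
print(f"D (cross-product)       = {d_from_cross_product:.4f}")
```

Both routes give D = 0.04 (0.85 − 0.9 × 0.9 and 0.85 × 0.05 − 0.05 × 0.05), so the two
loci in this example are in linkage disequilibrium.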
\[ D' = \frac{D}{D_{\max}} \]
\[ r^2 = \frac{D^2}{f_A f_a f_B f_b} \]
where fA, fa, fB, and fb are the frequencies of each allele, and D is the linkage disequilibrium
measure introduced above. r² ranges between 0 (loci are in complete linkage equilibrium)
and 1 (loci are in complete linkage disequilibrium).
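The two standardized measures can be computed for the same example population
(fAB = 0.85, fAb = faB = fab = 0.05). In this sketch Dmax is taken from the standard
definition, the smaller of fA × fb and fa × fB when D is positive and the smaller of
fA × fB and fa × fb when D is negative, and D′ is reported as an absolute value; both choices
are assumptions on our part. The function name ld_measures is our own.

```python
def ld_measures(f_AB, f_Ab, f_aB, f_ab):
    """Return (D, D', r squared) from four haplotype frequencies.

    Assumes both loci are polymorphic, so that no allele frequency is zero.
    """
    f_A, f_a = f_AB + f_Ab, f_aB + f_ab
    f_B, f_b = f_AB + f_aB, f_Ab + f_ab
    d = f_AB * f_ab - f_Ab * f_aB
    # Largest value of |D| that is possible for these allele frequencies
    if d >= 0:
        d_max = min(f_A * f_b, f_a * f_B)
    else:
        d_max = min(f_A * f_B, f_a * f_b)
    d_prime = abs(d) / d_max
    r_squared = d ** 2 / (f_A * f_a * f_B * f_b)
    return d, d_prime, r_squared


# The example population from the text
d, d_prime, r_squared = ld_measures(0.85, 0.05, 0.05, 0.05)
print(f"D = {d:.4f}, D' = {d_prime:.3f}, r squared = {r_squared:.3f}")

# A hypothetical population in which only haplotypes AB and ab exist
print(ld_measures(0.6, 0.0, 0.0, 0.4))
```

When only two of the four possible haplotypes are present, as in the second call, both D′ and
r² reach their maximum, corresponding to complete linkage disequilibrium.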
If the value of r² between two loci equals 1, the markers at the two loci are providing
the same information. Consequently, the genotype at one SNP is directly predictive of the
genotype at the other, and genotyping either SNP will result in the same P value. In the
light of this it is sometimes reasonable to genotype only one of the two SNPs. This can be
helpful if specific SNP sequences are difficult to genotype. A surrogate SNP used in this
manner is often referred to as a tag SNP.
The relationship between D′ and r²
D′ and r² are both measures that can be written in terms of allele frequencies. To illustrate
this, consider the case in which D ≥ 0 and fA ≥ fB. In this case, Dmax = fa fB (because
fA × fb ≥ fa × fB), so that:
\[ D' = \frac{D}{f_a f_B} \]
We can rearrange the above equation to give D = D′ × fa fB and substitute this into the
equation for r², thus:
\[ r^2 = (D')^2 \times \frac{f_a f_B}{f_A f_b} \]
The above equation describes the relationship between D′, r², and allele frequencies
when D ≥ 0 and fA ≥ fB. The upper boundary of r² is (D′)², which is reached only when
fA = fB. The implication of this is that D′ sets the upper limit of r². D′ is a commonly used
measure of recombination and gives information on the physical extent of useful linkage
disequilibrium. If we know the value of D′ we can calculate the maximum possible r². For
example, if a recombination point results in D′ = 0.7, the maximum possible r² for these
alleles would be 0.49.
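The relationship between the two measures, which holds when D ≥ 0 and fA ≥ fB, can be
explored numerically. The sketch below uses D′ = 0.7 as in the example in the text; the
allele frequencies are hypothetical and the function name r2_from_d_prime is our own.

```python
def r2_from_d_prime(d_prime, f_A, f_B):
    """r squared implied by D' and the allele frequencies.

    Valid only for D >= 0 and f_A >= f_B, the case considered in the text.
    """
    f_a, f_b = 1 - f_A, 1 - f_B
    return d_prime ** 2 * (f_a * f_B) / (f_A * f_b)


d_prime = 0.7

# Equal allele frequencies: r squared reaches its maximum, the square of D'
print(f"fA = fB = 0.6:      r squared = {r2_from_d_prime(d_prime, 0.6, 0.6):.4f}")

# Unequal (hypothetical) allele frequencies: r squared falls below that maximum
print(f"fA = 0.8, fB = 0.6: r squared = {r2_from_d_prime(d_prime, 0.8, 0.6):.4f}")
```

With equal allele frequencies the ratio fa fB/(fA fb) is 1 and r² equals 0.49; as the allele
frequencies at the two loci diverge the ratio falls below 1 and r² falls with it, even though
D′ is unchanged.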
Once phased, the inferred haplotype can be assessed for associations using the same statis-
tical methods as described previously. However, there are problems with haplotype-based
analyses. For example, inclusion of rare haplotypes in studies increases the number of tests
performed and reduces the statistical power of the testing. A common solution to this
problem is to combine all of the rare haplotypes into a single category. By doing this we
preserve the data and maintain the statistical power of the study. However, this could also
introduce problems because individual rare haplotypes may be important in disease and
these associations may be overlooked when clustered in this way.
Conclusions
In complex disease it is important to incorporate statistics into study design and to do so
at the earliest moment. Deciding what analysis packages and procedures to use after a study
has been performed is not good practice and may prejudice the final conclusions. Data
handling in studies of the genetics of complex disease has become complicated. This is the
era of computerized science. Without computers it would not be possible to handle the
data in the way we do. The Human Genome Project would have been impossible and
the future would simply not exist for this type of science. Fortunately, it does exist and
statistical analysis on various computer platforms helps us to achieve our goals. This is also
the era of collaboration when competing centers work together for the greater good, by
pooling samples so that statistical thresholds can be met and statistical power enhanced.
Meta-analysis is the natural extension of this collaboration.
When looking at data analysis in complex disease we can start with calculating LOD
scores in linkage studies and calculating ORs in association studies. Of course we need
to consider the problems with false-positive and false-negative findings and the impact of
sample size (large-scale studies) on statistical power. We need to be aware of sample relat-
edness, case call rate, and SNP call rate before we begin any study.
Whether we are testing allelic polymorphisms in a single gene or the whole genome we can
use Pearson’s χ² test, Fisher’s exact probability test, and the Cochran–Armitage test. There
are different models in the Cochran–Armitage test. The next step in complexity is logistic
regression or linear regression analysis. These are complex tests, and researchers are
advised to use appropriate computer software and to refer to statistical genetics experts for
guidance and help with their studies; nevertheless, a little basic knowledge of the methods
is also advised. It is also important to understand the limitations and potential pitfalls
of different types of study, and to consider how those pitfalls may be overcome by good
study design and the use of advanced analytical techniques. Another thing to consider is
how we present our data, e.g. whether we should use Q–Q plots or Manhattan plots.
Further Reading
Books
Strachan T & Read AP (2011) Human Molecular Genetics, 4th ed. Garland Science. Chapter 16 deals with identifying human genes and susceptibility factors, and provides excellent additional data for students wishing to delve deeper into medical genetics. There is also additional relevant information in the other chapters.

Articles
Armitage P (1955) Tests for linear trends in proportions and frequencies. Biometrics 11:375–386.
Balding DJ (2006) A tutorial on statistical methods for population association studies. Nat Rev Genet 7:781–791.
Barrett JC, Fry B, Maller J & Daly MJ (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21:263–265.
Chapman JM, Cooper JD, Todd JA & Clayton DG (2003) Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered 56:18–31.
Clarke GM, Anderson CA, Pettersson FH et al. (2011) Basic statistical analysis in genetic case-control studies. Nat Protoc 6:121–133.
Clayton DG, Walker NM, Smyth DJ et al. (2005) Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet 37:1243–1246.
Collins FS & McKusick VA (2001) Implications of the human genome project for medical science. JAMA 285:540–544. This is a very important review that provides the reader with a guide to and an illustration of the potential for understanding complex disease in the immediate post-genome period.
de Bakker PI, Yelensky R, Pe’er I et al. (2005) Efficiency and power in genetic association studies. Nat Genet 37:1217–1223.
Dudbridge F & Gusnanto A (2008) Estimation of significance thresholds for genome-wide association scans. Genet Epidemiol 32:227–234.
Epstein MP, Allen AS & Satten GA (2007) A simple and improved correction for population stratification in case control studies. Am J Hum Genet 80:921–930.
Ermini L, Wilson IJ, Goodship TH & Sheerin NS (2012) Complement polymorphisms: geographical distribution and relevance to disease. Immunobiology 217:265–271.
Ewens WJ & Spielman RS (1995) The transmission/disequilibrium test: history, subdivision, and admixture. Am J Hum Genet 57:455–464.
Hill WG & Robertson A (1968) The effects of inbreeding at loci with heterozygote advantage. Genetics 60:615–628.
Hirschhorn JN, Lohmueller K, Byrne E & Hirschhorn K (2002) A comprehensive review of genetic association studies. Genet Med 4:45–61.
Ke X, Hunt S, Tapper W et al. (2004) The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum Mol Genet 13:577–588.
Lewis CM & Knight J (2012) Introduction to genetic association studies. Cold Spring Harbor Protoc 2012:297–306.
Lohmueller KE, Pearce CL, Pike M et al. (2003) Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet 33:177–182. This is a useful text on meta-analysis in common complex disease.
McIntosh I, Dunston JA, Liu L et al. (2005) Nail patella syndrome revisited: 50 years after linkage. Ann Hum Genet 69:349–363. This is a useful update on nail–patella syndrome genetics.
Morton NE (1956) The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 8:80–96.
Price AL, Patterson NJ, Plenge RM et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904–909.
Pritchard JK, Stephens M & Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959.
Purcell S, Neale B, Todd-Brown K et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–575.
Sasieni PD (1997) From genotypes to genes: doubling the sample size. Biometrics 53:1253–1261.
Schaid DJ & Jacobsen SJ (1999) Biased tests of association: comparisons of allele frequencies when departing from Hardy–Weinberg proportions. Am J Epidemiol 149:706–711.
Stephens M & Scheet P (2005) Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am J Hum Genet 76:449–462.
Snyder LH (1932) Studies in human inheritance IX. The inheritance of taste deficiency in man. Ohio J Sci 32:436–468. This paper is considered to be the first direct pharmacogenetics study.
The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073. This is a very exciting project with the promise of more yet to come.
The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57–74. This is an invaluable research paper for those starting out on the investigation of the genetics of human disease.
The Human Genome (2001) Nature 409:813–958. The complete issue of Nature mostly dedicated to the Human Genome Map with multiple papers, letters, and editorial commentary from multiple authors. It is full of useful insight and critical discussion. Any student of human genetics should read this issue.
The International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437:1299–1320.
The International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851–861.
The International HapMap 3 Consortium (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467:52–58. This is essential reading for those involved in complex disease genetics.
The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678. This is another landmark paper highlighting the development and application of new technologies (GWAS) in complex disease research, and it provides a mass of useful information for those studying complex disease.
Wang WY, Barratt BJ, Clayton DG & Todd JA (2005) Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6:109–118.
Wittke-Thompson JK, Pluzhnikov A & Cox NJ (2005) Rational inferences about departures from Hardy–Weinberg equilibrium. Am J Hum Genet 76:967–986.
Zhang K, Calabrese P, Nordborg M & Sun F (2002) Haplotype block structure and its applications to association studies: power and study designs. Am J Hum Genet 71:1386–1394. This is a good paper to read to understand how haplotypes can be used in GWAS.
Zhang Y, Guan W & Pan W (2013) Adjustment for population stratification via principal components in association analysis of rare variants. Genet Epidemiol 37:99–109.

Online sources
http://hapmap.ncbi.nlm.nih.gov
This is the site of the results of the International HapMap project.
http://pngu.mgh.harvard.edu/~purcell/plink
PLINK is a freely available, open-source, whole-genome association analysis toolset designed for quality and analysis of GWAS data.
http://pritchardlab.stanford.edu/structure.html
This site includes STRUCTURE software referred to in the text.

This is an essential site for updates and information on genetic disease of all types.

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi
This site provides information about millions of validated reference SNP clusters in the genome.
http://www.ncbi.nlm.nih.gov/SNP
This is an essential site for updates on SNPs.
www.r-project.org
This is another freely available and excellent package for statistical computing that can be used with or without existing commercial packages such as SPSS or SAS.
The extended major histocompatibility complex (xMHC) located on the short arm of
chromosome 6 (6p21.3) is perhaps the most complex genetic system in the human genome
with 252 expressed genes in a region of approximately 7.6 Mb of DNA. The MHC is so
named because it encodes genes with vital functions in determining the compatibility of tis-
sues. In particular, it encodes the human leukocyte antigens (HLAs), matching of which is
vital in some, but not all, forms of clinical transplantation. JJ van Rood, one of the pioneers
of immunogenetics, suggested that all discoveries go through three phases: the discovery
itself, the development of tools with which to investigate the discovery, and the final expo-
sure of the discovery (the latter of which he likened to the charting of a new continent). The
scientific process is an iterative cycle of discovery, hypothesis generation, and testing, con-
tinually looking back. This is particularly appropriate when we consider the human MHC.
This chapter will briefly examine the discovery of the MHC, the complex naming of HLA
antigens and alleles, the structure and normal function of the products of the HLA alleles,
and how genetic variation in HLA may determine susceptibility and resistance to disease
using a number of key examples. We will also consider a selection of other non-HLA
MHC-encoded immunoregulatory genes and discuss how they may also play a role in
susceptibility to complex disease.
Selecting disease examples for this section is a very difficult task because there are many
diseases with genetic associations with the MHC. Genetic associations with HLA were
first reported in 1967 by Amiel, who reported an association between HLA and non-
Hodgkin’s lymphoma. Later, in the 1970s, Ceppellini et al. reported an association
between endemic malaria and HLA in four Sardinian villages. Reports of other associa-
tions quickly followed: HLA-B8 and celiac disease by Stokes et al. in 1972, HLA-A3 and
multiple sclerosis by Terasaki et al. in 1972, B13 and psoriasis by Russell et al. in 1972,
and B27 and ankylosing spondylitis by Brewerton et al. in 1973. There are currently so
many genetic associations with HLA that there are complete textbooks devoted to the
subject. Therefore, we will focus here on a small number of illustrative examples, some
of which are selected because they are rarely discussed, and some because they are fre-
quently discussed and have become exemplary models. The selected examples in this chap-
ter include hemochromatosis, psoriasis (type I and type II), severe cataplectic narcolepsy,
three autoimmune liver diseases [primary sclerosing cholangitis (PSC), primary biliary
cirrhosis (PBC) and autoimmune hepatitis (AIH)], multiple sclerosis, and systemic lupus
erythematosus (SLE). The selection illustrates the impact of the MHC immunoregulatory
genes in clinical disease, and informs the debate on disease diagnosis, treatment and care,
and disease pathogenesis for a large variety of common and some less common diseases.
Type 1 and type 2 diabetes (T1D and T2D), rheumatoid arthritis, ankylosing spondylitis,
cancer, and infectious disease are all discussed elsewhere in this book, either within other
chapters or in chapters of their own, and therefore the role of the MHC and MHC-encoded genes
in these diseases is not covered here.
6.1 Histocompatibility
Many geneticists avoid discussing the MHC because of its complexity and as a conse-
quence it is seen by many as an area for specialists only. These specialists are mostly tissue-
typers who provide data for transplant programs in particular. A summary of the history
of the major developments in the MHC is shown in Figure 6.1.
GWAS
HGMP
1991 PCR
1988 RFLP
1987 HLA and T1D
1987 A2 crystal
1984 Classification class II
1970 HLA-Cw (now C)
1967 HLA-B (and haplotypes)
1967 First IHTW
1964 Microlymphocytotoxicity test
1963–1965 HL-A (and MAC now HLA-A*02)
1958 Antibodies from pregnancy for HLA
1944–1946 Peter Medawar – LA
1937 Peter Gorer – H2 mouse MHC
Little, Snell, et al., – Mouse tumors and H2
1900 Karl Landsteiner – ABO blood groups
1829 James Blundell – Blood transfusion
Figure 6.1: Time pyramid for major discoveries and developments in the world of HLA. The figure is a time
pyramid illustrating the timing of some of the major discoveries and developments in the world of HLA. It shows
selected developments over the last millennium from recent major technological advances to early discoveries
that underpin our knowledge today. As students of science we should not forget the principle that today’s work
is based on the work of those who came before us. Many of the above achieved fame in their own lifetimes,
but not all are as well remembered as they deserve. The list includes several Nobel Prize winners. The top of the
pyramid is left unfilled for the next great leap ahead.
the outcome of tissue grafts in both animals and in humans. These factors turned out to
be encoded in the MHC. In the early days, at least, success or failure in clinical transplan-
tation could depend on matching donor and recipient HLA types. The HLA antigens
encoded in the MHC are important in bone marrow and kidney transplantation, and less
important in heart and liver transplantation. The history of the discovery of the MHC is
discussed in brief in Box 6.1.
Naming the HLA antigens and alleles up to and including the early
molecular genotyping era
In addition to being one of the most complex systems in the human genome, the MHC contains
some of the most polymorphic genes. To understand the nomenclature applied to HLA we need to
look back at the history; however, because this is neither an immunology book nor a history
book, a prolonged discussion of early nomenclature is not appropriate here and it will be
dealt with only in brief. Table 6.1 indicates some of the chaos that was present in the early
world of HLA typing when it came to naming new alleles and genes, and how this was adjusted
to create the current standard nomenclature. The
International Histocompatibility Testing Workshops (IHTWs) that were first set up in
[Figure 6.2 grid: recipient blood groups O, A, B, and AB. Antibodies produced: group O, anti-A and anti-B; group A, anti-B; group B, anti-A; group AB, no ABO antibodies.]
Figure 6.2: Donor and recipient interaction for ABO blood groups. The four main ABO blood groups are
encoded on chromosome 9. Those with blood group O can donate to all other groups, but because they can
produce antibodies to A and B antigens they can only receive blood from O donors. In contrast, those with AB
can only donate to other AB carriers because O, A, and B carriers can all produce antibodies to either A and B
(O blood group), or A (B blood group) or B (A blood group). However, individuals with blood group AB do not
produce antibodies against the other blood groups and can receive blood from any donor.
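The donor and recipient rules described in the caption above can be sketched in a few lines of Python. This is a simplified illustration only: it models the ABO system alone (Rh and all other blood group systems are ignored), and the names ANTIGENS, antibodies, can_donate, and compatible_recipients are hypothetical names chosen for this example.

```python
# Antigens carried on red cells for each ABO group.
# Simplified model: only the ABO system is considered.
ANTIGENS = {"O": set(), "A": {"A"}, "B": {"B"}, "AB": {"A", "B"}}


def antibodies(group):
    """Return the ABO antibodies that a person of this group can produce."""
    return {"A", "B"} - ANTIGENS[group]


def can_donate(donor, recipient):
    """A donation is compatible when the recipient has no antibody
    against any antigen carried on the donor's red cells."""
    return not (ANTIGENS[donor] & antibodies(recipient))


def compatible_recipients(donor):
    """List the recipient groups that can receive blood from this donor."""
    return [group for group in ANTIGENS if can_donate(donor, group)]


if __name__ == "__main__":
    for donor in ANTIGENS:
        print(donor, "can donate to", compatible_recipients(donor))
```

Under this model group O can donate to every group and group AB only to AB, in agreement with the caption.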
BOX 6.1 OF MICE AND MEN: HOW EARLY WORK WITH MICE REVEALED
THE POSSIBILITY OF THE MHC AND THE HLA ANTIGENS
The first clues to the existence of the MHC came from mouse models of tumor immunol-
ogy. As early as 1903, Jensen noted that different strains of mice were not equally suscep-
tible to the growth of grafted tumor cells. Little and colleagues, who were also working
on mice, proposed that acceptance of tumor grafts depended on the graft and host tis-
sues possessing a large number of susceptibility factors in common. Bauer, working with
human subjects in 1927, had also noted the acceptance of skin grafts in identical twins
was consistent with the theory that compatibility is genetically determined. In 1937,
Peter Gorer proposed a genetic theory for transplant rejection: “Normal and neoplastic
tissues contain isoantigenic factors (which are) genetically determined. Isoantigenic fac-
tors present in the grafted tissue and absent (from the) host are capable of eliciting a
response which results in the destruction of the graft.”
These ideas may not seem revolutionary in 2015, but in 1937 they represented a great
leap forward in both genetics and immunology, and Peter Gorer is considered one of
the founding fathers of immunogenetics. Peter Medawar followed up Gorer’s idea work-
ing on skin grafts in rabbits from which he was able to suggest that leukocytes and skin
share important transplantation antigens. These were the first ideas pointing at HLA.
Leukocytes are white blood cells (specifically, the mononuclear cells), and these antigens
were later called leukocyte antigens. After much work by many different groups, Jean
Dausset published his hypothesis in 1952 suggesting that a similar antigenic system to
that seen in mouse erythrocytes (red cells) by Gorer existed on the surface of human
leukocytes. Subsequently the “antigens” were called “human leukocyte antigens” (HL-A
or HL-antigen and finally HLA). One distinct feature of these antigens is that, unlike the
ABO blood groups, they are not expressed on red blood cells. This system is the HLA system
we know today, and Dausset and several others were later awarded the Nobel Prize for this
and related work.
The table illustrates the complexity of the original naming systems for new antigens in the left-hand column.
Through a series of committees and workshops, these differences have been clarified so that a single naming
system works. A substantial restructuring of HLA naming has taken place in recent times.
1965 have ensured a consistent procedure for identification, reporting, and naming of all
HLA alleles, so that different laboratories can perform HLA typing on the same sample and
produce the same result. The history of how this came together is indicated in Box 6.2.
Standardization of the naming system for HLA was essential for clinical transplantation,
where matching of donor and recipient is required.
Early studies suggested a single locus for the histocompatibility antigen. Bodmer & Payne
proposed a prefix of LA for leukocyte antigen. Dausset proposed the prefix HU for human
system. Using the latter system, HU-1 meaning human system-1 was the first antigen.
This was a time of considerable confusion. Each group began naming identified antigens
differently: 4a and 4b (Van Rood and Leeuwen, 1963), LA1, LA2, and LA3 (Payne and
Bodmer, 1964), Ao (Amos), Bt (Batchelor), To (Ceppellini, who is credited with coining
the term “haplotype”), Da (Dausset) and Te (Terasaki, who was instrumental in develop-
ing the microcytotoxicity test). These early pioneers soon realized that it would be essential
to standardize the systems for naming and testing these HLA antigens. In order to do this
they assembled and decided on steps to unify and clarify the naming and identity of the
HLA antigens. The first step was to set up regular International Histocompatibility Testing
Workshops (IHTWs). The first of these was held in 1965. In 1968, the term HL-A stand-
ing for “Human Leukocyte A” was coined; the hyphen was dropped from the name follow-
ing the recognition of other class I loci. The different antigen families (genes today) were
called HLA A, HLA B, and HLA Cw. As part of standard procedure all newly identified
antigens were given a “w” until confirmed at the next IHTW. HLA C antigens (and later
alleles) retained the “w” in their name to avoid confusion with complement components, which
are also named using an alphanumeric code with C as the letter and a single number. However,
C alleles are no longer indicated by Cw, but by C alone.
Throughout the 1970s and early 1980s, a number of authors proposed the existence of
novel class II loci and alleles; these included DC and MB (some of which are the same as
LB), MT, SB, Te, and BR. This was very much like a repeat of the late 1960s and early
1970s. At the 1984 IHTW these novel class II antigens were reclassified and assorted,
giving rise to the present DR, DQ, and DP picture for class II. Thus, DC, LB, and MB
were all identified as DQ antigens; MT1, MB1, and DC1 were recognized as equivalent
to DQ1; MT2 and MT3 were reassigned DRw52. BR3, BR4, Te24, and MB2 were
reassigned DRw53. DC3 and LB-E17 were assigned as DQ2. MB3, MT4, DC4, and
TB21 were reassigned DQ3. All SB antigens were identified as belonging to a single
locus renamed DP to fit the sequence DR, DQ, DP. While some HLA DQ antigens were
detectable by serology, HLA DP antigens (previously SB) were not.
Current nomenclature is focused on the genotype rather than the phenotype as before.
There is an enormous amount of polymorphism in the HLA genes, as seen in this chapter,
and consequently in 2010 the nomenclature was reformed. Gene names are written in italics
and begin with the prefix HLA, which is joined to the locus by a hyphen (e.g. HLA-A). Next,
the allele or allele family is identified after an asterisk (e.g. HLA-A*01), with A*01
referring to the first group or family of alleles of the HLA-A gene. These first two digits
refer to the antigen and are essentially the same as the serotype or phenotype; in early
studies the example above would be the A1 antigen. In the new era, sub-typing of this group
has revealed a much greater number of alleles, so more codes are needed to distinguish them
from each other. Therefore, a colon and a second set of digits are used to identify variants
of the HLA-A*01 family, thus HLA-A*01:01. Alleles distinguished at this level must encode an
amino acid change. If the variant does not encode an amino acid change (i.e. it is a
synonymous polymorphism within the coding region) it is listed using a third set of digits
after a second colon, for example HLA-A*01:01:01. Alleles that differ only outside the coding
sequence, for example in the introns, are listed using a fourth set of digits after a third
colon, thus HLA-A*01:01:01:01. The new naming system is designed to cope with the most
polymorphic genetic system in the human genome.
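The colon-delimited naming scheme described above lends itself to a small parsing example. The sketch below is illustrative only: the regular expression, the function names parse_hla and describe, and the wording of FIELD_MEANINGS are assumptions made for this example, and expression suffixes such as N or L are deliberately not handled.

```python
import re

# Matches names such as HLA-A*01:01:01:01; the "HLA-" prefix is optional.
_ALLELE = re.compile(
    r"^(?:HLA-)?(?P<locus>[A-Z0-9]+)\*(?P<fields>\d+(?::\d+){0,3})$"
)

# Meaning of each successive field, paraphrased from the text above.
FIELD_MEANINGS = (
    "allele family (broadly the serological antigen)",
    "variant encoding an amino acid change",
    "synonymous change within the coding region",
    "change outside the coding region",
)


def parse_hla(name):
    """Split an allele name into its locus and its list of fields."""
    match = _ALLELE.match(name)
    if match is None:
        raise ValueError(f"not a recognised allele name: {name!r}")
    return {
        "locus": match.group("locus"),
        "fields": match.group("fields").split(":"),
    }


def describe(name):
    """Pair each field of the name with what that field distinguishes."""
    return list(zip(parse_hla(name)["fields"], FIELD_MEANINGS))


if __name__ == "__main__":
    for field, meaning in describe("HLA-A*01:01:01:01"):
        print(field, "-", meaning)
```

A name with fewer fields, such as HLA-A*01, is simply reported at lower resolution; an old-style serological name such as A1 is rejected because it contains no asterisk.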
Antigens are proteins; in the case of HLA, these are usually expressed on the cell sur-
face. Alleles are genetic variants, which in the case of HLA usually refers to the region
of the gene that encodes these proteins. The two terms are often used interchangeably.
Early work with serum and cells was based on detecting antigens. Later work based on
restriction fragment length polymorphism (RFLP) or polymerase chain reaction (PCR)
analysis detected genetic variants and therefore uses the term alleles. Occasionally, depend-
ing on the level of specificity applied, the term family may also be used, i.e. the HLA A2
family, where HLA A2 is tested rather than a range of different HLA A2 variants. The term
family will be applied to both genes and antigens when appropriate.
HLA class I
Understanding nomenclature is difficult in any subject, and never more so than with regard to
HLA class I. Although there are many HLA class I genes, only three major HLA class I genes
have been studied in detail: HLA-A, HLA-B, and HLA-C.
HLA class II
The major HLA class II antigens are DR, DQ, and DP, encoded by the gene pairs DRA
and DRB, DQA and DQB, and DPA and DPB. The arrival of molecular genotyping, which allowed
more accurate HLA typing, heralded a new era. It confirmed what serology had suggested,
i.e. that there may be more than one expressed DRB gene on some, but not all, MHC haplotypes.
It also identified the DQ and DP families of genes, with a single pair of expressed DQA and
DQB genes encoding the DQ molecule and a single pair of expressed DPA and DPB genes encoding
the DP molecule on all MHC haplotypes.
The second expressed DRB gene
There are nine DRB genes numbered DRB1 to DRB9. The first of these, DRB1, is expressed
on all haplotypes. Some haplotypes express only DRB1 and others express a second DRB gene.
Which second DRB gene is carried on each haplotype can be determined from the DRB1 family
encoded on that haplotype. The second expressed DRB gene on haplotypes encoding alleles of
the DRB1*15 and DRB1*16 families is DRB5. DRB3 is the second expressed DRB gene on haplotypes
encoding alleles of the DRB1*03, DRB1*11, DRB1*12, DRB1*13, and DRB1*14 families. DRB4 is the
second expressed DRB gene on haplotypes encoding alleles of the DRB1*04, DRB1*07,
and DRB1*09 families. The DRB1 families that express only a single DRB gene are the
DRB1*01, DRB1*08, and DRB1*10 families.
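Because the second expressed DRB gene follows from the DRB1 family, the relationship can be written as a simple lookup. The sketch below transcribes only the families named in the paragraph above; the names SECOND_DRB and second_expressed_drb are hypothetical, and DRB1 families not mentioned in the text are deliberately left out rather than guessed.

```python
# DRB1 allele family -> second expressed DRB gene on the same haplotype.
# None means that DRB1 is the only expressed DRB gene on that haplotype.
SECOND_DRB = {
    "15": "DRB5", "16": "DRB5",
    "03": "DRB3", "11": "DRB3", "12": "DRB3", "13": "DRB3", "14": "DRB3",
    "04": "DRB4", "07": "DRB4", "09": "DRB4",
    "01": None, "08": None, "10": None,
}


def second_expressed_drb(drb1_allele):
    """Return the second expressed DRB gene for a name such as 'DRB1*03:01'."""
    name = drb1_allele[4:] if drb1_allele.startswith("HLA-") else drb1_allele
    locus, _, fields = name.partition("*")
    if locus != "DRB1" or not fields:
        raise ValueError(f"expected a DRB1 allele name, got {drb1_allele!r}")
    family = fields.split(":")[0]
    # A KeyError here means the family is not one of those listed in the text.
    return SECOND_DRB[family]


if __name__ == "__main__":
    for allele in ("DRB1*15:01", "HLA-DRB1*03:01", "DRB1*07:01", "DRB1*01:01"):
        print(allele, "->", second_expressed_drb(allele))
```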
The second expressed DRB genes are all thought to be only weakly expressed in comparison
with DRB1, although expression levels may alter under immune stress. The significance of
having two expressed DRB genes, as opposed to one, on some haplotypes, both of which may be
polymorphic, is not properly understood in the context of complex disease, but the second
DRB locus should not be overlooked.
The DRB pseudogenes
The final set of DRB genes to consider are DRB2, DRB6, DRB7, DRB8, and DRB9, all of
which are pseudogenes and are not expressed. All haplotypes carry the DRB9 pseudogene,
but the number of other pseudogenes carried on each varies depending on the DRB1
family. The different haplotypes (including pseudogenes) are illustrated in Figure 6.3 and
details of some DRB gene family nomenclature are given in Table 6.2.
Splits
These developments also helped to identify what were previously referred to as “splits”
(more correctly “isoforms”). This is demonstrated in Figure 6.4. For example, the HLA
B5 antigen was found to have three variants; these were first called Bw51, Bw52, and
Bw53. The “w” in the name was dropped when their status was confirmed, leaving the
labels B51, B52, and B53. These are now more correctly identified as B*51, B*52, and
B*53. When they were discovered these three variants were named according to the next
available unused number in the catalogue for HLA A and B antigens. Each of these variants
has multiple subdivisions. However, not all new allele assignments were bona fide: some
turned out to be incorrect and some alleles are null alleles. When an allele assignment
is incorrect, the number is deleted from the official sequence record and is not reused.
Consequently there are gaps in the allele sequence numbers, e.g. there is no B*51:25. In
addition, the level of variation does not stop after four figures. Some alleles carry
synonymous polymorphisms, which can lead to further splits and names like B*51:01:01. In
fact, there are seven members in the B*51:01 family (B*51:01:01 to B*51:01:07). Though
these are all members of the B5 family, not all of the alleles were
detectable by serological typing and quite a few B5s identified by serology were incor-
rectly assigned as members of cross-reactive groups. DNA genotyping revolutionized HLA
Figure 6.3: Organization of the HLA DRB genes. Each haplotype may carry two or more DRB genes. However,
a significant number of the DRB genes are pseudogenes (DRB2, DRB6, DRB7, DRB8, and DRB9; shown in gray)
and are not expressed.
Table 6.2 Commonly used DRB allele family names past and present.
Names are used in this chapter as they have been historically and as they are currently; the evolution of HLA
and the change in nomenclature are explained in some detail. To make reading easier, the simplest names have
been used where possible, i.e. DR3 in place of DRB1*03:01; however, it is not always appropriate to make this
exchange, especially when referring to molecular structures. For the ancestral haplotypes, a mixture of old and
new nomenclature is used depending on published history and the nature of the studies performed. For example,
B8, which is old nomenclature, is often used. This has been applied because although this example should be
correctly labeled HLA-B*08, old studies did not use this naming system. In contrast most studies of DQ used the
current nomenclature, e.g. DQB1*06:02.
typing to such an extent that the number of detectable alleles and variants (splits) cata-
pulted from around 100 A and B antigens to several thousand alleles at each locus. The
same process had the same effect on the HLA class II allele groups.
Taken together, the identification of alleles at the new genes and the improved quality of
typing (illustrated in Tables 6.3 and 6.4), which enabled a greater number of variants
to be accurately identified, have been astounding developments. HLA typing is now
well and truly in the molecular genetics era and nomenclature has been updated too.
The current naming system for HLA alleles and genes allows for a
greater level of resolution to be reported
As HLA was at first thought to be a single locus system, antigens were first numbered
sequentially A1, A2, etc., as they were discovered and approved. However, it soon became
clear that there were at least two HLA loci: A and B. On close inspection when the antigen
sets for the two loci (A and B) were pulled apart, it was discovered that there were quite
a few B antigens that had A antigen labels. It was decided that for numbering purposes,
A and B would share the same number set. Thus, no A antigen has the same number
sequence as a B antigen and vice versa; the numbers assigned for A and B antigens are
exclusive. Therefore, there are HLA A1, A2 and A3 antigens, but no B1, B2, and B3 antigens.
This tradition has persisted. The naming of new antigens and alleles has been subject to
standard practice since the 1965 IHTW. It is also important to know that all
B, and some A, antigens have a sequence variation that labels them as belonging to either
the Bw4 or Bw6 family. This is illustrated in Table 6.5.
[Figure 6.4 diagram: the HLA-B5 antigen family.]
Figure 6.4: HLA nomenclature tree: splits. The figure illustrates the development of HLA nomenclature based
on one member of the HLA-B antigen family. B5 was identified early on in studies of HLA (as indicated by the low
number B5). Three variants of B5 were later identified. These variants were often called “splits” in tissue-typing
laboratories and because at that time the numbering sequence for the combination of HLA A and B antigens had
reached 50, the next numbers to be assigned were given to these three variants, i.e. B51, B52, and B53. As these
antigens were first identified by serotyping, the original names reflect this naming, i.e. the phenotype (e.g. B51)
not the genotype (e.g. B*51). In the molecular era further variation has been identified for each of these antigens
with allele labels suggesting more than 40 B51 (B*51:01 to B*51:42) variants and approximately 10 for both B52
(B*52:01 to B*52:09) and B53 (B*53:01 to B*53:11).
In the molecular era we are dealing with alleles and genotypes. The alleles identified are named
and numbered in a similar way to the antigens, but more information can be provided. For
example, DR alleles of the DRB1 family are referred to as DRB1*01 and a further :01 can
be added to indicate a higher level of specificity (or resolution). The allele thus labeled is
DRB1*01:01. The use of DRB1 indicates that this is the product of the DRB1 locus and not
of the other expressed DRB loci (DRB3, DRB4, and DRB5). Considering the large number of
HLA alleles so far identified it is important to be as precise as possible when naming alleles or
allele families. Table 6.6 illustrates the correct use of the current naming techniques. When
looking at this we need to consider whether studies are being performed at low or high resolu-
tion, or somewhere in the middle. The term family can be useful when genotyping is being
performed at low resolution. Current nomenclature is frequently updated and can be found at
http://hla.alleles.org/antigens/index.html or http://www.ebi.ac.uk/imgt/hla/stats.html.
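The difference between low- and high-resolution typing can be made concrete by truncating allele names to a chosen number of fields. The following is a minimal sketch under the assumption that names use the colon-delimited format described above; the function names reduce_resolution and family_counts are hypothetical, and the alleles in the usage lines are arbitrary illustrative names rather than data from any study.

```python
from collections import Counter


def reduce_resolution(allele, n_fields=1):
    """Truncate a name such as 'DRB1*01:01:02' to its first n_fields fields."""
    locus, _, fields = allele.partition("*")
    if not fields:
        raise ValueError(f"no '*' separator in {allele!r}")
    return locus + "*" + ":".join(fields.split(":")[:n_fields])


def family_counts(alleles):
    """Count alleles per family, i.e. at the lowest (one-field) resolution."""
    return Counter(reduce_resolution(allele, 1) for allele in alleles)


if __name__ == "__main__":
    typed = ["DRB1*01:01", "DRB1*01:02", "DRB1*03:01"]
    print(reduce_resolution("DRB1*01:01:02", 2))
    print(dict(family_counts(typed)))
```

Reporting at the family level loses information, so two samples that agree at low resolution may still differ at high resolution.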
Table 6.3 Analysis of the correlation between serological and RFLP DR typing in 1000 individuals.
This table shows the correlation between two different methods of HLA DR typing: serological phenotyping versus
RFLP genotyping. In this model, the analysis shown in the table assumes the RFLP technique always correctly
identifies the antigen/genotype. The earlier method is of lower quality due to the poor quality of antibodies for
testing the less common DR antigens DR8, DR9, and DR10, and also the inability to accurately detect DR6, which
was often unassigned especially in heterozygotes. True negative indicates true homozygotes.
The level of genetic polymorphism in the HLA genes is illustrated in Figure 6.5. However,
the number of alleles at a locus is not always equal to the number of distinct proteins
that the locus encodes. For example, the DNA sequence may change, but the polypeptide
composition may remain unchanged. This is because there is redundancy within the genetic
code so that a single amino acid may be encoded by more than one DNA triplet (e.g. CCC,
CCA, and CCG each encode the amino acid proline). Thus, the numbers of proteins encoded by
the three HLA class I loci (A, B, and C) are
2185, 2870, and 1850, respectively. Each of these values is lower than the figures given
for the number of alleles. Some of the alleles are null alleles with no expressed product; at
present there are 147 null HLA-As listed, 124 HLA-Bs, and 86 HLA-Cs.
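The reason why there are fewer distinct proteins than alleles can be shown with a toy example of codon redundancy. The sketch below uses a deliberately partial codon table containing only the codons needed here, and the three short sequences in the usage lines are invented for illustration; they are not HLA sequences. The names CODONS, translate, is_synonymous, and count_distinct are hypothetical.

```python
# Partial codon table (DNA coding strand); only the codons used below.
CODONS = {
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",  # proline
    "CTG": "L",                                        # leucine
    "CGG": "R",                                        # arginine
}


def translate(dna):
    """Translate a DNA string whose length is a multiple of three."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_synonymous(codon_a, codon_b):
    """True when two codons encode the same amino acid."""
    return CODONS[codon_a] == CODONS[codon_b]


def count_distinct(sequences):
    """Return (number of distinct DNA sequences, number of distinct proteins)."""
    proteins = {translate(seq) for seq in sequences}
    return len(set(sequences)), len(proteins)


if __name__ == "__main__":
    toy_alleles = ["CCCCTG", "CCACTG", "CCGCGG"]  # invented sequences
    print(count_distinct(toy_alleles))
```

Here three different DNA sequences give only two different peptides, because the change from CCC to CCA is synonymous.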
Table 6.4 Analysis of the correlation between RFLP and PCR oligonucleotide probing for HLA-DRB1
genotyping in 395 individuals.
This table shows the correlation between two different methods of HLA-DRB1 genotyping: RFLP genotyping
versus PCR-oligoprobing using 24 oligoprobes. PCR clearly identifies 18 DRB1 allele families and this example
includes the rarely discussed DRBr allotype (which is in fact one of the DRB1*01 family members). In this model,
the analysis shown in the table assumes the PCR technique always correctly identifies the antigen/genotype.
Group 77 78 79 80 81 82 83
Bw6 S L R N L R G
Bw4 D L R T L L R
Bw4 S L R T L L R
Bw4 S L R I A L R
Bw4 N L R I A L R
Bw4 N L R T A L R
The sequences of the Bw4 and Bw6 epitopes vary at positions 77 to 83 of the second exon of the α1 domain on
the HLA class I molecule. The critical amino acid difference between Bw4 and Bw6 appears to be a glycine (G) for
arginine (R) exchange at position 83. Variations at the other positions in this amino acid sequence define the five
Bw4 genotypes. Note that there is no variation at positions 78 and 79, which all carry leucine (L) and arginine (R),
respectively.
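The assignment of a sequence to the Bw4 or Bw6 group can be sketched as a lookup on the residues at positions 77 to 83. In the example below the motifs are transcribed row by row from the table above; the names MOTIFS, classify_epitope, and varying_positions are hypothetical, and any motif that is not in the table is reported as unassigned rather than guessed.

```python
# Residues at positions 77-83 (one-letter amino acid codes), one entry per table row.
MOTIFS = {
    "SLRNLRG": "Bw6",
    "DLRTLLR": "Bw4",
    "SLRTLLR": "Bw4",
    "SLRIALR": "Bw4",
    "NLRIALR": "Bw4",
    "NLRTALR": "Bw4",
}


def classify_epitope(residues_77_83):
    """Return 'Bw4', 'Bw6', or 'unassigned' for a seven-residue motif."""
    motif = residues_77_83.upper()
    if len(motif) != 7:
        raise ValueError("expected exactly seven residues (positions 77-83)")
    return MOTIFS.get(motif, "unassigned")


def varying_positions():
    """Positions at which the tabulated motifs differ from one another."""
    return [77 + i for i in range(7) if len({m[i] for m in MOTIFS}) > 1]


if __name__ == "__main__":
    print(classify_epitope("SLRNLRG"))
    print(varying_positions())
```

The second function confirms the note above: the tabulated motifs vary at positions 77 and 80 to 83, but not at positions 78 and 79.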
Table 6.6 The current system for naming HLA genes and alleles.
Name                     Explanation
HLA-DRB1                 This indicates a specific HLA locus (DRB1)
HLA-DRB1*01              This indicates a group or family of alleles that encode the DR1 antigen
HLA-DRB1*01:01           This indicates a specific allele from the DRB1*01 family (many scripts use DRB1*0101 for this and omit the “:” sign)
HLA-DRB1*01:01:02        This indicates an allele that encodes a synonymous mutation of DRB1*01:01
HLA-DRB1*01:01:01:02     This indicates an allele that encodes a mutation outside the coding region of DRB1*01:01
HLA-DRB1*07:10N          This indicates a “null” allele (i.e. an allele that is not expressed)
HLA-DRB1*01:01L          This indicates an allele encoding a protein with significantly reduced cell surface expression
for formation of a wider range of variation in the expressed DQA molecules than initially
indicated from the simple allele frequencies. In addition because every individual inherits
two haplotypes—one from each parent—DQA1 alleles of paternal origin can form func-
tional DQ molecules with DQB1 alleles of maternal origin (and vice versa), thus adding a
further level of diversity for DQ molecules. The same is also true for HLA DP.
[Figure 6.5 bar chart: number of polymorphisms (vertical axis, 0 to 4000) for the genes A, C, B, MICA, TNF, C2/C4, DRB1, DQA, DQB, DPA, and DPB.]
Figure 6.5: Frequency of HLA polymorphisms. The figure shows the extent of polymorphism in HLA as of
April 2015. The very high frequency of HLA-B polymorphism gives a false impression regarding polymorphism for
some of the other genes listed. There are 101 MICA alleles, for example. One hundred and one is a considerable
level of polymorphism; it only looks small compared with the 3887 B alleles represented. The order of the genes is
indicated on the right-hand descending column.
Case A
DQB1*02:01 DQA1*05:01 DRB1*03
Case B
DQB1*02:02 DQA1*02 DRB1*07:01
Figure 6.6: The HLA DQ2 molecule constructed in cis and trans. The figure shows the HLA molecule being
constructed in cis and trans. The cis construction (case A) is illustrated by full gray arrows with each allele coming
from a single haplotype from the same parent where both the α-chain and the β-chain of the final molecule are
encoded on the DRB1*03 haplotype by the alleles DQA1*05:01 and DQB1*02. The trans construction (case B) is
illustrated by the dotted arrows where more than one parental haplotype contributes a polypeptide chain to form
the final molecule. In each case, the DQA1 locus contributes the DQA1*05:01 allele that encodes the necessary
α-chain and the DQB1 locus contributes the DQB1*02 allele that encodes the necessary β-chain. The ability of the
DQ genes to construct molecules in trans is important as it increases the number of potential
variants at the molecular level. HLA DR is less able to do this as the DRA gene is essentially
monomorphic, with very little variation.
to increase the genetic variation through trans expression as opposed to the normal cis
pattern can have important personal consequences (Figure 6.6). Sollid et al. found that
patients with celiac disease have an increased frequency of HLA DR3 and DR7. The
genetic association with DR3 was well known, but the association with DR7 was less
well reported. Further analysis revealed that DR7 is only associated with celiac disease
in the presence of either DR3 or DR5. DR3 is always found with DQ2
(DQA1*05:01–DQB1*02:01; sometimes called DQ2.5). DR5 is almost always found with DQ7
(DQA1*05–DQB1*03) and DR7 occurs with either DQ9 (DQA1*02:01–DQB1*03)
or DQ2 (DQA1*02:01–DQB1*02:02). The DQA1 alleles of the DR3–DQ2 and DR5–DQ7
haplotypes are almost identical except for codon 135 in the membrane-proximal
domain of the DQ α-chain. DR5/DR7 heterozygotes with the DR7–DQ2 haplotype therefore
carry, in the trans position, DQA1 and DQB1 alleles with essentially the same sequences
as those found in cis on the DR3–DQ2 haplotype. The ability to form
DQA/DQB in trans may explain the complex HLA association with celiac disease.
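The cis and trans pairings discussed above can be enumerated directly. The sketch below assumes that each haplotype contributes one DQA1 allele and one DQB1 allele and that any α-chain can pair with any β-chain, which is a simplification; the function name dq_heterodimers and the dictionary layout are hypothetical. The usage lines take the allele names for a DR5–DQ7 and a DR7–DQ2 haplotype from the text and from case B of Figure 6.6.

```python
def dq_heterodimers(haplotype_1, haplotype_2):
    """Enumerate DQA1/DQB1 pairings for two haplotypes.

    Each haplotype is a dict with 'DQA1' and 'DQB1' keys. A pairing is
    labelled 'cis' when both alleles come from the same haplotype and
    'trans' when they come from different haplotypes.
    """
    haplotypes = (haplotype_1, haplotype_2)
    pairings = []
    for i, alpha_source in enumerate(haplotypes):
        for j, beta_source in enumerate(haplotypes):
            label = "cis" if i == j else "trans"
            pairings.append((alpha_source["DQA1"], beta_source["DQB1"], label))
    return pairings


if __name__ == "__main__":
    dr5_dq7 = {"DQA1": "DQA1*05", "DQB1": "DQB1*03"}
    dr7_dq2 = {"DQA1": "DQA1*02", "DQB1": "DQB1*02:02"}
    for alpha, beta, label in dq_heterodimers(dr5_dq7, dr7_dq2):
        print(alpha, "+", beta, "(" + label + ")")
```

For a DR5/DR7 heterozygote one of the two trans pairings combines a DQA1*05 α-chain with a DQB1*02 β-chain, the same combination that the DR3–DQ2 haplotype encodes in cis.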
The final groups of genes that need to be considered are those called
pseudogenes, gene fragments, and null alleles
Pseudogenes are genes that are not expressed. DRB pseudogenes have been discussed above,
but there are also HLA class I pseudogenes (see Table 6.7). Gene fragments are exactly as
described and are mostly fragments of once expressed genes, possibly the remains of our
evolutionary past. Null alleles are generally considered to be alleles that are not expressed.
In some cases there is doubt about expression: alleles may be found to be expressed in the
Table 6.7 A list of other HLA class I genes not discussed in the text.
Pseudogenes are genes that are not expressed. Others include the HLA class I pseudogenes (genes that do not
encode functional products), gene fragments (fragments of once expressed genes) which are possibly part of our
evolutionary past, and null alleles (alleles that are not expressed). There is some overlap in the use of these terms.
cell only and not at the cell surface or may be soluble, but not found attached to the cell
surface. More information on this is given at http://hla.alleles.org/antigens/index.html.
Figure 6.7: Gene map of HLA in MHC (showing selected expressed genes only). The figure is not to scale.
The boxed figures and gray arrows mark the distances between loci. This figure covers the old 3.6-Mb pre-xMHC.
Whether a haplotype carries another expressed DRB gene (DRB3, 4, or 5) depends on the DRB1 family encoded
on the haplotype (see Figure 6.3).
basis that the genes within them encoded proteins with similar structures and functions.
This terminology is no longer fit for purpose; all three regions encode a variety of proteins
with variable structures and functions, including many non-HLA antigens.
Of those genes listed in Figure 6.5, Tables 6.7 and 6.8, the most studied in this region have
been the classical HLA genes HLA-A, -B, -C, -DRB, -DQA, -DQB, -DPA, and -DPB. In
addition to these eight loci, some studies have also investigated a selection of other MHC
genes. These include a group of genes encoding proteins of the classical and alternative
complement pathways (C2, C4A, C4B, and Bf ), tumor necrosis factor (TNF)-α (TNFA),
and, more recently, the MHC class I chain (MIC)-like proteins (MICA and MICB).
[Figure 6.8 diagram labels: antigen-binding groove, α1, α2, α3, β-2-microglobulin.]
Figure 6.8: HLA class I molecule structure. The figure represents a simple diagram of the HLA class I molecule
with three immunoglobulin domains (α1, α2 and α3) encoded by the HLA class I allele and a β-2-microglobulin
molecule (encoded on chromosome 15). The α3 unit and β-2-microglobulin unit (in gray shading) form a support
structure for the α1 and α2 globulin domains, which create a closed groove for the binding of antigenic peptides
(the antigen-binding groove). In the case of class I molecules the antigenic peptides are usually eight to nine
amino acids in length. The majority of HLA polymorphisms encode amino acid variations in and around the
antigen-binding groove.
Figure 6.9: Top-down view of the HLA class I binding groove showing some of the sites at which different
alleles encode amino acid variations. Some of the sites for amino acid variation are illustrated by dots. The
figure shows the β-pleated sheets (in gray) which form the floor and two opposing α-helices (in black) which form
the walls of the closed antigen-binding groove of the molecule. For HLA class I, short peptide antigens of around
eight to nine amino acids in length are bound in this groove and presented to the T cell receptor in the formation
of the immune synapse (Figure 6.11)—a key process in adaptive immunity. HLA class II molecules have a similar
structure but the groove ends are open, allowing for longer peptides to be bound. (Adapted from Parham P
[2009] The Immune System, 3rd ed. Garland Science.)
Class II does not appear to be involved in NK cell activity. Furthermore, the mechanisms
by which HLA class I and class II engage and present polypeptides are different. HLA class
I presents antigenic polypeptides from within the cell (i.e. intracellular), whereas HLA class
II presents antigenic peptides that are from outside the cell (i.e. extracellular). The mechanisms
by which this occurs are complex and are illustrated in Figure 6.12.
[Figure 6.12 labels. Left (HLA class II): extracellular antigen; endocytic vesicle; peptide production in
phagolysosome; peptide binding by HLA class II in vesicle. Right (HLA class I): intracellular antigen in cytosol;
antigen processing to peptides in proteasome; peptide transport into endoplasmic reticulum (ER); peptide binding
by HLA class I. Also labeled: nucleus, ER.]
Figure 6.12: Antigen processing by HLA class I and II antigens. The right-hand side illustrates antigen
processing by HLA class I. Intracellular antigens from the cytosol, including self-antigens and those of pathogens
such as bacteria and viruses, are processed through the proteasome and transported through the endoplasmic
reticulum (ER) to the Golgi apparatus. The cell is constantly producing new HLA molecules. Antigenic peptides
contact these HLA molecules in the ER and the resulting HLA–peptide complex is transported through the
Golgi apparatus to the cell surface. On the left-hand side peptides from outside the cell (extracellular peptides)
are taken up by endocytosis and phagocytosis within endocytic vesicles into the cell. The figure illustrates a
macrophage. In this situation proteases in the vesicles break down captured proteins and they are bound to
newly synthesized class II molecules in the ER and transported through the Golgi apparatus to the cell surface.
(From Parham P [2009] The Immune System, 3rd ed. Garland Science.)
HLA class I, on one hand, and HLA class II, on the other hand, provide two of the
essential mechanisms of immune surveillance and induce immune responses to infectious
agents within the cell and outside the cell. These processes frequently overlap, and
are among the main forces in acquired immunity and immune surveillance, which is
essential for survival.
Table 6.9 Narcolepsy amino acid positions for DQB1*06:01 versus DQB1*06:02.
sequence. Depending on the associated allele, possession of a specific amino acid at a spe-
cific site can have profound effects on which antigenic peptides are bound and presented
to the TCR. In narcolepsy, the P4 and P9 pockets appear to have the most pronounced
effects. At P9 the critical change is with β37. At this position there is a tyrosine to aspartic
acid exchange that introduces significantly more negative charge into the P9 pocket and
this may subtly alter the anchor specificity. The P4 binding pocket is the largest pocket
in the expressed DQB1*06:02 structure. In contrast, the DQB1*06:01 has a smaller P4
pocket. This is due to polymorphisms at positions β13 and β26, which result in exchange
of glycine for alanine and leucine for tyrosine. The impact of these changes is considered
to be substantial.
Considering the function of HLA class II molecules and these molecular models, it is
plausible that narcolepsy is an immune-mediated disorder dependent on T cell responses.
At a molecular level, changes in the chemical structure influence the functional dynamics
of the antigen-binding groove in such a way that there may be expansion or contraction
of the range of peptides that may be preferentially bound and presented to the TCR.
In addition, subtle changes in the molecular structure of HLA molecules may influence
the orientation of the bound peptide and the recognition of the HLA–peptide complex
by the TCR. It is therefore interesting that three of the amino acid differences between
DQB1*06:01 and DQB1*06:02 are in the putative TCR recognition sites, suggesting that
both antigen binding (variation in the binding groove) and the TCR recognition site may
be important in determining HLA-encoded susceptibility to narcolepsy.
Finally, more recent studies in narcolepsy aimed at identifying immune genetic markers
in the Han Chinese population found several novel associations, including an association
with DQB1*03:01 and disease onset. The study also found that selected HLA-associated
markers differed significantly in cases where onset occurred after the recent influenza
pH1N1 pandemic. This study indicates three major points:
• The association in the Han Chinese population may be different to that in Northern
Europeans.
• The association may be related to age of onset or disease severity.
• The association may be related to other environmental factors.
motor weakness, impaired vision, and lack of coordination. It is a highly variable disease
and some sufferers experience slow progressive degeneration, while others have peaks and
troughs with periods of recovery in between periods of degeneration.
It is generally accepted that multiple sclerosis is a genetically complex disease involv-
ing both genes and the environment with a considerable genetic load associated with
specific MHC haplotypes, especially the DRB1*15:01–DQA1*01:02–DQB1*06:02 haplotype.
This latter haplotype is normally found with HLA-A*03 and HLA-B*07, and
the extended haplotype is labeled with the abbreviation HLA 7.2 because of the linkage
disequilibrium between DRB1*15 (originally the major form of DR2) and HLA-B7. This
haplotype is very common in European populations. Recent studies suggest the concor-
dance rate for multiple sclerosis in monozygotic twins may be as high as 25%, while in
dizygotic twins concordance is five times lower than in monozygotic twins and in sib-
lings concordance is 10 times lower. However, concordance rates should be considered
with caution as it appears that concordance rates mirror prevalence rates; therefore the
lower the prevalence in the population, the lower the level of concordance. For example,
there are significant differences in concordance rates between Northern and Southern
European populations. This difference is reflected in the lower prevalence for multiple
sclerosis in Southern Europe.
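As a rough arithmetic sketch of the concordance figures quoted above, the short Python fragment below converts the relative statements into implied rates. It assumes that both "five times lower" and "10 times lower" are measured against the monozygotic rate of 25%, which is one reading of the text; the function and variable names are illustrative only, and the outputs are not data from any study.

```python
# Figures taken from the paragraph above; the interpretation is an assumption.
MZ_CONCORDANCE = 0.25   # monozygotic twins: "may be as high as 25%"
DZ_FOLD_LOWER = 5       # dizygotic twins: five times lower than monozygotic
SIB_FOLD_LOWER = 10     # siblings: assumed to be ten times lower than monozygotic


def implied_concordance(mz_rate: float, fold_lower: float) -> float:
    """Return the rate implied by 'fold_lower times lower than the MZ rate'."""
    if fold_lower <= 0:
        raise ValueError("fold_lower must be positive")
    return mz_rate / fold_lower


dz_rate = implied_concordance(MZ_CONCORDANCE, DZ_FOLD_LOWER)
sib_rate = implied_concordance(MZ_CONCORDANCE, SIB_FOLD_LOWER)

print(f"MZ {MZ_CONCORDANCE:.1%}, DZ {dz_rate:.1%}, siblings {sib_rate:.1%}")
```

On this reading the implied rates are about 5% for dizygotic twins and 2.5% for siblings. Because concordance tracks prevalence, the same calculation would give lower absolute figures in a low-prevalence population.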
Early studies of multiple sclerosis
Early studies of the genetic basis of multiple sclerosis identified strong reproducible associa-
tions with the MHC and one haplotype in particular, the HLA 7.2 ancestral haplotype. The
first studies to indicate an association with the MHC were published in 1972 and 1973 by
the same group. Many others followed, but it would be wrong to give the impression that
all early studies identified the same HLA associations or that all studies found any associa-
tion. The reasons for these inter-study variations include differences in techniques applied
and the availability and quality of anti-serum for the early HLA assays as well as sample size,
diagnostic differences, and population differences (Chapters 3 and 5). For example, two
studies based in the UK reported quite different results—one was based on the population
of the remote Orkney Islands and the other based on the general UK population.
However, the association with the 7.2 haplotype has stood the test of time and remains a
confirmed multiple sclerosis risk marker. The same group of HLA class II alleles, especially
DQB1*06:02, are very strongly associated with narcolepsy (as we have discussed), but
DQB1*06:02 is protective against T1D and may have mild protective effects against other
autoimmune conditions (though not all; see below).
The association with the MHC is suggestive of an autoimmune or at least immune-medi-
ated etiology in multiple sclerosis. In fact, studies confirm this link—multiple sclerosis is
an immune-mediated disease. Furthermore, it appears to be an autoimmune disease since
there are myelin antibodies in cerebrospinal fluid. Other evidence to support this comes
from studies of T cell function in multiple sclerosis. The activation of T helper 1 CD4+
cells and production of interferon (IFN)-γ are implicated in the pathogenesis of multiple
sclerosis. These cells are found to be enriched in both the blood and spinal fluid of patients,
and activated macrophages that are found in sclerotic plaques can release cytokines and
other enzymes that are able to cause demyelination. IFN-γ was trialed as a potential treat-
ment for multiple sclerosis, but was found to worsen rather than improve symptoms.
With so much already known, some people question the value of further genetic studies
in multiple sclerosis. There are two reasons why investigations should continue. The first
is illustrated by the disappointing results of the IFN-γ trials, that is our knowledge to
date is incomplete and not sufficient to aid in the formulation of effective therapies. The
second is illustrated in the HLA association. Though we have this information, we have
no idea what the auto-antigen or auto-antigens are, and how many immune-regulatory
processes have failed or broken down to permit this disease to occur. Genetic studies offer
one way to identify these.
[Figure 6.13 labels: MHC; CD80/CD86; cytokines (IL12, TNFRSF); T cell selection (THEMIS); MS.]
Figure 6.13: Illustrating risk genes and cell targets for multiple sclerosis. This figure illustrates the complexity
of interaction between a few selected immune response genes associated with an increased risk of multiple
sclerosis (MS). There are several different explanations for this. Risk alleles may work independently or in an
additive model. The additive model may include a number of different groups. In an additive model, it may
be necessary to inherit several common polymorphisms before there is significant risk of multiple sclerosis.
Alternatively, under the correct set of environmental conditions a solitary polymorphism may be sufficient to
significantly promote the disease or protect from it.
[Figure 6.14 labels (center: positive HLA associations). AIH: HLA 8.1, DRB1*04. PSC: HLA 8.1, DRB1*13:01.
PBC: DRB1*08–DQA1*04:01. DILI-FLU: B*57:01. HBV: DRB1*13:01. Chronic HCV: DQB1*03:01. HAV: DRB1*13:01.]
Figure 6.14: HLA and liver disease. The figure illustrates the major positive HLA associations with seven
different forms of liver disease. Clockwise these are: autoimmune hepatitis (AIH), primary sclerosing cholangitis
(PSC), primary biliary cirrhosis (PBC), drug-induced liver injury specifically flucloxacillin (DILI-FLU), hepatitis B
virus (HBV), hepatitis C virus (HCV), and hepatitis A virus (HAV). The first three are all autoimmune diseases; the
last three are viral liver diseases. For AIH, DRB1*04 represents DRB1*04:01 in Europeans and North Americans of
European origin and DRB1*04:05 in Japanese, Korean, and South American adult cases. The association between
PSC and the HLA 7.2 haplotype is weak and controversial, and is not reported in the figure. The association with
PBC in Japan and China may involve a different DQA1-DQB1 haplotype: DQA1*01:03–DQB1*06:01. DILI-FLU
refers to the extraordinarily strong association with HLA-B*57:01 seen in a minority of DILI patients treated with
the commonly prescribed antibiotic flucloxacillin.
[Figure 6.15 labels recovered: AIH, DRB1*15:01; PBC, DRB1*11 and DRB1*13.]
Figure 6.15: Protective HLA associations with three major autoimmune liver diseases and one example
of drug-induced liver injury. The figure illustrates four major negative (protective) HLA associations with
three autoimmune liver diseases and a selected example of drug-induced liver injury (DILI). Clockwise these
are autoimmune hepatitis (AIH), primary sclerosing cholangitis (PSC), primary biliary cirrhosis (PBC), and
drug-induced liver injury, specifically augmentin (DILI-Augmentin). Comparing Figure 6.14 with Figure 6.15,
DRB1*04, which is associated with an increased risk of AIH, protects from PSC, and DRB1*13, which is associated
with an increased risk of PSC, chronic HAV infection in South America, and AIH in children from Argentina, is
associated with a reduced risk of PBC. The DRB1*07:01 haplotype that protects from augmentin DILI is the one
which carries HLA-B*57:01, the major risk allele for flucloxacillin DILI.
allele in AIH. The Korean and Japanese populations have a shared ancestry and therefore
this was not surprising. However, compared with the European and North American findings,
initial analysis suggested the results were discordant and did not agree with the shared
epitope model for AIH. DRB1*04:05 does not encode LLEQKR at positions 67–72; it
encodes LLEQRR. However, under further scrutiny it was found that the only difference
is at position 71, where DRB1*04:05 encodes arginine rather than lysine. Arginine
is very similar to lysine, being a highly charged polar amino acid of similar
structure. Therefore, at a functional level these two alleles may encode molecules with very
similar preferences for antigen binding.
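The comparison just described can be sketched in a few lines of Python. The fragment below aligns the two six-residue motifs quoted in the text (positions 67–72), lists the positions at which they differ, and checks whether each difference swaps one positively charged residue for another. The set of "positive" residues (lysine and arginine only) and all names in the code are simplifying assumptions made for illustration, not part of the original studies.

```python
# Motifs at DR-beta positions 67-72, as quoted in the text.
EPITOPE_START = 67
SHARED_EPITOPE = "LLEQKR"   # shared epitope model for AIH
DRB1_0405 = "LLEQRR"        # sequence encoded by DRB1*04:05

# Simplifying assumption: only lysine (K) and arginine (R) count as positive.
POSITIVE = {"K", "R"}


def differing_positions(a: str, b: str, start: int = EPITOPE_START):
    """Return (position, residue_a, residue_b) for every mismatch."""
    if len(a) != len(b):
        raise ValueError("motifs must be the same length")
    return [(start + i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


diffs = differing_positions(SHARED_EPITOPE, DRB1_0405)
charge_conserved = all(x in POSITIVE and y in POSITIVE for _, x, y in diffs)

print(diffs)             # [(71, 'K', 'R')]
print(charge_conserved)  # True
```

The single mismatch falls at position 71 and exchanges lysine for arginine, which is the basis for treating the two alleles as functionally similar.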
Extending the shared epitope model for HLA-encoded susceptibility to AIH
At present the LLEQKR model with emphasis on DRβ-71 is the best model to explain
HLA-encoded susceptibility to AIH. However, there may be scope to extend the model
to include amino acids downstream of position 71 as far as position 74. This idea was first
promoted in a paper from Korea by Lim et al. in 2008. In the extended shared epitope
model, LLEQKRGR is encoded by the DRB1*03:01 allele and LLEQ-K (or R)-RAA is
encoded by the DRB1*04:01 and DRB1*04:05 alleles. In the extended model there is a
critical difference between DRB1*03:01 (on one side) and DRB1*04:01 and DRB1*04:05
(on the other). DRB1*03:01 encodes arginine at position 74, while both DRB1*04:01
and DRB1*04:05 encode alanine. As we have already seen, these are two very different
amino acids with different polarity. Essentially, Lim et al. argue that DRB1*04:01 and
DRB1*04:05, which carry lysine 71 and arginine 71, respectively (both positively charged),
can form a salt bridge with the negatively charged P4 binding pocket. DRB1*03:01 is also able to form
            HLA locus
Haplotype   A      C      B      DRB3/4/5   DRB1     DQA1     DQB1
Haplotypes that encode susceptibility to AIH
8.1         *01    *07    *08    3*01:01    *03:01   *05:01   *02:01
DR4–DQ7     –      –      –      4*         *04:01   *03      *03:01
DR4–DQ4     –      –      *54    4*         *04:05   –        *04:01
A11–DR4     *11    –      –      4*         *04      –        –
                                            *04:05
DR13        –      –      –      –          *13:01   *01:03   *06:03
Haplotypes that encode resistance (protection) to AIH
7.2         –      *07    –      5*01:01    *15:01   *01:02   *06:02
DR4–DQ7     –      –      –      4*         *04:06   –        *03:01
DR13        –      –      –      –          *13:02   –        –
The first two haplotypes are associated with AIH in UK, Northern Europe and USA. The third haplotype is
associated with AIH in Japan and Korea. The fourth and fifth haplotypes are found in pediatric AIH in Argentina. In
Argentina, DRB1*13:01 is the major susceptibility determinant in pediatric AIH, but not in adult cases. In adult AIH
from South America, DRB1*04:05 is the major allele. Not all studies are listed, only enough to give a cross-sectional
view of the data on AIH and those which contribute most to the scientific discussion of this subject. In cases where
individual groups have produced multiple studies, the study with the largest number of patient cases is shown.
Alleles are not listed at all loci for all haplotypes; in each case a dash is entered to indicate this.
this salt bridge as it has lysine at position 71. This confirms the idea that the interaction
of the DRβ-71 residue with P4 is of primary importance in AIH. This implies that the
amino acid at position 74 is of secondary importance. Thus, the alanine carried by both
DRB1*04:01 and DRB1*04:05 at position DRβ-74 does not appear to influence the effect
of the amino acid at position DRβ-71 on the P4 binding pocket, owing to its small size and
neutral charge, whereas DRB1*03:01 has arginine at position 74 and this can also interact with
the P4 binding pocket to form a second salt bridge. In this extended model those carry-
ing DRB1*03:01 receive a double-dose effect through structural variations encoded by a
single allele. This may also go some way to explaining the differences in HLA distribution
reported in early- and late-onset disease in AIH (Figure 6.16).
A slightly more difficult situation arose when studies from South America reported associ-
ations with DRB1*13:01 in children with AIH. However, other studies in South America
have suggested that chronic hepatitis A virus is associated with DRB1*13:01 and this
virus has been identified as a potential trigger for AIH. Interestingly, adult patients from
Argentina have an excess of DRB1*04:05, the same allele that is associated with susceptibility
to AIH in Japan and Korea. Thus, at least the data for South American adults
fit the shared epitope hypothesis. The different data in children from South America
remain unexplained.
Haplotypes and clinical severity in AIH
The two AIH major susceptibility haplotypes DRB1*03:01 and DRB1*04:01 or DRB1*04:05
(depending on nationality) each carry a second expressed DRB gene: DRB3 for DRB1*03:01
[Figure 6.16 labels: R74, K71; A74, R71.]
Figure 6.16: Salt bridge formation over the P4 binding pocket in AIH may explain genetic associations
with different HLA alleles. The figure is a simplified illustration of the peptide-binding groove of an HLA
molecule focusing on the interactions of the amino acids at positions 71 and 74 on the DRB1-encoded β-chain.
The figure compares the formation of the salt bridge at the P4 binding pocket (fourth binding pocket for antigen
peptide side chains) in two groups; those with DRB1*03:01 and those with either DRB1*04:01 or DRB1*04:05.
These alleles are the major susceptibility alleles for autoimmune hepatitis in the European, European American
(the first two), and Japanese populations (the last). All alleles are able to form a salt bridge; however, those
carrying DRB1*03:01 have a double strike because they have positively charged amino acids at positions 71
(lysine K) and 74 (arginine R), while those with DRB1*04:01 and DRB1*04:05 have only a single strike because
though they have a positively charged amino acid (lysine or arginine) at position 71, they have neutral alanine (A) at position 74. This may account for
the observation that AIH patients with DRB1*03:01 tend to present earlier and may have more severe disease.
The model presented is similar to that seen for other autoimmune diseases. In those with DRB1*04:06, there is
glutamate at position 74 and even though they have arginine at position 71 they are unable to form a salt bridge
because the negatively charged glutamate repels the negatively charged P4 peptide.
and DRB4 for DRB1*04:01/04:05. While DRB3 encodes LLEQKRGR, DRB4 does not.
Therefore patients with the DRB1*03:01 haplotype have at least two DRB alleles, each with
the same critical sequence, whereas those with DRB1*04:01 or DRB1*04:05 do not. It is
therefore possible that the increased number of LLEQKRGR-expressing alleles restricts the
selection of antigenic peptides available for binding to the expressed DR molecule. This
multihit hypothesis may explain why patients with DRB1*03:01 present earlier in life and
those with DRB1*04:05 present later. In fact, early-onset AIH is rarely seen in Japan or
Korea and the association with DRB1*04:01 is with late-onset cases in European populations.
Based on all of the information above, a single DRB1*03 allele may itself deliver
two hits: lysine 71 (first hit) and arginine 74 (second hit). In most cases this will
increase from two to four hits per haplotype, because the haplotype will also carry
the DRB3 gene that encodes the LLEQKRGR sequence. Both
genes on the same haplotype have a potential two-hit locus (two times two equals four).
Homozygotes for the DRB1*03–DRB3 haplotype may have eight hits.
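The hit counting in this multihit hypothesis can be made explicit with a small Python sketch. It scores one hit for each positively charged residue (lysine or arginine) at positions 71 and 74 of each expressed DRB gene, then sums over the genes on a haplotype and over both haplotypes. The residue assignments follow the text; scoring DRB4 as zero hits is an assumption (the text says only that DRB4 does not encode LLEQKRGR), and the function names are hypothetical.

```python
# Simplifying assumption: lysine (K) and arginine (R) are the positive residues.
POSITIVE = {"K", "R"}

# (residue at position 71, residue at position 74) for each expressed DRB gene.
RESIDUES_71_74 = {
    "DRB1*03:01": ("K", "R"),  # LLEQKRGR
    "DRB3": ("K", "R"),        # also encodes LLEQKRGR
    "DRB1*04:01": ("K", "A"),
    "DRB1*04:05": ("R", "A"),
    "DRB4": None,              # assumption: lacks the motif, scored as no hits
}


def gene_hits(gene: str) -> int:
    """Count positively charged residues at positions 71 and 74."""
    residues = RESIDUES_71_74[gene]
    if residues is None:
        return 0
    return sum(residue in POSITIVE for residue in residues)


def haplotype_hits(genes) -> int:
    """Sum the hits over every expressed DRB gene on one haplotype."""
    return sum(gene_hits(gene) for gene in genes)


def genotype_hits(haplotype_1, haplotype_2) -> int:
    """Sum the hits over both haplotypes of an individual."""
    return haplotype_hits(haplotype_1) + haplotype_hits(haplotype_2)


DR3 = ("DRB1*03:01", "DRB3")
DR4 = ("DRB1*04:01", "DRB4")

print(gene_hits("DRB1*03:01"))   # 2
print(haplotype_hits(DR3))       # 4
print(genotype_hits(DR3, DR3))   # 8
print(haplotype_hits(DR4))       # 1
```

Under these assumptions a DRB1*03:01–DRB3 homozygote scores eight hits, whereas a DRB1*04:01 haplotype scores one, which matches the gradient proposed for early-onset versus late-onset disease.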
This does not rule out the possibility that other genes within the same MHC haplotype
are having an impact on disease risk. It is worth noting that recent findings on rheumatoid
arthritis indicate the possibility of another three MHC genes encoding five amino acids
as risk factors. With respect to AIH, it is also important to understand when comparing
populations that the DRB1*03:01 allele and the ancestral HLA 8.1 haplotype to which
most DRB1*03:01 alleles are connected are rare in Japan and Korea as is the DRB1*04:01
allele. It is therefore not surprising when the associations reported in Europe and the USA
cannot be replicated in these quite distinct populations, and it is a major advance when
unifying hypotheses (such as that above) can be developed from what may appear to be
conflicting data.
Table 6.11 Primary sclerosing cholangitis and HLA alleles and haplotypes.
            HLA locus
Haplotype   A      C      B      DRB3/4/5   DRB1     DQA1     DQB1
Haplotypes that encode susceptibility to PSC
8.1         *01    *07    *08    3*01:01    *03:01   *05:01   *02:01
7.2         *03    *07    *07    5*01:01    *15:01   *01:02   *06:02
DRB1*13     –      –      –      3*01:01    *13:01   *01:03   *06:03
Haplotypes that encode resistance (protection) to PSC
DRB1*04     –      –      –      4*01:03    *04:01   *03      *03:02
DRB1*07     –      –      –      4*01:03    *07:01   *02:01   *03:03
MICA*002    –      –      –      –          –        –        –
Note that the haplotype 7.2 is only weakly associated with PSC as is the DRB1*07 haplotype. The MICA*002
association is discussed later in the chapter. Note that both the 8.1 and 7.2 haplotypes carry the MICA*008 allele.
Alleles are not listed at all loci for all haplotypes; in each case a dash is entered to indicate this.
[Figure 6.17 labels: P9, N37; P1, V86.]
Figure 6.17: HLA DR binding groove and HLA-encoded susceptibility to PSC. The figure shows a simplified
version of the HLA peptide-antigen binding groove from a top-down perspective. The amino acids asparagine
(N) at position 37 and valine (V) at position 86 are highlighted. These two amino acids interact at the P9 and
P1 binding pockets, respectively. Studies indicate that HLA DRB1 alleles carrying the amino acids asparagine 37
and valine 86 are most strongly associated with PSC in European populations and account for the majority of
HLA-encoded susceptibility to PSC. (Adapted from Hov JR, Kosmoliaptsis V, Traherne JA et al. [2011] Hepatology
53:1967-1976. With permission from John Wiley and Sons.)
remain the same, and it is interesting that both AIH and PSC favor associations with DRB
rather than DQ (as does rheumatoid arthritis).
MICA in PSC
The MICA genes are encoded in the so-called MHC class III region, and are stress induc-
ible and expressed on the gastric epithelium. MICA can activate NK cells and may also
activate γδ T cells. All of these factors make them ideal candidates to consider when look-
ing at the MHC in PSC. There is considerable polymorphism in MICA. Based on their
putative function, their location, and the level of polymorphism in the MICA and MICB
genes, two studies in 2001 investigated MICA in PSC: one from the UK and the other
from Norway. The UK study described a strong association with MICA*008 and a strong
dominant protective effect of MICA*002. The UK study included two patient groups
totaling 112 patients, and both groups showed statistically significant associations
with these alleles. Overall, the MICA*008 association appeared to be due to a very
high frequency of homozygous patients.
The Norwegian study described associations with the MICA5.1 and MICB24 markers.
The method employed for MIC genotyping in this study was different from that above.
MICA5.1 is a marker for the ancestral 8.1 haplotype. This study did not agree with that
from the UK. One reason for this is the different techniques; the other lies in a subtle
difference between the two populations. These two populations, though similar, are not
the same. At the time of this study the Norwegian PSC patients were all diagnosed to be
positive for ulcerative colitis, whereas in the UK only 70% of cases of PSC are diagnosed
to be positive for ulcerative colitis.
The UK set have a weak but confirmed association with the ancestral 7.2 haplotype.
This association is not seen in either Sweden or Norway. The MICA*008 allele is carried
on both the HLA 8.1 and 7.2 ancestral haplotypes (haplotypes 1 and 2 in Table 6.11).
The absence of the weak association with the 7.2 haplotype in the Scandinavian popu-
lation may account for this difference, and it may also explain the difference reported
at the molecular level regarding electrostatic charge and position DRβ-37 versus posi-
tion DRβ-38. This difference may also reflect genuine population differences and case
ascertainment criteria.
In the case of DRβ-38 versus DRβ-37, the latest data in favor of DRβ-37 are very
convincing and appear to be correct, but the potential role of MICA has yet to be
confirmed or refuted. There are four reasons for this:
• The Scandinavian team did not identify MICA*008, but a marker for the HLA 8.1
haplotype, MICA5.1.
• The sample sizes in the two studies were comparable (112 versus 130).
• The location of MICA expression in relation to the disease.
• The function of MICA and the fact that the liver has very high numbers of NK and
γδ T cells.
These observations are likely to be only a piece of the final picture and other MHC genes
may have a role to play in PSC. Until very large-series studies of the whole MHC region
have been performed in PSC we will have to wait to find out what else is hiding within this
region. Finally, we do not yet know the antigenic peptides that are preferentially bound by
these HLA molecules. Identifying these is the essential theme for this research as this will
provide insight into the disease pathogenesis and will hopefully lead to novel therapies.
As the associations with DRB1*11 and DRB1*13 are weak in some populations this model
is speculative, but it provides a basis for considering the molecular biology of this associa-
tion and the binding groove, and moves the discussion away from the simple process of
genotype collecting. Three of the four positions above are of potential functional impor-
tance. The amino acid at position 13 affects the binding of antigenic side chains associated
with both the P4 and P7 binding pockets, while the amino acids at positions 47 and 74
affect P4 and P6, respectively.
PBC and ankylosing spondylitis, but there does not appear to be any overlap between
PSC and PBC, though some members of both risk groups have roles in cytokine produc-
tion and regulation (IL12A in PBC; IL2 and IL2RA in PSC). Ulcerative colitis (usually
mild) occurs in 70% of all PSC cases. These two diseases share a number of risk alleles
and have some that are not shared. Two examples of PSC/ulcerative colitis risk alleles are
MST1 and IL21; others are found in PSC only, e.g. IL2RA and SIK.
The MHC class III region: complement, MICA, and TNFA genes in complex disease
The gene products of the MHC class III region are an assortment of proteins, many with
important roles in immunity and immune regulation. We will consider six genes but only
one disease example. The genes are:
• Complement C2, C4A, C4B, and Bf.
• MICA (MHC class I chain-related A).
• TNFA (TNF-α).
[Figure 6.18 labels: C4, C2, Bf; C4bC2a; C3bB; D; C3; BbC3b; Bb; Ba; C3a; C3b.]
Figure 6.18: Complement system and C3 conversion. The two major complement pathways—the classical
pathway and the alternative pathway—converge at the point at which C3 is converted into subunits C3a and
C3b. C2a and C4b subunits are primarily responsible for this process in the classical pathway. Thus, converted
C3 has many different functions. C3a is active in macrophage recruitment. C3b interacts with the factor B of the
alternative pathway and along with factor D cleaves the protein into two subunits: Ba and Bb. The Bb subunit
interacts with C3b and forms a C3 convertase. Through this interaction C3 once converted creates a positive
feedback loop for further conversion of C3. C3 conversion producing C3b also leads to C5 conversion in the
classical pathway. In many ways C3 conversion is the critical stage in the complement pathways.
complement genes are C2, C4A, and C4B for the classical pathway and Bf (factor B) in
the alternative pathway.
Polymorphisms in the C2 and C4 genes have been associated with a number of autoimmune
diseases. One reason for this is the very strong linkage disequilibrium in the HLA 8.1 ances-
tral haplotype. Studies of complement genes in diseases where there are associations with
this haplotype will also report associations with C4A null alleles. These nulls are large-scale
deletions of the C4A–CYP21A region of 6p21.3 and they do not have a functional product.
Though these are called null alleles by some, they are in fact null genes as the whole gene
sequence is deleted. However, we each inherit two copies of our genome and in complement
there are two functional forms of C4, designated C4A and C4B. As these have similar func-
tions, the partial absence of an expressed C4A gene (heterozygote) or even complete absence
(homozygote) frequently has no phenotypic consequences. C4B simply makes up for C4A.
Interestingly, there are also two forms of C4B, designated long and short, which differ by as
much as 6 kb in size and there are also null alleles for C2. Copy number variation (CNV)
also occurs with these genes whereby some haplotypes carry multiple copies of the genes. As
C4A null is frequently carried on the HLA 8.1 haplotype, it is difficult to determine whether or
not this null allele has any effect in diseases with the HLA 8.1 association. However, there is
one outstanding exception: systemic lupus erythematosus (SLE).
and MICE). Polymorphisms in the A and B genes have been related to a number of
autoimmune diseases including Behçet’s disease and Addison’s disease. MICA encodes an
MHC class I-like molecule with three extracellular immunoglobulin domains, a trans-
membrane segment, and a cytoplasmic tail to anchor the molecule to the cell. MICA is
stress inducible and expressed on gastric epithelium. MICA can activate NK cells and may
also activate γδ T cells. There is considerable polymorphism in MICA. Currently, there are
101 MICA and 41 MICB alleles (encoding 80 and 27 proteins, respectively). A potential
association between MICA polymorphism and PSC is reported above.
TNF-α
TNF-α is encoded by the TNFA gene in the MHC class III region. It is a pro-inflammatory
cytokine with powerful effects that can be localized to infected tissue or produced systemi-
cally through the body. TNF-α is released by macrophages following Toll-like receptor
stimulation. It can be beneficial or harmful depending on whether it is local or systemic.
The TNFA gene carries a number of well-characterized SNPs that have been investigated in
a number of diseases. The most commonly investigated SNP is the A/G –308 SNP.
The TNFA-308A allele has been identified as being associated with increased TNF-α
production. Early studies found it was increased in most diseases studied. However,
interest in TNFA soon faded and then stopped when it was realized that the TNFA-308A
allele was in linkage disequilibrium on the HLA 8.1 haplotype. Interest in TNFA did, however,
spark a general interest in cytokine gene polymorphisms as potential potent immune-regulatory
genes.
that these links to common haplotypes reflect the fact that each carries more than one
potential risk allele. However, this does not suggest that the disease results from the effect
of several genes; on the contrary, it suggests perhaps a range of diseases are associated
with the same haplotype because the haplotype encodes several risk loci. For example, an
allele that is associated with an increased risk of SLE may be associated with AIH through
carriage on the common 8.1 haplotype. In SLE, it is the C4A gene that is thought to
be important in disease pathogenesis, whereas in AIH it is thought to be DRβ-71 and
DRβ-74. In this hypothesis the association between AIH and C4A*Q0 simply reflects
linkage disequilibrium in the 8.1 haplotype, and the real disease risk is explained by the
association with DRB1 alleles, which carry lysine (and arginine) at DRβ-71. This suggests
that the association with C4A*Q0 may be false and due simply to linkage disequilibrium.
In SLE, the situation is reversed: the association with HLA genes on the 8.1 haplotype
may be simply due to linkage disequilibrium with the C4A*Q0 allele.
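The way linkage disequilibrium can generate a secondary association may be easier to see with numbers. The following Python sketch is illustrative only: the allele and haplotype frequencies are invented for the example and are not measured values for the 8.1 haplotype, and the function name ld_stats is our own. It calculates the standard pairwise measures of linkage disequilibrium (D, D′, and r²) between two biallelic loci, here standing in for a DRB1 risk allele and C4A*Q0.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D_prime, r2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying both allele A and allele B
    p_a  : frequency of allele A at the first locus  (0 < p_a < 1)
    p_b  : frequency of allele B at the second locus (0 < p_b < 1)
    """
    # D is the excess of the AB haplotype over its expectation under independence.
    d = p_ab - p_a * p_b
    # D' rescales D by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    # r2 is the squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical frequencies (assumptions for illustration only):
# allele A = DRB1 risk allele, allele B = C4A*Q0.
d, d_prime, r2 = ld_stats(p_ab=0.11, p_a=0.12, p_b=0.14)
print(f"D = {d:.4f}, D' = {d_prime:.2f}, r2 = {r2:.2f}")
```

With these assumed frequencies r² is high, so a study that genotyped only the second locus would pick up much of the signal generated by the first. This is the situation proposed above for C4A*Q0 in AIH.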
[Figure 6.19 labels: HLA class II; TNFA; Complement; MICA; Disease risk]
Figure 6.19: Multihit hypothesis. The figure shows the potential for multiple genes to increase disease
susceptibility. Here the figure shows five different MHC gene families or genes. However, these are not the only
risk genes for the traits discussed in this chapter. There is an indication that genes may interact, e.g. TNFA and
MHC class II. Thus, it is possible that polymorphisms only increase disease risk when other alleles are present
or overexpressed. Finally, the figure also allows for redundancy. In this case, the complement system is used
as an example. The gray arrow indicates that even in the presence of complement gene-encoded risk alleles,
possession of such an allele may not have an effect unless the biological system that the gene products serve is
stressed. Under such conditions having a null allele for C4A, for example, may lead to failure to produce adequate
complement levels and this may have downstream consequences in terms of immune complex clearance, etc.
of why certain haplotypes are associated with several diseases. For example, the extended
8.1 haplotype includes HLA-A*01–C*07:01–B*08–MICA*008–TNFA-308**–C4A*Q0–
C4B*1–C2*C–Bf*S–DRB3*01:01–DRB1*03:01–DQA1*05:01–DQB1*02:01; one or any
combination of these individual alleles may increase or reduce a particular disease risk.
A combination effect is partly illustrated by the example of AIH above. This possibil-
ity may explain the predominance of this haplotype in autoimmune disease. Of course,
this is a common haplotype and most carriers do not develop autoimmune diseases, but
possession of this group of risk alleles under immune stress may be sufficient to tip the
balance in the wrong direction. The converse is also possible under certain circumstances;
for example, during a major epidemic it may be beneficial to have this haplotype because
possession of the haplotype may result in a mild response to the infectious agent. This may
explain the high frequency of some haplotypes in the population.
According to this explanation there are multiple risk alleles within the MHC. Each hap-
lotype will carry multiple hits, but the same haplotype does not necessarily operate in the
same way in each disease. The same DR, DQ, and C4 polymorphisms may impact on
different diseases differently, each making a large, small, or no contribution to the risk
portfolio. The multihit model can be exclusive.
it has become possible to increase the level of resolution and to focus on areas previously
ignored, allowing hitherto undetected, often undetectable relationships to be identified.
Therefore in studies of the MHC we often see the focus of research move from one loca-
tion to another. We must remember that had there been no initial interest in HLA, these
associations would not have been identified and perhaps in many cases the idea that these
complex diseases would be worthy of genetic investigation would have gone unmarked.
Studies of HLA in disease go back at least as far as 1967, and flowered in the 1970s and
1980s. These were periods during which even with poor technology, great advances were
made in a range of diseases and these were the stepping stones on which present-day
advances in the MHC and elsewhere began.
Conclusions
This chapter contains many key messages for the student of the genetics of complex dis-
eases. The early history of the MHC illustrates the importance of having standardized
nomenclature for complex systems. This will be of increasing importance in the post-
genome era with high-resolution genotyping being applied to more and more genes, creat-
ing a potential for personalized naming systems to arise.
The MHC is very complex with extreme levels of linkage disequilibrium and polymor-
phism. The high levels of linkage disequilibrium can be both helpful and problematic. In
particular, problems arise when data based around a single gene are over-interpreted. The
MHC illustrates the importance of studying whole haplotypes and not focusing on single
genes until there is sufficient evidence to be confident that the candidate has been identi-
fied. However, systems interact and genetic interaction on extended haplotypes can turn
the most carefully formulated hypotheses into nonsense.
In a more positive frame of mind, the examples discussed enable us to see the link between
genotype and phenotype, and the way genetics informs the debate on disease pathogenesis
at the molecular level. The key among these is the role of HLA in the formation of the
immune synapse—a process that is essential in adaptive immunity.
Though a key example, it would be wrong to consider the formation of the T cell synapse
as the only immune process influenced by MHC genes. HLA alleles have important roles
in innate immunity through NK cell activation and regulation.
Genetic associations with the MHC are not all the result of HLA polymorphism. This large
region encodes many immune-regulatory genes with diverse functions. Polymorphism in
any of these non-HLA immune-regulatory genes has the potential to impact on disease
susceptibility. For example, complement is a key element in immune complex clearance
and bacterial destruction, MICA genes impact on innate immunity, and the list goes on.
Finally, the examples discussed in this chapter also allow us to consider the difficulties
and complexity of assessing genetic studies in this area, and how we can approach these
subjects with a critical mind.
In summary, the MHC is a large densely packed genetic region within the human genome.
The genes within this region are particularly important in immune regulation, and there
are very strong links between polymorphisms in these genes and many complex diseases.
The region has been the focus for a considerable amount of research; however, much
remains to be done and, despite some very promising findings and hypotheses, the link
between genotype and phenotype is yet to be fully established. Interestingly, with current
technologies we are beginning to unpick the MHC to a greater extent than before and
identify multiple associations within the region. For example, the WTCCC2 study in
2011 suggested that there may be an independent association with an HLA-A allele that
confers some protection from multiple sclerosis. These technologies will be discussed in
more detail later in the book.
Further Reading
Books/Theses
Donaldson PT. The influence of donor recipient histocompatibility and immunogenetics on the outcome of clinical liver transplants. PhD Thesis, Faculty of Clinical Medical Sciences, King’s College London. 1995. This thesis gives a very thorough history of the development of transplantation from 1829, with Blundell's first use of blood transfusion, up to 1995.
Hillert J & Fogdell-Hahn A (2000) HLA and neurological diseases. In HLA in Health and Disease (Lechler R & Warrens A eds), pp 219–230. Academic Press. This chapter is full of data on HLA, and both multiple sclerosis and narcolepsy.
Parham P (2009) The Immune System, 3rd ed. Garland Science. This is an excellent basic immunology book for the student of human immunology including links to immunogenetics.
Van Rood JJ (2000) The history of the discovery of HLA. In HLA in Health and Disease (Lechler R & Warrens A eds), pp 3–22. Academic Press. This is an excellent overview of the early days of HLA, and the flowering of HLA nomenclature and typing methods, by one of its founding fathers.
Articles
Arnett KL, Haung W, Valiante NM et al. (1998) The Bw4/Bw6 difference between HLA-B*08:02 and HLA-B*08:01 changes the peptides endogenously bound and the stimulation of alloreactive T cells. Immunogenetics 48:56–61.
Bauer KH (2007) Homiotransplantation von epidermis bei eineiigen zwillingen. Bruns Beit Klin Chir 141:442–447. This is an example of the early work on histocompatibility and raises awareness that not all papers were published in English.
Blundell J (1829) Observations on transfusion of blood. Lancet 12:321–324. A fascinating insight into very early clinical investigation.
Bjorkman PJ, Saper MA, Samraoui B et al. (1987) Structure of the human class I histocompatibility antigen, HLA-A2. Nature 329:506–512.
Bodmer WF, Albert E, Bodmer JG et al. (1985) Nomenclature for factors of the HLA system, 1984. WHO Bull 63:399–405. This is the original report of 1984 IHTW identifying HLA DQ and DP for the first time.
Brown JH, Jardetzky TS, Gorga JC et al. (1993) Three dimensional structure of the human class II histocompatibility antigen HLA-DR1. Nature 364:33–39.
Donaldson PT (2011) Electrostatic modifications of the human leucocyte antigen DR P9 peptide-binding pocket primary sclerosing cholangitis: Back to the future with human leucocyte antigen DRβ. Hepatology 53:1798–1800. This is the Editorial for the paper by Hov et al. (2011) written by one of the authors of this book. The Editorial explains the history of HLA and PSC.
Donaldson PT & Norris S (2002) Evaluation of the role of MHC class II alleles, haplotypes and selected amino acid sequences in primary sclerosing cholangitis. Autoimmunity 35:555–564.
Han F, Faraco J, Dong XS et al. (2013) Genome wide analysis of narcolepsy in China implicates novel immune loci and reveals changes in association prior to versus after the 2009 H1N1 influenza pandemic. PLoS Genet 9(10):e1003880.
Horton R, Wilming L, Rand V et al. (2004) Gene map of the extended human MHC. Nat Rev Genet 5:889–899.
Hov JR, Kosmoliaptsis V, Traherne JA et al. (2011) Electrostatic modifications of the human leukocyte antigen-DR P9 peptide-binding pocket and susceptibility to primary sclerosing cholangitis. Hepatology 53:1967–1976. This is an excellent article that discusses the past and present interpretations of HLA associations with PSC, introducing some very up-to-date methods for data analysis and providing a new model to account for these genetic associations.
Landsteiner K (1900) Zur kenntniss der antifermentativen lytischen und agglutinierenden Wirkungen blutserums und der lymphe. Zbl Bakt 27:357–362. Once again this paper reminds the student that not all science was and is published in English. We should endeavor to be aware of this especially when we consider that Landsteiner was awarded a Nobel Prize for his work.
Liu JZ, Almarri MA, Gaffney DJ et al. (2012) Dense fine-mapping study identifies new susceptibility loci for primary biliary cirrhosis. Nat Genet 44:1137–1141.
Liu JZ, Hov JR, Folseaas T et al. (2013) Dense genotyping of immune-related disease regions identifies nine new risk loci for primary sclerosing cholangitis. Nat Genet 45:670–675.
Norris S, Kondeatis E, Collins R et al. (2001) Mapping MHC-encoded susceptibility and resistance in primary sclerosing cholangitis: The role of MICA polymorphism. Gastroenterology 120:1475–1482.
Siebold C, Hansen BE, Wyer JR et al. (2004) Crystal structure of HLA-DQ0602 that protects against type 1 diabetes and confers strong susceptibility to narcolepsy. Proc Natl Acad Sci USA 101:1999–2004. This original paper compares two diseases and discusses complex molecular models to explain the HLA associations with them. Not for the faint-hearted, but a great paper that requires some consideration.
Terasaki PI & McClelland JD (1964) Microdroplet assay of human serum cytotoxins. Nature 204:998–1000. This original report opened the door to the use of micro assays for HLA typing, a method almost universally adopted for HLA class I typing in the 1970s and 1980s until molecular genetics techniques became available.
The International Multiple Sclerosis Genetics Consortium & The Wellcome Trust Case Control Consortium 2 (2011) Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature 476:214–219.
Todd JA, Bell JI & McDevitt HO (1987) HLA-DQβ gene contributes to susceptibility and resistance to insulin-dependent diabetes mellitus. Nature 329:599–604.
Wiencke K, Spurkland A, Schrumpf E & Boberg KM (2001) Primary sclerosing cholangitis is associated to an extended B8-DR3 haplotype including particular MICA and MICB alleles. Hepatology 34:625–630. The study by Norris et al. above and this study by Wiencke et al. show slightly different results. This is common in association studies between competing centers. Confounding results may be based on many factors. In this example the studies are both based on similar-sized patient pools, but use different methods and different populations. This is an example that illustrates many of the concepts of studies in complex diseases, and together these two papers provide an excellent basis for critical appraisal.
Online sources
http://hla.alleles.org/antigens/index.html This is an excellent source for up-to-date information on HLA nomenclature.
http://www.ebi.ac.uk/imgt/hla/stats.html This is an excellent source for up-to-date information on HLA allele numbers and nomenclature. The website includes text on the history of HLA nomenclature.
Infectious diseases account for the majority of morbidity and mortality in human popula-
tions. In a classical sense, we would not normally consider infectious diseases as genetic
diseases. However, variation in the human genome can lead to variation in resistance and
susceptibility to infectious disease, and in this sense it is correct to say that there is a genetic
component in infectious disease. In this chapter, we will consider how the evolutionary
pressure of infectious disease has led to particular genetic variants being maintained in a
human population and will examine molecular mechanisms that link particular polymor-
phisms with susceptibility to infection. Two key examples, malaria and human immuno-
deficiency virus type 1 (HIV-1), will be used to illustrate how different clinical aspects of
infectious disease are influenced by host genetic variation. In addition, we will discuss the
use of hypothesis-driven and genome-wide studies to identify specific human genes that
influence the severity of infectious disease.
[Figure 7.1 labels: Pathogen; Host; Exposure; Infection; Multiplication of pathogen; Pathogenesis; Response to treatment; Disease outcome (symptom-free, uninfected; symptom-free, carrier; local/systemic infection; mild/severe disease; acute/chronic disease; alternative symptoms; death)]
Figure 7.1: The infection process. All infectious diseases follow a similar process. First, the host must be
exposed to the pathogen that may then infect the host, replicating either on host cell surfaces or inside cells. The
infection process often requires the pathogen to bind to a host cell receptor protein. The outcome of infection
varies enormously, depending upon the specific interaction between host and pathogen, and ranges from
asymptomatic infection to mild disease, severe disease, and death. Treatment of infection (either before or after
exposure) may inhibit the infection process at one or more steps in the pathway, and there is significant variation
in the response of the patient and the pathogen to treatment.
when populations were more geographically isolated than they are today. In 1519, Spanish
Conquistadors led by Hernando Cortes landed in Mexico. Unknown to the Conquistadors,
they were taking with them a number of infectious agents, including measles, smallpox,
and the influenza viruses. They also carried bacteria, including Rickettsia, which can cause
typhus. The invading populations were relatively resistant to these pathogens, whereas the
indigenous South American populations, including the Mexican Aztecs, were highly sus-
ceptible to these infectious agents and died in vast numbers as a result. For example, small-
pox became epidemic in Mexico between 1520 and 1521, and may have killed an estimated
10–50% of the local population. Several factors are likely to have contributed to the dif-
ference in susceptibility between the invading and indigenous populations. The invaders
had already been exposed to these pathogens and would therefore have acquired immune
memory. In contrast, the indigenous populations who had not previously been exposed to
these pathogens had no acquired immune memory, making them more vulnerable to infec-
tion. However, host genetic differences between the two populations may also have played
a significant role. Pathogens such as smallpox had been common among European popula-
tions for many years, and individuals with allelic variants conferring resistance to smallpox
and other common pathogens would have had a survival advantage over those without such
alleles. As a consequence of this selective pressure, resistance alleles may have been present
at a higher frequency in the invading populations than in South American populations,
where no such selection pressure had occurred. Thus, a difference in the relative frequency
of resistance-associated alleles in the two populations may have contributed to the greater
susceptibility of the South American people to the imported pathogens.
There are many other examples in history of naïve populations being devastated by infec-
tious diseases brought to them by immigrant populations. Haldane, in 1949, suggested
that “Europeans have used their genetic resistance to such viruses as that of measles (rube-
ola) as a weapon against primitive peoples as effective as fire-arms”. Yet the reverse rule also
applies: Europeans, migrating to colonial regions, have found themselves more susceptible
to local infectious agents and in some cases early colonies were wiped out by exposure to
infectious disease.
Table 7.1 Selected examples of rare monogenic disorders that cause defects in the immune response
leading to increased susceptibility to infectious disease.
association with a specific infectious disease. This has allowed researchers to cast a much
wider net in the search for relevant alleles.
7.4 Malaria
Malaria in humans is caused by protozoa of the genus Plasmodium, most commonly by one of four species:
Plasmodium falciparum, Plasmodium vivax, Plasmodium ovale, and Plasmodium malariae.
Of these, P. falciparum and P. vivax are the most common, with P. falciparum causing more
severe disease than P. vivax. The parasite is carried by the female Anopheles
mosquito, which thrives in wet, humid areas.
[Figure 7.2 labels. Mosquito stages: gametocytes ingested with blood meal; exflagellation to form male gametes; female gamete; fertilization; zygote; motile ookinete formed; ookinete penetrates gut epithelium; oocyst forms; oocyst undergoes sporogony; sporozoites released; sporozoites migrate to salivary glands via hemocoel. Feeding mosquito injects sporozoites and ingests gametocytes. Human stages: sporozoites enter blood, penetrate liver cell; liver stage; merozoites released; asexual blood cycle; merozoites released as erythrocyte is lysed; gametocytes.]
Figure 7.2: The Plasmodium life cycle. Plasmodium parasites that cause malaria have a complex life cycle
involving a mosquito insect vector. When an infected mosquito bites a human host, Plasmodium sporozoites may
be introduced into the blood and travel to the liver. Here, the Plasmodium enters the next phase in its life cycle,
producing merozoites that infect red blood cells. Gametocytes are released from the red blood cells and can be
ingested when the host is bitten by another mosquito. The sexual phase of the Plasmodium life cycle takes place
in the gut of the mosquito and ultimately new sporozoites are produced. These travel to the salivary glands of the
mosquito from where they can be transmitted to the next host. (From Loker ES & Hofkin BV [2015] Parasitology.
Garland Science.)
[Figure 7.3 labels: subunits α1, α2, β1, β2]
Figure 7.3: Structure of hemoglobin. Hemoglobin is a tetrameric protein composed of four polypeptide units
that is found in red blood cells and is responsible for the transport of oxygen in the blood. The most common
form of hemoglobin in adults is HbA, which consists of two α polypeptides and two β polypeptides. Each of these
polypeptides forms a complex structure with an iron-containing heme prosthetic group, which is able to bind one
oxygen molecule (shown above). Diseases caused by mutations that affect the structure or expression levels of
hemoglobin polypeptides are called hemoglobinopathies.
sickle cell anemia. Heterozygotes (HbAS), however, have sickle cell trait and suffer few if
any symptoms.
Thalassemia
Thalassemia is caused by mutations in globin genes that reduce the expression of α-globin
(α-thalassemia) or β-globin (β-thalassemia). There are two α-globin genes, closely linked
on chromosome 16, and therefore two types of α-thalassemia: α+-thalassemia occurs when
one α-globin gene is defective and α0-thalassemia occurs when both are defective. There
are five possible allele combinations for α-thalassemia, as illustrated in Table 7.2.
β-thalassemia is caused by defects in the β-globin gene. β-thalassemia heterozygotes (β/–)
suffer mild anemia and morphological changes to red blood cells, whereas β-thalassemia
homozygotes suffer severe anemia that can be fatal without treatment.
Genotype                                    Phenotype
Normal homozygote (αα/αα)                   Normal
α+-thalassemia heterozygote (–α/αα)         No symptoms
α+-thalassemia homozygote (–α/–α)           Mild anemia and small red blood cells with reduced hemoglobin
α0-thalassemia heterozygote (– –/αα)        Mild anemia
α0-thalassemia homozygote (– –/– –)         Lethal
[Figure 7.4 map legend: percentage of population that has the sickle cell allele (HbS): >6, 2–6; endemic P. falciparum malaria]
Figure 7.4: Comparison of global distribution of the HbS allele and malaria. The global distribution of
the HbS sickle cell allele corresponds strikingly with the distribution of P. falciparum. This correlation was first
recorded in 1954 by Allison, who proposed that the HbS allele conferred resistance to P. falciparum and went on
to demonstrate that red blood cells from HbS heterozygotes were resistant to infection by P. falciparum. (From
Berg JM, Tymoczko JL & Stryer L [2007] Biochemistry, 6th ed. With permission from W. H. Freeman.)
the erythrocytes of individuals who are heterozygous for the sickle cell allele (HbAS) are
more resistant to infection by P. falciparum than wild-type (HbA) homozygotes. Allison
concluded that “those who are heterozygous for the sickle-cell gene will have a selective
advantage in regions where malaria is hyper-endemic.” His work provided strong evidence
to support the malaria hypothesis, although at the time Allison was unaware of Haldane’s
earlier work and had reached the same conclusions independently.
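Allison's conclusion corresponds to what population geneticists call heterozygote advantage, a form of balancing selection. A minimal sketch follows, assuming a simple one-locus model with constant fitnesses and random mating. The selection coefficients s (fitness cost of malaria to HbA homozygotes) and t (fitness cost of sickle cell anemia to HbS homozygotes) are hypothetical values chosen for illustration, not estimates from the studies described here, and the function names are our own.

```python
def next_hbs_freq(q, s, t):
    """HbS allele frequency after one generation of selection.

    Assumed relative fitnesses: HbAA = 1 - s, HbAS = 1, HbSS = 1 - t.
    """
    p = 1.0 - q
    mean_fitness = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)
    # HbS alleles in the next generation come from half of the surviving
    # heterozygotes (p*q after halving 2pq) and all surviving HbSS homozygotes.
    return (p * q + q * q * (1 - t)) / mean_fitness


def simulate(q0, s, t, generations):
    """Iterate the selection model starting from allele frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_hbs_freq(q, s, t)
    return q


def equilibrium_freq(s, t):
    """Stable HbS frequency predicted by the model: s / (s + t)."""
    return s / (s + t)


# Hypothetical selection coefficients, for illustration only.
s, t = 0.1, 0.8
print(f"Simulated frequency after 500 generations: {simulate(0.001, s, t, 500):.4f}")
print(f"Predicted equilibrium frequency:           {equilibrium_freq(s, t):.4f}")
```

In this model a rare HbS allele rises in frequency because heterozygotes are the fittest genotype, but it does not go to fixation because HbS homozygotes are strongly disadvantaged. Setting s to zero (no malaria) removes the advantage and the allele declines.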
More recent studies on the distribution of sickle cell trait indicate that 9% of the popula-
tion carry the HbS allele in parts of Africa stretching from southern Ghana to northern
Zambia and a frequency of 18.25% was recorded in northern Angola. A case control study
in West Africa in 1991 indicated that children with the HbS allele had a 92% reduction in
the relative risk for severe malaria compared with those without the HbS allele.
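Two small calculations help in reading figures of this kind. The sketch below makes two assumptions that should be stated plainly. First, it assumes Hardy–Weinberg proportions in order to convert the fraction of people carrying at least one HbS allele (9%) into an allele frequency. Second, the case-control counts are hypothetical numbers invented for illustration; they are not the data from the 1991 study. A case-control study estimates an odds ratio, which approximates the relative risk when the outcome is rare.

```python
import math


def allele_freq_from_carriers(carrier_fraction):
    """Allele frequency q implied by the fraction of people carrying at
    least one copy, assuming Hardy-Weinberg proportions:
    carriers = 1 - (1 - q)**2."""
    return 1.0 - math.sqrt(1.0 - carrier_fraction)


def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Odds ratio for disease given possession of the allele (2 x 2 table)."""
    return (cases_with * controls_without) / (cases_without * controls_with)


q = allele_freq_from_carriers(0.09)
print(f"HbS allele frequency implied by 9% carriers: {q:.3f}")

# Hypothetical counts (illustration only): severe malaria cases and
# healthy controls, with and without the HbS allele.
or_hbs = odds_ratio(cases_with=5, cases_without=495,
                    controls_with=60, controls_without=440)
print(f"Odds ratio = {or_hbs:.3f}, "
      f"i.e. roughly a {100 * (1 - or_hbs):.0f}% reduction in risk")
```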
means less cyto-adherence, offering a clear mechanism for protection against severe
malaria. It has also been suggested that the polymerization of hemoglobin in HbAS
erythrocytes may limit the growth of the Plasmodium parasite within the cells.
Rather surprisingly, in 1996 Williams et al. found that children with α+-thalassemia on the Pacific
island of Espiritu Santo were more likely than non-thalassemic children to suffer from mild forms
of malaria. The effect was particularly significant in younger children
and for malaria caused by P. vivax. A possible explanation for this apparent paradox is that
α+-thalassemia makes these children more susceptible to infection by the less dangerous P.
vivax at a very young age, when they are still protected by maternal antibodies, so that they
develop immune memory that may protect them from more severe malaria later in life.
Another rather counterintuitive suggestion is that the microcytic anemia experienced
by α+-thalassemic individuals may itself offer protection from more severe malarial ane-
mia. Malarial anemia is caused by destruction of Plasmodium-infected erythrocytes.
α+-thalassemic patients have a high number of small circulating erythrocytes, each with
a relatively low concentration of hemoglobin. The rationale for the proposed mechanism
is that less hemoglobin is lost when a microcytic erythrocyte is destroyed, compared with
destruction of a normal erythrocyte. The resulting malarial anemia is therefore less severe.
could be divided into four distinct genetic subgroups. This is a clear example of popula-
tion stratification discussed in the opening chapters of this book. Once aware of the sub-
populations, they could be taken into account in the analysis. A total of 19 regions were
identified that showed genetic association with malarial resistance, one of which was the
HbS locus. The association with marker SNPs in the HbS region was, however, relatively
weak (P < 4 × 10^−7), which was puzzling, given the high frequency of the HbS allele in the
Gambian population and the strong protection that it confers against P. falciparum.
Could weak linkage disequilibrium between the HbS tagged SNPs and the HbS causal
SNP in the Gambian population be the reason for the unexpectedly weak association
signal? To address this question the investigators needed more detailed genotype informa-
tion and so they carried out high-resolution association mapping at the HbS locus, using
multipoint mapping/imputation. Imputation analysis is a statistical method used to
infer a genotype by combining information on sequence variation and haplotype struc-
ture—essentially it is a way of filling in missing genotype information. Once again the
investigators recognized that there was considerable local variation in haplotype structure
in Africa and therefore they sequenced the HbS region in 62 of the subjects and used this
as reference data for the imputation analysis, instead of using the HapMap samples from
the Yoruba people, which are more commonly used in African GWAS. The result was a far
more convincing association signal for the SNP rs334 (P = 4.5 × 10^−14).
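One way to see why weak linkage disequilibrium was a plausible explanation is the widely used approximation that, for a single causal variant and large samples, the expected 1-degree-of-freedom chi-square statistic at a marker is about r² times the statistic at the causal variant. The sketch below applies that approximation to the two P values quoted above. It is a back-of-the-envelope illustration and not the investigators' analysis: it treats the reported bound for the tag SNPs as if it were an exact value, ignores the difference between genotyped and imputed data, and uses function names of our own.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_from_p(p, lo=0.0, hi=200.0):
    """Chi-square statistic (1 df) that gives P value p, found by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_1df(mid) > p:
            lo = mid  # P value still too large: the statistic must be bigger
        else:
            hi = mid
    return (lo + hi) / 2.0


def implied_r2(p_marker, p_causal):
    """r2 between marker and causal SNP implied by the approximation
    chi2_marker ~ r2 * chi2_causal (an assumption, see the text)."""
    return chi2_from_p(p_marker) / chi2_from_p(p_causal)


# P values quoted in the text; the first is treated as exact for illustration.
r2 = implied_r2(p_marker=4e-7, p_causal=4.5e-14)
print(f"Implied r2 between tag SNPs and rs334: {r2:.2f}")
```

Under these assumptions the tag SNPs would be capturing only about half of the signal at the causal variant, which is consistent with the investigators' suspicion that the standard reference data did not represent the local haplotype structure well.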
7.5 HIV-1
HIV-1 is a retrovirus that infects macrophages and helper T cells (CD4+ T cells). The virus
is transmitted by close contact with infected blood and other body fluids, most commonly
during sexual intercourse, or via contaminated needles used in healthcare settings,
intravenous drug use, piercing, or tattooing. It can also be passed from mother to child
across the placenta, during birth, and in breast milk. In December 2013, there were 35.3
million people infected with HIV-1 worldwide.
On entering the host, the gp120 glycoprotein on the surface of the virus binds to its pri-
mary receptor, the CD4 protein, on the surface of the CD4+ T cells and macrophages.
This initial binding event results in a conformational change in the gp120 protein, allow-
ing it to interact with a second receptor, or co-receptor. Further conformational changes in
the receptor proteins and viral glycoproteins facilitate fusion of the viral envelope with the
cell membrane and the viral nucleocapsid is released into the cytoplasm of the cell. Once
inside the cell, the viral genome is uncoated and the RNA genome is used as a template in
the synthesis of complementary DNA (cDNA) by the reverse transcriptase enzyme carried
in the virus particle. Immediately after this process, a second viral enzyme, integrase, cata-
lyzes the integration of the viral cDNA into the host cell chromosome. Once integrated,
the viral genes can be transcribed and translated in order to generate new virus particles,
which leave the T cell by budding out from the cell membrane (see Figure 7.5).
[Figure 7.5 diagram steps:
1. Virus particle binds to CD4 and co-receptor on T cell.
2. Viral envelope fuses with cell membrane, allowing viral genome to enter the cell.
3. Reverse transcriptase copies viral RNA genomes into double-stranded cDNA.
4. Viral cDNA enters nucleus and is integrated into host DNA (provirus).
5. T cell induces low-level transcription of provirus.
6. RNA transcripts are multiply spliced, allowing translation of early genes.
7. Early genes code for proteins that amplify transcription of viral RNA and transport viral RNA to the cytoplasm.
8. Viral RNA and proteins are assembled into virus particles, which bud from the cell.]
Figure 7.5: The life cycle of HIV. HIV infects predominantly macrophages and T cells expressing CD4 on the
cell surface. The virus initially binds to CD4 (primary receptor) and to a co-receptor (CCR5 or CXCR4). The viral
envelope then fuses with the cell plasma membrane and the viral nucleocapsid enters the cytoplasm. After
uncoating, a complementary DNA (cDNA) copy of the viral RNA genome is synthesized by reverse transcriptase
and this cDNA is able to integrate into the host cell chromosome. The integrated genome is transcribed and
translated to produce viral proteins, which package newly synthesized viral RNA genomes to form new virus
particles that bud from the host cell. (From Strelkauskas AJ, Strelkauskas JE & Moszyk-Strelkauskas D [2009]
Microbiology: A Clinical Approach. Garland Science.)
during the asymptomatic phase of infection and are believed to be the strain commonly
transmitted by sexual contact. Over time, however, the viral RNA genome accumulates
mutations and changes in the gp120 protein alter its binding affinity such that it recog-
nizes the CXCR4 chemokine receptor instead of the CCR5 receptor (Figure 7.6). The
CXCR4 receptor is found on a wider range of CD4+ T cells, including naïve T cells, and is
associated with more rapid depletion of the CD4+ T cell population. The CXCR4-adapted
HIV-1 is known as the T-tropic strain and emerges in around 50% of infected individu-
als. CD4+ T cells are essential in both antibody-mediated and cell-mediated arms of the
immune response, so as more T cells become infected and their numbers fall, the host
develops deficiency in both antibody and cell-mediated immunity. This acquired immune
deficiency syndrome is called AIDS. In most cases, without treatment, infection with
HIV-1 is fatal within approximately 10 years. Death is not caused by HIV-1
directly; it results from overwhelming opportunistic infections and tumors, which
occur due to the failure of the host immune response, especially CD4+ T cell activity.
[Figure 7.6 labels: RANTES; SDF-1]
Figure 7.6: Attachment and entry of HIV-1. (a) CCR5 is a chemokine receptor expressed on the surface of
macrophages and T memory cells. The gp120 surface protein binds to CD4, inducing a conformational change
that allows gp120 to bind to the co-receptor CCR5. Binding to the co-receptor initiates fusion of the viral envelope
with the cell membrane allowing the virus nucleocapsid of the M-tropic HIV-1 to penetrate the cell. (b) RANTES
chemokine is a natural ligand for CCR5 and has the potential to block attachment of M-tropic strains of HIV-1,
thereby inhibiting infection. (c) As infection progresses, accumulated mutations in the gp120 gene lead to a change
in binding affinity. The resulting T-tropic strains of HIV-1 bind CXCR4 protein, expressed on the surface of a wider
range of CD4+ T cells and macrophages, as a co-receptor instead of CCR5. (d) SDF-1 chemokine is a natural ligand
for CXCR4 and has the potential to block attachment of T-tropic strains of HIV-1, thereby inhibiting infection.
(From O’Brien SJ & Nelson GW [2004] Nat Genet 36:565–574. With permission from Macmillan Publishers Ltd.)
• Fast progressors: are unable to control the virus and develop AIDS in less than 2 years.
• Elite controllers (EC): are infected but able to control the viral replication to an
extremely low level (less than 50 genome copies/ml).
We have already seen how variation in the host genome can influence outcome in malaria.
Could this spectrum of extreme outcomes following HIV-1 infection be attributed to
genetic variation in the human hosts?
infection by M-tropic HIV-1, but were susceptible to T-tropic strains. This result hinted
that CCR5 may be the key to the remarkable resistance of these two subjects and in
August 1996 it was reported that the same two EU subjects were homozygous for a 32-bp
deletion (Δ32) in the CCR5 gene (CCR5-Δ32). The deletion causes a frameshift that
leads to a premature truncation of the CCR5 protein such that it is no longer trafficked
to the cell membrane. Within a matter of weeks two further publications independently
reported that the CCR5-Δ32 homozygotes among EU subjects were entirely absent from
the HIV-infected populations tested. Further analysis of the HIV-infected subjects com-
pared the genotype of those who showed a delayed progression to AIDS (long-term non-
progressors) with those who progressed to AIDS at the usual rate and found CCR5-Δ32
heterozygotes were more frequent among the long-term non-progressors.
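The frameshift follows from simple arithmetic: codons are three bases long, so a deletion within coding sequence preserves the reading frame only if its length is a multiple of three. The sketch below checks this for the 32-bp deletion. The function name is illustrative and not taken from the studies described.

```python
def causes_frameshift(indel_length_bp: int) -> bool:
    """Return True if an insertion or deletion of this length (in coding
    sequence) shifts the reading frame, i.e. is not a multiple of 3."""
    if indel_length_bp <= 0:
        raise ValueError("indel length must be a positive number of base pairs")
    return indel_length_bp % 3 != 0


CCR5_DELTA32_BP = 32  # length of the deletion described in the text

print(causes_frameshift(CCR5_DELTA32_BP))  # True: 32 = 10 codons + 2 bases
print(causes_frameshift(30))               # False: exactly 10 codons removed
```

A 30-bp deletion would remove ten codons and leave the downstream sequence in frame, whereas the 32-bp deletion leaves two bases over and so alters every downstream codon.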
The overall conclusions at the end of this remarkably productive year were that the CCR5
polymorphism is a key player in infection by HIV-1. CCR5-Δ32 homozygotes are highly
resistant to M-tropic strains of HIV-1 because macrophage and T cells in these individuals
do not express CCR5 on the cell surface, and so the virus is unable to bind to and enter the
cell via the CCR5 route. Further work showed that CCR5-Δ32 heterozygotes have a 70%
reduced risk of infection compared with those without the Δ32 deletion and confirmed
that they are more likely to show a delayed progression to AIDS.
Figure 7.7: The contrasting effect of CCR5-Δ32 on infection by WNV compared with HIV-1. CCR5 is a
chemokine receptor expressed on the surface of macrophages and some CD4+ T cells. It acts as a co-receptor
for HIV-1, facilitating entry of HIV-1 into the host cell. A 32-bp deletion in the CCR5 gene (CCR5-Δ32) inhibits
trafficking of CCR5 to the cell surface and this reduces susceptibility to infection by HIV-1. By contrast, CCR5-Δ32
increases susceptibility to infection of the brain by WNV. CCR5 appears to play an important role in the immune
response in WNV infection, detecting cytokines secreted in response to infection of the brain by WNV and
mediating recruitment of lymphocytes to the site of infection. Reduced expression of CCR5 in cells with the
CCR5-Δ32 allele leads to suppression of the immune response to WNV. (From Lim JK, Glass WG, McDermott DH
& Murphy PM [2006] Trends Immunol 27:308–312. With permission from Elsevier.)
Table 7.5 Some examples of genetic variations in the genes encoding the chemokine receptor and
chemokine ligands that influence the outcome following infection with HIV-1.
but subsequent studies in various populations and ethnic groups have given very mixed
results—some indicating protection, some no effect, and some increased risk. It is there-
fore unclear whether CCR2-V64I has a causal effect on AIDS progression or whether the
association is due to linkage disequilibrium with the CCR5 gene.
A key paper published in 1999 by Gonzalez et al. shed some light on the complexity
of associations between CCR5 and CCR2 genotypes and the outcomes of HIV-1 infec-
tion. The paper identified nine haplogroups, based on polymorphisms in CCR2 and
CCR5, and showed that the frequency of the haplogroups varied considerably between
ethnic and racial groups (Figure 7.8). Moreover, the effect of the various haplotype com-
binations on the outcome of HIV-1 infection also varied depending upon ethnicity. For
example, HHA and HHA/HHF*2 haplotypes were associated with delayed progression
to AIDS in African-Americans but had no effect in European-Americans. Conversely,
HHC haplotypes were associated with delayed progression to AIDS in Europeans and
European-Americans, but with accelerated progression in African-Americans. To add to
the complexity, haplotype pairing is also important. For example, accelerated progression
to AIDS in African-Americans with the HHC haplotype could be mitigated if combined
with the protective effects of HHA or HHF*2, thus HHC/HHA and HHC/HHF*2
African-Americans showed delayed progression to AIDS. Subsequent studies on the effect
of these haplogroups on different populations have added to the complexity of the story
but the important message was eloquently summarized by Gonzalez et al., who observed
that “analysis of a single mutation or haplotype in isolation may obscure the complexity
underlying CCR5 genotype–phenotype relationships”.
[Figure 7.8 gene map labels: CCR2, CCR5, Pu, Pd, exons 1–3, +1, ATG, CCR2-64I, CCR5-Δ32; scale bar 1 kb. The haplotype table from the figure follows.]
Haplotype  CCR2-64I  29(-2733)A/G  208(-2554)G/T  303(-2459)A/G  627(-2135)C/T  630(-2132)C/T  676(-2086)A/G  (927(-1835)C/T)  CCR5-Δ32
HHG*1      V         G             G              A              C              C              A              C                –
HHG*2      V         G             G              A              C              C              A              C                +
HHF*1      V         A             G              A              C              C              A              T                –
HHF*2      I         A             G              A              C              C              A              T                –
HHE        V         A             G              A              C              C              A              C                –
HHD        V         A             T              G              T              T              A              C                –
HHB        V         A             T              G              T              C              A              C                –
HHC        V         A             T              G              T              C              G              C                –
HHA        V         A             G              G              T              C              A              C                –
Figure 7.8: CCR5/CCR2 haplogroups. The gene for CCR5, a chemokine receptor that also acts as an important
co-receptor for HIV-1, is located on the short arm of chromosome 3. The gene for CCR2 (another chemokine
receptor) lies less than 20 kb upstream of the CCR5 gene. Nine CCR2/CCR5 haplogroups were identified by
Gonzalez et al., based on polymorphisms in the CCR5 gene, in the promoter region, and in the CCR2 gene. The
effect of each haplogroup on the outcome of HIV-1 infection was shown to vary depending upon the ethnic or
racial background of the population studied. (Adapted from Arenzana-Seisdedos F & Parmentier M [2006] Semin
Immunol 18:387–403. With permission from Elsevier.)
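The nine haplogroups in Figure 7.8 can be treated as short strings of alleles, which makes it easy to see which sites distinguish one haplogroup from another. The following is a minimal sketch that encodes the allele pattern shown in the figure; the function names and the single-character encoding ("+" and "-" for presence and absence of Δ32) are illustrative choices, not part of the original study.

```python
# Site order follows Figure 7.8; promoter sites are keyed by the first number
# in the figure's labels (e.g. "29" for 29(-2733)A/G).
SITES = ("CCR2-64I", "29", "208", "303", "627", "630", "676", "927", "CCR5-Δ32")

# One character per site, in the order given in SITES.
HAPLOGROUPS = {
    "HHG*1": "VGGACCAC-",
    "HHG*2": "VGGACCAC+",
    "HHF*1": "VAGACCAT-",
    "HHF*2": "IAGACCAT-",
    "HHE":   "VAGACCAC-",
    "HHD":   "VATGTTAC-",
    "HHB":   "VATGTCAC-",
    "HHC":   "VATGTCGC-",
    "HHA":   "VAGGTCAC-",
}


def carriers(site: str, allele: str) -> list[str]:
    """List the haplogroups that carry the given allele at the given site."""
    index = SITES.index(site)
    return [name for name, alleles in HAPLOGROUPS.items() if alleles[index] == allele]


def differing_sites(hap_a: str, hap_b: str) -> list[str]:
    """List the sites at which two haplogroups carry different alleles."""
    pairs = zip(SITES, HAPLOGROUPS[hap_a], HAPLOGROUPS[hap_b])
    return [site for site, a, b in pairs if a != b]


print(carriers("CCR5-Δ32", "+"))      # ['HHG*2']
print(carriers("CCR2-64I", "I"))      # ['HHF*2']
print(differing_sites("HHC", "HHB"))  # ['676']
```

The output shows that CCR5-Δ32 and CCR2-64I each tag a single haplogroup (HHG*2 and HHF*2, respectively), which is one reason why an association seen with CCR2-64I is hard to separate from effects of the linked CCR5 promoter alleles.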
Table 7.6 Some HLA class I gene polymorphisms that have been shown to influence the outcome following
infection with HIV-1.
pronounced if two or three loci are homozygous. HLA class I proteins present viral anti-
gens to cytotoxic CD8+ T cells, leading to destruction of the infected cell. Homozygosity
in the HLA class I allele genotype means that the individual is able to present a reduced
number of viral epitopes compared with a heterozygous individual. Mutation of the
HIV-1 genome allows the virus to escape immune surveillance over time and this process
may happen more quickly in HLA class I homozygotes, which may explain the faster
progression to AIDS. When there is concordance in the HLA genotype of the mother and
child, there is also an increased risk of transmission from mother to child. Concordance in
this case means that both mother and child share the same HLA genotype, so that HIV-1
escape epitopes arising in the mother are also able to escape the cytotoxic CD8+ T cell
response in the child. Similarly, a high risk of transmission is reported to be associated
with concordance between sexual partners.
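The effect of homozygosity on the number of different class I molecules available for peptide presentation can be illustrated by counting distinct alleles across the three classical loci. This is a minimal sketch; the genotypes are invented for illustration and the function names are hypothetical.

```python
def distinct_class_i_alleles(genotype: dict[str, tuple[str, str]]) -> int:
    """Count the distinct HLA class I alleles carried across all loci."""
    return sum(len(set(pair)) for pair in genotype.values())


def homozygous_loci(genotype: dict[str, tuple[str, str]]) -> list[str]:
    """Return the loci at which both alleles are identical."""
    return [locus for locus, (a, b) in genotype.items() if a == b]


# Illustrative genotypes only; they are not taken from any study.
fully_heterozygous = {
    "HLA-A": ("A*01", "A*02"),
    "HLA-B": ("B*08", "B*57"),
    "HLA-C": ("C*06", "C*07"),
}
two_loci_homozygous = {
    "HLA-A": ("A*01", "A*01"),
    "HLA-B": ("B*08", "B*08"),
    "HLA-C": ("C*06", "C*07"),
}

print(distinct_class_i_alleles(fully_heterozygous))   # 6
print(distinct_class_i_alleles(two_loci_homozygous))  # 4
print(homozygous_loci(two_loci_homozygous))           # ['HLA-A', 'HLA-B']
```

An individual who is heterozygous at all three loci has six different class I molecules; homozygosity at two loci reduces this to four, and with it the range of viral epitopes that can be presented.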
Figure 7.9: Interaction between KIRs on NK cells and HLA-Bw4 on uninfected and HIV-1-infected cells. NK
cells express KIRs on the cell surface, which may activate or inhibit NK cell activity in response to ligand binding.
This interaction may be modified by the HLA molecule. HLA-Bw4 epitopes are found on some HLA-B proteins, as
well as some HLA-A proteins, and act as ligands for both inhibitory and activating KIRs. This process mediates the
destruction of HIV-1-infected cells by NK cells and may prevent destruction of non-infected cells. (Adapted from
Pelak K, Need AC & Fellay J [2011] PLoS Biol 9(11):e1001208. With permission from the Public Library of Science.)
hypothesis. The HLA-Bw4 motif plays a key role in this process, acting as an impor-
tant KIR-ligand. HLA-Bw4 homozygosity may therefore offer protection from HIV by
enhancing NK control of HIV-1 infection, although the mechanism for this is complex
and not yet fully understood.
Other HLA class I polymorphisms consistently associated with delayed progression to
AIDS are HLA-B*57 and HLA-B*27. The proposed mechanisms underpinning these asso-
ciations are thought to be linked to the peptide-presenting role of the HLA class I proteins.
HLA-B*57 is believed to have a broad peptide-binding specificity, which makes it harder
for HIV-1 escape mutants to evolve. HLA-B*27 is known to bind and present an epitope
from the HIV-1 Gag protein, and escape mutations within this epitope appear to reduce
the fitness of the virus. Again, it is therefore difficult for the virus to escape the CD8+ T cell
response. It is worth noting that, as discussed in Chapters 6 and 8, HLA-B*57 is also associated
with hypersensitivity to the anti-retroviral drug abacavir and therefore may also influence
the outcome of HIV-1 infection by modifying the patient’s response to treatment.
Peptide presentation may not, however, be the whole story. Interestingly, both HLA-B*27
and HLA-B*57 carry the Bw4 motif discussed above, and there is growing evidence that
interaction with KIRs may offer an alternative, or complementary, protective mechanism.
The KIR genes themselves are highly polymorphic and a marked protective effect has been
reported, e.g. for individuals with high-expressing alleles of KIR3DL1 in combination
with HLA-B*57.
HLA-B*35 is associated with rapid progression to AIDS; the effect appears to vary depending
on the ethnicity of the subjects, with significantly faster progression in Europeans and
European-Americans, but not in African-Americans. HLA-B*35 alleles can be divided
into two subtypes based on peptide-binding preference: HLA-B*35-PY preferentially binds
peptides with a proline at position 2 and a tyrosine (Y) at position 9; HLA-B*35-Px also
binds peptides with a proline at position 2 and has no specific preference (x) at position
9, but will not accept tyrosine. This difference in peptide binding can be attributed to a
single amino acid change at position 116 of the HLA-B*35 allele. Rapid progression to
AIDS is associated with HLA-B*35-Px, but not HLA-B*35-PY. This may explain the ethnic
variation in this association, as HLA-B*35-PY is the most common HLA-B*35 variant in
African-Americans. It should be noted, however, that there are more than 60 listed HLA-
B*35 alleles and this relationship may be much more complex than first reported.
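The PY and Px binding preferences described above amount to a simple rule on positions 2 and 9 of a nine-residue peptide. The sketch below applies that rule only; it checks the anchor motif and is not a predictor of real peptide binding. The function name and the example peptides are hypothetical.

```python
def b35_motif_class(peptide: str) -> str:
    """Classify a 9-residue peptide by the HLA-B*35 anchor motif it matches.

    Returns "PY" (proline at position 2, tyrosine at position 9),
    "Px" (proline at position 2, any residue except tyrosine at position 9),
    or "neither" (no proline at position 2).
    """
    peptide = peptide.upper()
    if len(peptide) != 9:
        raise ValueError("expected a 9-residue peptide")
    if peptide[1] != "P":  # position 2 in 1-based numbering
        return "neither"
    return "PY" if peptide[8] == "Y" else "Px"  # position 9 in 1-based numbering


# Invented peptides used purely to exercise the rule.
print(b35_motif_class("APAAAAAAY"))  # PY
print(b35_motif_class("APAAAAAAL"))  # Px
print(b35_motif_class("AAAAAAAAY"))  # neither
```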
the genetically diverse African ethnic populations. As discussed for malaria above, more
extensive SNP arrays, the application of imputation analysis, and more detailed sequence
data such as that provided by the 1000 Genomes Project, will help improve the success of
GWAS in these very important cohorts. As Africa bears by far the largest burden of HIV-1
infection, such developments will be critical in the fight against AIDS.
Figure 7.10: A three-dimensional ribbon diagram of the peptide-binding groove of the HLA-B protein.
Variations at amino acid positions 62, 63, 67, 70, and 97, all located in the peptide-binding groove, have been
shown to affect HIV-1 resistance. Variation in amino acid 97, located in the floor of the peptide-binding groove,
has the greatest impact on HIV-1 resistance.
the mechanism of control appears to involve a second SNP, 263D/I, located in a binding
site for micro-RNA in the 3′ untranslated region (UTR) of the HLA-C gene, which is in
linkage disequilibrium with the –35 SNP in Europeans. Binding of micro-RNA destabi-
lizes mRNA transcripts and is an important control mechanism for expression of many
genes. The 263D/I polymorphism affects the amount of HLA-C expressed on the cell
surface and presumably thereby moderates the immune response to HIV-1.
Some SNPs previously implicated in HIV-1 control have not yet been confirmed by GWAS
As polymorphisms in the CCR5 gene and in the genes encoding its chemokine ligands are clearly
associated with HIV resistance, we would predict that associations with these alleles would be
identified in GWAS. Interestingly, only a weak association with CCR5/CCR2 was detected and
no significant association was seen with any of the other SNPs identified in hypothesis-
driven studies (except those in the MHC region). MHC variants have been proposed to
account for around 19% of the variability seen in viral load in HIV-1-positive subjects,
which rises to 23% when combined with the CCR5/CCR2 variants. There are clearly a
large number of as yet undetermined factors, probably both genetic and environmental,
which affect infection by HIV-1 and progression to AIDS.
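The size of the gap can be made explicit with the two percentages quoted above. The sketch below simply subtracts them; it treats the quoted figures as exact, although the text gives them as approximate.

```python
# Percentages of variability in viral load quoted in the text.
MHC_ONLY_PCT = 19
MHC_PLUS_CCR5_CCR2_PCT = 23


def variance_breakdown(mhc_pct: int, combined_pct: int) -> dict[str, int]:
    """Split the variability into the share added by CCR5/CCR2 and the share left unexplained."""
    if not 0 <= mhc_pct <= combined_pct <= 100:
        raise ValueError("expected 0 <= mhc_pct <= combined_pct <= 100")
    return {
        "added_by_ccr5_ccr2": combined_pct - mhc_pct,
        "unexplained": 100 - combined_pct,
    }


print(variance_breakdown(MHC_ONLY_PCT, MHC_PLUS_CCR5_CCR2_PCT))
# {'added_by_ccr5_ccr2': 4, 'unexplained': 77}
```

On these figures the CCR5/CCR2 variants add only about 4 percentage points, and roughly three-quarters of the variability in viral load remains unexplained.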
Conclusions
Infectious disease, unlike most of the other complex diseases discussed in this book, is not
generally considered to be genetic in the common use of the term. However, it does have
a heritable component. It is important to remember that infectious disease has played a
major role in human evolution and genetic diversity. Indeed, one of the important scien-
tific issues that developed from present studies has been a better understanding of how the
pressure from infectious disease on human populations has played a significant role in the
evolution of the human genome and also the maintenance of particular alleles that confer
resistance to pathogens.
Heritability in infectious disease can be illustrated by considering human history. For
example, consider the impact of pathogens carried by the conquistadors on the indig-
enous populations of South America, and later scientific observations on the apparent
heritability of leprosy and tuberculosis. Both of these examples illustrate how the outcome
of exposure to a pathogen may be influenced by the host genome. There is an enormous
variety of pathogens that can challenge Homo sapiens, and new viruses and bacterial strains
arise frequently, e.g. the hepatitis viruses (A, B, and C) and HIV-1 are all quite recently
emerged human pathogens. In a genetically diverse population when new infectious
agents develop, diversity ensures that someone is likely to survive a pandemic, whereas in
a less diverse population the likelihood of survival will be reduced.
The genetics of infectious diseases can be more difficult to study than the genetics of non-
infectious chronic disease. The causative agents of most infectious diseases are known. In
some cases exposure almost always leads to infection, so investigating the role of genes in
infection per se is not possible or appropriate. The real points for genetic research often
lie with understanding response to infection, progression, outcome, and response to treat-
ment, and not initial susceptibility or resistance to infection, which may be governed
by other factors. However, this is not true for all infectious diseases, e.g. HIV-1 and the
CCR5-Δ32 deletion discussed above.
The same processes that have been applied to non-infectious disease have been applied to
infectious disease and with great success in some cases, especially when the right question
is set. Substantial existing knowledge of the infectious process has allowed scientists to
use a hypothesis-driven approach to identify genes that confer susceptibility or resistance
to infectious disease. More recently, non-hypothesis-constrained approaches, particularly
GWAS, have been used to identify susceptibility and resistance alleles in infectious dis-
ease, and this is shedding new light on the mechanisms by which some pathogens cause
disease. There are, however, significant technical challenges to be overcome in carrying out
GWAS in non-European and mixed populations, especially in Africa, where the burden
of infectious disease is greatest. Despite considerable progress in our understanding of the
interplay between the human genome and infectious disease, there are as yet few examples
of this knowledge leading to effective preventions or treatments.
By concentrating on two selected examples of infectious disease, this chapter shows how
variation in the host genome plays a significant role in determining the disease phenotype.
The first example, malaria, illustrates how genetic variation associated with Mendelian
recessive traits can influence complex traits. In the second example, HIV-1, we consid-
ered how inherited variation in genes can directly influence infection with a virus. In this
example we also considered how genetic variation can influence the host response to a
virus and briefly how our genes may determine our response to treatment. The last point
is covered in more detail in Chapter 8. All of this holds a clue to how we may one day conquer
many different infections through a better understanding of disease pathogenesis and
the development of more efficient treatment programs, including better vaccines where
appropriate.
Future use of whole-genome sequencing, epigenetics, and expansion of existing databases
using imputation may produce new insights into the genetics of infectious disease. We are
making progress towards fulfilling some of the promises of the Human Genome Project,
namely to improve patient management and care and to improve our understanding of disease
pathology. However, as we have seen, and shall see again in the next chapter, there is still
plenty of work to be done.
Further Reading
Books
Engleberg NC, DiRita V & Dermody T (2012) Schaechter’s Mechanisms of Microbial Disease, 5th ed. Lippincott Williams & Wilkins. An accessible entry-level textbook on microbial disease, including chapters on the immune response, retroviruses (including HIV-1), and protozoa (including P. falciparum).
Haines JL & Pericak-Vance MA (1998) Overview of mapping common and genetically complex human disease traits. In Approaches to Gene Mapping in Complex Human Disease (Haines JL & Pericak-Vance MA eds), pp 1–16. Wiley. This book provides the best definition of complex disease and is a good starter text on major issues in complex disease.
Articles
Allison AC (2004) Two lessons from the interface of genetics and medicine. Genetics 166:1591–1599. This is an interesting autobiographical account of the discovery of the protective role of sickle cell anemia against malaria.
An P & Winkler CA (2010) Host genes associated with HIV/AIDS: advances in gene discovery. Trends Genet 26:119–131. This is an excellent overview of the field that includes a discussion of genome-wide RNA interference screens, not included in this chapter.
Arenzana-Seisdedos F & Parmentier M (2006) Genetics of resistance to HIV infection: role of co-receptors and co-receptor ligands. Semin Immunol 18:387–403.
Carrington M & Walker B (2012) Immunogenetics of spontaneous control of HIV. Annu Rev Med 63:131–145. This is a clear and readable review on the effect of HLA genotype on response to HIV.
Chapman SJ & Hill AVS (2012) Human susceptibility to infectious disease. Nat Rev Genet 13:175–188. A comprehensive overview of recent developments in the field; includes a discussion of single-gene variants associated with susceptibility and resistance to infectious disease, and a very useful glossary of terms.
Driss A, Hibbert JM, Wilson NO et al. (2011) Genetic polymorphisms linked to susceptibility to malaria. Malaria J 10:271–281. An overview of polymorphisms associated with susceptibility and resistance to malaria.
Fellay J, Shianna KV, Ge D et al. (2007) A whole-genome association study of major determinants for host control of HIV-1. Science 317:944–947.
Glass WG, McDermott DH, Lim JK et al. (2006) CCR5 deficiency increases risk of symptomatic West Nile virus infection. J Exp Med 203:35–40.
Gonzalez E, Bamshad M, Sato N et al. (1999) Race-specific HIV-1 disease-modifying effects associated with CCR5 haplotypes. Proc Natl Acad Sci USA 96:12004–12009. This is a key study that first defined and investigated CCR2–CCR5 haplogroups and their influence on HIV progression to AIDS.
Hill AVS, Allsop CEM, Kwiatkowski D et al. (1991) Commonest African HLA antigens are associated with protection from severe malaria. Nature 352:595–600.
International HIV Controllers Study (2010) The major genetic determinants of HIV-1 control affect HLA class I peptide presentation. Science 330:1551–1557. This paper reports a major collaborative GWAS that implicates HLA-C and the binding groove of HLA-B in the response to HIV-1 infection.
Jallow M, Teo YY, Small KS et al. (2009) Genome-wide and fine-resolution analysis of malaria in West Africa. Nat Genet 41:657–665. This paper describes a GWAS study carried out by members of the MalariaGEN consortium that provides an excellent example of the challenges of GWAS in African populations.
Lim JK, Glass WG, McDermott DH & Murphy PM (2006) CCR5: no longer a “good for nothing” gene – chemokine control of West Nile virus infection. Trends Immunol 27:308–312. This is an interesting review of the role of CCR5 in the control of WNV infection.
Lopez C, Saravia C, Gomez A et al. (2010) Mechanisms of genetically based resistance to malaria. Gene 467:1–12. A review of potential mechanisms by which polymorphisms associated with susceptibility and resistance to malaria exert their effects.
Individuals vary considerably in their response to prescribed drugs and other foreign
compounds. Genetic factors have been shown to make an important contribution to
this variability. The subject of pharmacogenetics is concerned with this relationship.
Though considered to be a specialized discipline, pharmacogenetics can also be con-
sidered to be a sub-branch of complex disease genetics, because even though many
pharmacogenetic traits behave as Mendelian traits, some are complex traits with low
levels of penetrance and more than one risk allele. In this chapter, we will concentrate
on genetic factors that affect either drug metabolism or the interaction of the drug with
its target within the body, and also on adverse drug reactions, which are unwanted
and unexpected effects of a prescribed drug that can sometimes be life-threatening to
the patient. A number of different drug classes will be included in the examples we
consider, but we will not consider pharmacogenetic aspects of individual physiologi-
cal systems or deal with particular medical specialties. Pharmacogenetics can also be
extended to other areas not concerned with response to prescribed medicines such
as genetic susceptibility to diseases where chemical exposure is a risk factor, including
cancer. This area will be considered briefly in Chapter 9. Similarly, the cancer
genome as opposed to the patient genome may help predict the outcome of drug treat-
ment in cancer. This specialized aspect of pharmacogenetics will also be considered
separately in Chapter 9.
1994 CYP2C19 gene shown to code for the enzyme that metabolizes mephenytoin and gene defects identified.
1995 Thiopurine methyltransferase gene cloned and polymorphisms responsible for absence of activity identified.
1997 Term “pharmacogenomics” first used.
1998 Trastuzumab (Herceptin®) licensed as a targeted treatment for breast cancer.
2001 First draft of the Human Genome Map.
2004 VKOR gene cloned. Polymorphism in this gene is a key determinant of warfarin dose requirement.
2007 First GWAS on complex diseases are followed by GWAS on pharmacogenetic traits.
2008 Clinical trial shows HLA-B*57:01 to be a highly sensitive predictor of abacavir hypersensitivity. Genotyping for this allele prior to treatment becomes common.
2009 IL28B genotype shown to be key predictor of interferon-α response in hepatitis C infection.
[Figure 8.1 diagram labels: lipophilic drug; phase I (mostly oxidation reactions, notably aromatic
hydroxylation), carried out by monooxygenases (FMO, CYP450), esterases, and hydrolases; water-soluble
intermediate with reactive “handle”; phase II (conjugation reactions in which a transferase adds a
chemical group): NAT (Ac), TPMT (Me), UDPG (glucuronyl), GST (glutathionyl), ST (SO4 2–); excretable product.]
Figure 8.1: Phase I and phase II drug metabolism. Phase I drug reactions create more polar drugs with a
reactive group that prepares the drug for phase II. They involve a variety of enzymes, including CYP450. The
phase II reactions involve addition of chemicals to the molecule to facilitate excretion of the final product.
Note that phase II reactions do not always occur after phase I reactions; they can occur without a phase I
reaction. BCHE, butyrylcholinesterases; CES, carboxyesterase; EPHX, epoxide hydrolases; FMO, flavin-containing
monooxygenases; GST, glutathione S-transferases; NAT, N-acetyltransferases; PON, paraoxonases; TPMT,
thiopurine methyltransferases; UGT, UDP-glucuronosyltransferases; ST, sulfotransferases. (From Strachan T,
Goodship J & Chinnery PF [2014] Genetics and Genomics in Medicine. Garland Science.)
[Figure 8.2 panel labels. (a) CYP2D6 allele types: inactivating mutation (n = 0), deletion (n = 0),
simple duplication (n = 2), multiplex duplication (n = 3–13). (b) Phenotypes and their frequencies in
Caucasians: ultrarapid metabolizers 5–10%, extensive metabolizers 65–80%, intermediate
metabolizers 10–15%, poor metabolizers 5–10%. Histogram of number of patients (0–90) against
metabolic ratio (0.01–100, logarithmic scale). Nortriptyline dose requirement (mg/day) as printed:
>250–500, 150–100, 20–50.]
Figure 8.2: CYP2D6 alleles and relationship between genotype and metabolism of the antidepressant
drug nortriptyline. (a) CNV in CYP2D6 genes is common in some populations and creates different
phenotypes according to the copy number. There are also deletions, inactivation, and reduced function
mutations. (b) CNV results in four phenotypes: ultrarapid metabolizers, extensive metabolizers, intermediate
metabolizers, and poor metabolizers. The figure shows urinary concentration of the drug substrate, in this
case the antidepressant nortriptyline, with high metabolic ratios representing poor metabolizers and low
metabolic ratios representing ultrarapid metabolizers. Extensive and intermediate metabolizers are shown in
between these two extremes. (Adapted from Meyer UA [2004] Nat Rev Genet 5:669–676. With permission from
Macmillan Publishers Ltd.)
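The phenotype frequencies given for Caucasians in Figure 8.2 (ultrarapid 5–10%, extensive 65–80%, intermediate 10–15%, poor 5–10%) translate directly into expected patient numbers. The following minimal sketch applies those ranges to a cohort; the cohort size of 1000 is an arbitrary illustration and the function name is hypothetical.

```python
# Frequency ranges (percent) as given in Figure 8.2 for Caucasian populations.
PHENOTYPE_FREQ_PCT = {
    "ultrarapid": (5, 10),
    "extensive": (65, 80),
    "intermediate": (10, 15),
    "poor": (5, 10),
}


def expected_counts(cohort_size: int) -> dict[str, tuple[float, float]]:
    """Return the expected (low, high) number of patients for each phenotype."""
    if cohort_size < 0:
        raise ValueError("cohort size cannot be negative")
    return {
        phenotype: (cohort_size * low / 100, cohort_size * high / 100)
        for phenotype, (low, high) in PHENOTYPE_FREQ_PCT.items()
    }


for phenotype, (low, high) in expected_counts(1000).items():
    print(f"{phenotype}: {low:.0f} to {high:.0f} patients")
```

For a cohort of 1000, this gives 50 to 100 poor metabolizers and 50 to 100 ultrarapid metabolizers, so a standard dose would be expected to suit neither group in a substantial minority of patients.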
Table 8.1 Properties of selected cytochromes P450 important in human drug metabolism.
Isoform  Substrates                                        Examples of polymorphisms (including rs number where appropriate)  Effect of polymorphism
CYP2D6   Debrisoquine, tricyclic antidepressants, codeine  Splice site (rs3892097)                                            No activity
                                                           Deletion                                                           No activity
                                                           Missense (rs1065852)                                               Decreased activity
                                                           Duplication                                                        Increased activity
CYP2C19  Clopidogrel, diazepam, omeprazole                 Splice site (rs4244285)                                            No activity
                                                           Promoter region base substitution (rs12248560)                     Increased activity
CYP2C9   Warfarin, diclofenac, ibuprofen                   Missense (rs1057910)                                               Decreased activity
CYP3A4   Midazolam, nifedipine, cyclosporin, tacrolimus    Missense (rs55785340)                                              Decreased activity
                                                           Intronic base substitution (rs35599367)                            Decreased activity
CYP3A5   Tacrolimus                                        Splice site (rs776746)                                             No activity
CYP2A6   Nicotine                                          Missense (rs1801272)                                               No activity
                                                           Deletion                                                           No activity
of active CYP2D6 genes. The CYP2D6 polymorphisms are associated with metabolizing
activity for a number of different drugs that use the same pathway.
Though CYP2D6 was the first example of a cytochrome P450 gene showing functionally
significant polymorphism, there are now a number of similar examples involving other
members of the cytochrome P450 family of genes. Table 8.1 lists the most important
examples with details of typical drug substrates and the main polymorphisms involved
together with their overall effects. Polymorphisms in genes encoding enzymes that affect
drug metabolism such as cytochrome P450 may result in circulating drug levels that are
either too low or too high to achieve the normal therapeutic response. It is possible to over-
come this problem by either changing the dose or prescribing a drug that is metabolized
by a different enzyme. Three drugs where the cytochrome P450 genotype appears to be
clinically relevant are considered in more detail below: codeine, warfarin, and clopidogrel.
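The link between a polymorphism's effect on enzyme activity and the circulating level of the drug can be sketched as a lookup. The entries below are a selection from Table 8.1; the mapping from enzyme activity to parent drug level is a deliberate simplification (it ignores prodrugs, whose active metabolite falls when metabolism is reduced), and the function name is hypothetical.

```python
# Selected entries from Table 8.1: rs number -> (gene, variant type, effect on activity).
VARIANT_EFFECTS = {
    "rs3892097": ("CYP2D6", "splice site", "no activity"),
    "rs4244285": ("CYP2C19", "splice site", "no activity"),
    "rs12248560": ("CYP2C19", "promoter region base substitution", "increased activity"),
    "rs1057910": ("CYP2C9", "missense", "decreased activity"),
    "rs35599367": ("CYP3A4", "intronic base substitution", "decreased activity"),
    "rs776746": ("CYP3A5", "splice site", "no activity"),
    "rs1801272": ("CYP2A6", "missense", "no activity"),
}


def parent_drug_level(rs_id: str) -> str:
    """Return the expected direction of change in parent drug level.

    Simplifying assumption: less enzyme activity means slower clearance and a
    higher level of the unmetabolized (parent) drug, and vice versa.
    """
    _gene, _variant_type, effect = VARIANT_EFFECTS[rs_id]
    return "lower" if effect == "increased activity" else "higher"


for rs_id, (gene, variant_type, effect) in VARIANT_EFFECTS.items():
    print(f"{gene} {rs_id} ({variant_type}, {effect}): parent drug level {parent_drug_level(rs_id)}")
```

In practice the clinical consequence depends on whether the parent drug or a metabolite is the active species, which is why codeine and clopidogrel behave differently from warfarin.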
[Figure 8.3 diagram labels: precursor and biologically active forms linked by the carboxylase; vitamin K
(reduced) and vitamin K (epoxide); VKORC1; warfarin (S-warfarin, R-warfarin); oxidative deactivation;
CYP1A2, CYP2C9, CYP3A4.]
Figure 8.3: The mechanism of action of warfarin and role of VKORC1 in the pathway. The role of VKORC1
in the conversion of vitamin K epoxide to reduced vitamin K is shown. The carboxylase enzyme transfers carboxyl
groups to glutamic acid residues of selected clotting factors with a requirement for reduced vitamin K. Warfarin
inhibits the VKORC1 reaction. There are two chemically distinct enantiomers of warfarin: the R and S forms. The
S enantiomer is more biologically active than the R form and is also metabolized only by CYP2C9, whereas other
P450 isoforms metabolize the R form.
[Figure 8.4 panel labels: (a) histogram of percent of subjects per 0.5 units of activity against TPMT activity
(units/ml RBCs), with genotypes TPMTL/TPMTL, TPMTL/TPMTH, and TPMTH/TPMTH indicated; (b) exon maps
(exons 1–10, with VNTR) for TPMT*1 (wild-type) and TPMT*3A, with the variant A719G (Tyr240Cys) labeled.]
Figure 8.4: TPMT: relationship between levels of the enzyme in red blood cells and genotype. (a) Activity
of TPMT in red blood cells (RBCs) from 298 European blood donors. Presumed genotypes for the TPMT gene
polymorphisms are also indicated. TPMTL and TPMTH are designations for alleles resulting in “low” and “high”
activity, respectively. These allele designations were used before the molecular basis for the polymorphism
was understood. (Adapted from Weinshilboum RM & Sladek SL [1980] Am J Hum Genet 32:651–662. With
permission from Elsevier.) (b) Common TPMT alleles. TPMT*1 is the most common allele (wild-type). TPMT*3A,
with two non-synonymous coding SNPs, is the most common variant allele in Caucasian European subjects.
TPMT*3C is the most common variant allele in East Asian subjects. Rectangles represent exons with the gray scale
representing the open reading frame.
eczema, and rheumatoid arthritis. Both 6-mercaptopurine and azathioprine are thiopu-
rine drugs that were originally developed by Gertrude Elion—a pioneering biochemist
who received the Nobel Prize for Medicine in 1988. TPMT is polymorphic (Figure 8.4).
Approximately 1 in 300 individuals lack TPMT activity. This appears to be a recessive trait
usually due to the presence of one or more non-synonymous mutations. These individuals
will develop severe bone marrow toxicity if prescribed the normal recommended dose of
either 6-mercaptopurine or azathioprine. It is therefore recommended that patients should
be genotyped for TPMT before they are treated with 6-mercaptopurine or azathioprine
and those predicted to lack activity should either be given a lower drug dose or treated with
another drug that is not metabolized by TPMT.
This is probably the best current example of a pharmacogenetic polymorphism where
genotyping is routinely performed prior to drug prescription and the result used to guide
treatment. The example once again illustrates the potential role for pharmacogenetics in
clinical practice.
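If absence of TPMT activity is a recessive trait affecting about 1 in 300 individuals, the frequency of carriers can be estimated from the Hardy–Weinberg relationship, under which affected individuals occur at frequency q² and carriers at 2pq. The sketch below assumes Hardy–Weinberg equilibrium and pools all low-activity alleles into a single class; both are simplifying assumptions not stated in the text.

```python
import math


def hardy_weinberg_from_recessive(freq_affected: float) -> dict[str, float]:
    """Estimate allele and genotype frequencies from the frequency of a recessive trait.

    Assumes Hardy-Weinberg equilibrium with two allele classes:
    q = sqrt(freq_affected), p = 1 - q, carrier frequency = 2pq.
    """
    if not 0 <= freq_affected <= 1:
        raise ValueError("frequency must lie between 0 and 1")
    q = math.sqrt(freq_affected)
    p = 1 - q
    return {
        "low_activity_allele": q,
        "carriers": 2 * p * q,
        "non_carriers": p * p,
    }


estimate = hardy_weinberg_from_recessive(1 / 300)
print(f"low-activity allele frequency: {estimate['low_activity_allele']:.3f}")  # 0.058
print(f"carrier frequency: {estimate['carriers']:.3f}")                         # 0.109
```

On these assumptions roughly 1 in 9 individuals would carry one low-activity allele, which is consistent with the intermediate-activity group visible in Figure 8.4(a).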
the endoplasmic reticulum. VKORC1, the gene encoding the enzyme VKOR in
humans, was only identified quite recently. Coumarin anticoagulants including war-
farin (see Section 8.2) inhibit VKOR enzyme activity directly. A very small proportion
of patients treated with coumarin anticoagulants fail to respond at doses within the
normal range and are termed coumarin resistant. This lack of response appears to be
due to rare non-synonymous mutations in VKORC1 that are thought to affect binding of
the anticoagulant to the enzyme. Some coumarin-resistant individuals cannot be given
coumarin anticoagulants successfully, but others can be stabilized on a very high dose.
Polymorphisms in the VKORC1 coding sequences are rare, but several non-coding polymorphisms,
including -1639G > A (rs9923231), affect levels of VKORC1 expression.
The A variant, for example, is associated with lower transcription than the G variant.
VKOR protein levels appear to directly determine the required dose of an anticoagulant.
A clear association between -1639G > A genotype and warfarin dose requirement has
been reported in a large number of independent studies, and similar associations for
the other coumarin anticoagulants such as acenocoumarol and phenprocoumon have
also been detected. There is ethnic variation in the frequency of the -1639A variant.
Though the -1639A allele is less common than -1639G in Europeans, it occurs at a high
frequency in East Asians and, as a result, the average dose of coumarin anticoagulant
required in East Asian patients is significantly lower than that required in Europeans.
Approximately 20–30% of variability in coumarin anticoagulant dose requirement can
be explained by VKORC1 genotype. Anticoagulant dosage is routinely individualized by
measurement of coagulation rate, but it is possible that problems with excessive bleeding
or lack of response during the initial treatment period may be decreased by genotyping for
VKORC1 and for CYP2C9 (see Figure 8.3). Genotype and other patient-specific factors
can be used to calculate an individualized dose. The FDA suggests that genotyping for
VKORC1 and CYP2C9 should be considered in patients receiving warfarin treatment, and
clinical trials to assess the value of genotype-based dosing are in progress.
[Figure 8.5 image: amino acid sequence diagram of the β2-adrenoceptor, running from the H2N– terminus (extracellular) through transmembrane domains I–VII to the –COOH terminus (intracellular), with the variants Arg16Gly, Gln27Glu, Ile164Thr, Arg175Arg, and Gly351Gly marked.]
Figure 8.5: ADRB2 gene polymorphisms. The figure illustrates the main polymorphisms in the ADRB2 gene. The
β2-adrenoreceptor (ADRB2) gene is located on chromosome 5q31–q32 and consists of only one exon. There are
five common variants in the coding sequence, indicated in the figure. For the R16G polymorphism (rs1042713),
the Arg (R) allele occurs in 25–40% of all individuals, and for the Q27E variant (rs1042714), the Glu (Q) allele
occurs in 50–60% of all white individuals. The I164T variant (rs1800888) is associated with a strong and stable
phenotype. Additional common variants of unknown functional relevance are R175R (rs1042718) and G351G
(rs1042719). (From Rosskopf D & Michel MC [2008] Pharmacol Rev 60:513–535. With permission from The
American Society for Pharmacology and Experimental Therapeutics.)
Data on the functional significance of the codon 16 polymorphism is limited but the
arginine-16-containing form of ADRB2 appears to be associated with increased agonist-
induced downregulation and with a poorer response to inhaled short-acting β-agonists.
This response to β2-adrenergic receptor agonists appears to be specific to short-acting
agonists as a large study involving patients treated with longer-acting agonists failed to
show any difference in outcome based on the codon 16 genotype. In general, it is now
accepted that the codon 16 genotype does appear to affect response to short-acting ago-
nists. However, this may not be important in clinical practice as current treatments mainly
involve use of longer-acting agonists.
hepatocellular when the injury is focused on the hepatocyte and cholestatic when the
damage occurs at the hepatocyte canalicular membrane or further downstream in the
biliary tree (Figure 8.6). The underlying mechanism by which DILI develops may
involve both direct toxic effects by the drug, e.g. involving oxidative stress or cellular
damage, and formation of reactive intermediates resulting in an inappropriate immune
[Figure 8.6 image: anatomical diagram labeled esophagus, liver, left hepatic duct, stomach, gall bladder, cystic duct, common bile duct, pancreatic duct, pancreas, and duodenum.]
Figure 8.6: The biliary tree. When the liver cells secrete bile, it is collected by a system of ducts that flow from
the liver through the right and left hepatic ducts to the common hepatic duct, which joins with the cystic duct to
form the common bile duct. Approximately half of the bile formed flows into the duodenum or small intestine
and half into the gallbladder. The stored bile is used to break down fats following the ingestion of food. (From
Ahmed N, Dawson M, Smith C & Wood E [2006] Biology of Disease. Garland Science.)
[Figure 8.7 image: Manhattan plot; x-axis: chromosome (1–22 and X); y-axis: –log10P (0–30).]
Figure 8.7: A Manhattan plot for a GWAS of flucloxacillin-induced liver injury. This Manhattan plot
indicates a number of statistically significant associations with flucloxacillin-induced liver injury. In particular,
the –log10P value shows that a number of polymorphisms in the MHC region have genome-wide significance.
The study involved 51 cases of DILI and 282 population controls. The polymorphism (rs2395029) showing the
most significant difference in frequency between cases and controls is in complete linkage disequilibrium with
the HLA class I HLA-B*57:01 allele. (From Daly AK, Donaldson PT, Bhatnagar P et al. [2009] Nat Genet 41:816–
819. With permission from Macmillan Publishers Ltd.)
response. Up to 10% of patients with DILI develop liver failure, which may be fatal
unless a liver transplant is performed, though the majority of patients recover after the
causative drug is discontinued. There is sometimes overlap between hypersensitivity
reactions as discussed above and DILI.
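The –log10P scale used in Manhattan plots such as Figure 8.7 can be illustrated with a short calculation. The threshold of 5 × 10^-8 used below is the conventional genome-wide significance level, not a value taken from the flucloxacillin study, and the example P values are hypothetical.

```python
import math

# Conventional genome-wide significance threshold (an assumption here;
# the threshold used in the study is not given in the text).
GENOME_WIDE_P = 5e-8


def neg_log10(p_value):
    """Convert a P value to the -log10(P) scale plotted on a Manhattan plot."""
    return -math.log10(p_value)


def is_genome_wide_significant(p_value, threshold=GENOME_WIDE_P):
    """True if the P value is at or below the chosen threshold."""
    return p_value <= threshold


# Height of the significance line on the -log10(P) axis.
threshold_height = neg_log10(GENOME_WIDE_P)

# Hypothetical P values, for illustration only.
for p in (1e-3, 5e-8, 1e-30):
    print(p, round(neg_log10(p), 2), is_genome_wide_significant(p))
```

Because the scale is logarithmic, each additional unit on the vertical axis corresponds to a tenfold smaller P value, so a peak near 30 is many orders of magnitude beyond the conventional threshold.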
Several different HLA alleles are risk factors for DILI, but the strongest effect for any
HLA allele reported up to the present is that with HLA-B*57:01, which has been
associated with flucloxacillin-induced liver injury (see Figure 8.7). Though this is the
same risk allele as that identified for abacavir hypersensitivity and the overall specificity
is high, the positive predictive value is low because most individuals who are prescribed flucloxacillin
(a commonly used antibiotic, especially in the UK) and are B*57:01-positive do not
appear to develop liver injury. The underlying mechanisms for toxicity with abacavir
and flucloxacillin may also differ. Another example where the same HLA alleles
and haplotypes appear to be risk factors for liver injury induced by different drugs
(Table 8.2) is lapatinib and ximelagatran, two drugs with very different chemical struc-
tures that share the same association with the HLA-DRB1*07:01–DQB1*02:02 hap-
lotype and the development of liver toxicity. A common feature of many of these
associations discussed here is that compounds with very different chemical structures
show the same HLA association.
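Why a strongly associated allele can still be a poor predictive test can be sketched with Bayes' rule. The sensitivity, specificity, and risk values below are hypothetical, chosen only to illustrate the arithmetic; they are not estimates from the flucloxacillin study, and the function name is an assumption.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that an allele-positive patient develops the reaction.

    sensitivity: fraction of cases who carry the allele
    specificity: fraction of non-cases who do NOT carry the allele
    prevalence:  fraction of treated patients who develop the reaction
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs for illustration only (not from the source).
ppv = positive_predictive_value(sensitivity=0.85, specificity=0.94, prevalence=0.0001)
print(f"PPV = {ppv:.4f}")
```

When the reaction is rare, allele-positive patients who never develop it greatly outnumber those who do, so the predictive value of a positive test stays low even if most cases carry the allele.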
There are many other susceptibility factors for serious adverse drug reactions
Based on a number of studies, it appears that liver injury associated with some widely used
drugs such as statins and isoniazid is not HLA-associated. Risk factors for these forms of
DILI may include:
• Genotypes for enzymes affecting drug metabolism, e.g. N-acetyltransferase 2 (NAT2),
which metabolizes isoniazid.
• Genes relevant to oxidative stress, e.g. superoxide dismutase (SOD2), and genes
relevant to the innate immune response, e.g. signal transducer and activator of
transcription 4 (STAT4), which encodes a transcription factor that regulates T cell
responses.
[Figure 8.8 image: (a) ECG trace with the P, Q, R, S, and T waves and the QT interval marked; (b) ECG trace with a prolonged (long) QT interval.]
Figure 8.8: Drug-induced prolongation of the cardiac QT interval and torsades de pointes. (a) The cardiac
depolarization and repolarization cycle. The QT interval is shaded; it starts with the onset of QRS and terminates
at the end of the T wave. This is the time taken for one complete cycle. (b) Some drugs can prolong the QT
interval and induce rapid heart beating, which is seen in the torsades de pointes (TdP; polymorphic ventricular
tachycardia). The abnormal TU wave preceding the TdP indicates cardiac arrhythmia, which can potentially be
fatal. (From Strachan T, Goodship J & Chinnery PF [2014] Genetics and Genomics in Medicine. Garland Science.)
Conclusions
There are now a large number of well-described and replicated pharmacogenetic associa-
tions. Recent developments in genomics such as the widespread application of GWAS
have helped increase knowledge in this area. Despite the large body of knowledge, there
has been less progress than hoped in the clinical implementation of these findings,
with a few important exceptions. Currently, genotyping for HLA-B*57:01 prior
to abacavir prescription and TPMT genotyping or phenotyping prior to azathioprine
or 6-mercaptopurine prescription are the main examples of tests that are performed
routinely in many countries. For many of the other examples described above, the
sensitivity of the test may not be sufficient to make genetic testing useful in diagnosis or
treatment choice. As there are no data from randomized clinical trials for some of these
examples, assessing the clinical utility of the genetic associations is not yet possible. In the case of
warfarin and other coumarin anticoagulants, several clinical trials to assess the value of
genotyping for CYP2C9 and VKORC1 prior to prescription have now been completed,
but unfortunately the findings are inconsistent, with the two largest studies disagreeing
on whether genotyping is useful. Even larger studies may be needed to resolve the
discrepancy. There are additional examples where clinical trials might be helpful,
such as CYP2C19 testing for clopidogrel and CYP2D6 testing for tamoxifen. There are also
examples where patient response to commonly used treatments is variable
and impossible to predict. In these cases, response may have a genetic component. This
includes response to antihypertensive drugs and to aspirin. Clearly, there is much more
work yet to be done in this area, but some of the results so far look promising. The
increasing availability of genome-wide sequencing should help us to better understand the
complex genotype–phenotype interactions that exist in pharmacogenetics.
The application of new technologies will speed up the process of genotyping in phar-
macogenetics as well as in other areas of genetic research. The further development of
large collaborative groups and integration of data sets and large collections will enable
larger, better quality studies to be performed. Altogether, this means faster, better,
and more detailed studies. Identifying major genetic links in pharmacogenetics
and using that information to develop better pharmacological agents, and
identifying genetic links to an individual’s response to commonly used drugs in order to
produce personalized therapy plans, are only two of the goals that the new genetics
offers. The future use of present and developing technologies holds great possibilities
in pharmacogenetics.
Further Reading
Books
Coleman MD (2010) Human Drug Metabolism. An Introduction. Wiley-Blackwell.
Maitland-van der Zee A-H & Daly AK (eds) (2012) Pharmacogenetics and Individualized Therapy. Wiley.
Strachan T, Goodship J & Chinnery PF (2014) Genetics and Genomics in Medicine. Garland Science.
Articles
Brodde OE (2008) Beta-1 and beta-2 adrenoceptor polymorphisms: functional importance, impact on cardiovascular diseases and drug responses. Pharmacol Ther 117:1–29.
Daly AK (2004) Pharmacogenetics of the cytochromes P450. Curr Topics Med Chem 4:1733–1744.
Daly AK & Day CP (2012) Genetic association studies in drug-induced liver injury. Drug Metab Rev 44:116–126.
DeGorter MK, Xia CQ, Yang JJ & Kim RB (2012) Drug transporters in drug efficacy and toxicity. Ann Rev Pharmacol Toxicol 52:249–273.
Guillemette C, Levesque E, Harvey M et al. (2010) UGT genomic diversity: beyond gene duplication. Drug Metab Rev 42:24–44.
Jonas DE & McLeod HL (2009) Genetic and clinical factors relating to warfarin dosing. Trends Pharmacol Sci 30:375–386.
Killeen MJ (2009) Drug-induced arrhythmias and sudden cardiac death: implications for the pharmaceutical industry. Drug Discov Today 14:589–597.
Koren G, Cairns J, Chitayat D et al. (2006) Pharmacogenetics of morphine poisoning in a breastfed neonate of a codeine-prescribed mother. Lancet 368:704.
Mallal S, Phillips E, Carosi G et al. (2008) HLA-B*5701 screening for hypersensitivity to abacavir. N Engl J Med 358:568–579.
Marsh S & Van Booven DJ (2009) The increasing complexity of mercaptopurine pharmacogenomics. Clin Pharmacol Ther 85:139–141.
Meyer UA (2004) Pharmacogenetics – five decades of therapeutic lessons from genetic diversity. Nat Rev Genet 5:669–676.
Niemi M (2010) Transporter pharmacogenetics and statin toxicity. Clin Pharmacol Ther 87:130–133.
Relling MV, Gardner EE, Sandborn WJ et al. (2011) Clinical Pharmacogenetics Implementation Consortium guidelines for thiopurine methyltransferase genotype and thiopurine dosing. Clin Pharmacol Ther 89:387–391.
Rubboli A, Becattini C & Verheugt FWA (2011) Incidence, clinical impact and risk of bleeding during oral anticoagulation therapy. World J Cardiol 26:351–358.
Scott SA, Sangkuhl K, Gardner EE et al. (2011) Clinical Pharmacogenetics Implementation Consortium guidelines for cytochrome P450-2C19 (CYP2C19) genotype and clopidogrel therapy. Clin Pharmacol Ther 90:328–332.
Uetrecht J (2007) Idiosyncratic drug reactions: current understanding. Annu Rev Pharmacol Toxicol 47:513–539.
van Schie RM, Wadelius MI, Kamali F et al. (2009) Genotype-guided dosing of coumarin derivatives: the European pharmacogenetics of anticoagulant therapy (EU-PACT) trial design. Pharmacogenomics 10:1687–1695.
Online sources
http://www.cypalleles.ki.se
This is a useful website which provides information on CYP450 gene nomenclature and alleles.
Cancer is a major cause of morbidity and mortality worldwide. Statistics from the USA
and UK reported in 2013/14 showed that 1,690,000 and 331,487 new cancer cases were
diagnosed in the period 2012 and 2011 in the USA and UK, respectively. More than
500,000 American and 159,178 UK citizens die from cancer each year. In the USA and
UK, lung cancer is the leading cause of cancer-related deaths in both men and women,
accounting for 28 and 22% of all cancer mortalities, respectively. In the USA, breast
cancer is the second-leading cause of cancer-related deaths for women (approximately 23
deaths per 100,000 women annually) and prostate cancer is the second-leading cause for
men (approximately 20 deaths per 100,000 men annually), followed by colorectal cancer
for both sexes (15 deaths per 100,000 population annually). In the UK, figures for breast
cancer and prostate cancer are similar to those reported in the USA; however, in the popu-
lation overall bowel cancer accounted for 10% of all cancer-related deaths, whereas breast
and prostate cancer accounted for 7% in 2011.
The examples above are all examples of solid tumors, but it is important to remember
that not all cancers are solid tumors (e.g. leukemias). Some cancers are caused by viruses,
e.g. human papilloma virus is linked to cervical cancer and human herpes virus 8 is
associated with Kaposi’s sarcoma, which is also common in human immune deficiency
virus (HIV-1)/acquired immune deficiency syndrome (AIDS). The viruses that can cause
tumors are often referred to as oncoviruses.
This chapter will address selected examples of recent advances in the field of cancer
genetics, but will not attempt to cover all cancers and genetic risk factors. However,
before we start looking at the selected few it is worth considering the basic statistics on
other forms of cancer. Table 9.1 presents selected currently available data for cancer
deaths per 100,000 individuals for a range of different countries for men and women
published by the World Health Organization (WHO) in 2014 (based on data from
2008; http://apps.who.int/gho/data/node.main.A864). This shows some interesting
contrasts, both between countries and between males and females. In most cases, the
death rate from cancer is lower in women than in men. The size of this
difference varies from country to country, with figures in some cases being twice as high in men as
in women. In some countries, the figures are very similar for men and women, particularly
where the overall rate is low, for example in Namibia and Kuwait. Looking
in more detail at rates for different cancers, Figure 9.1 illustrates the 10 most common
cancers in men and women in the UK in 2011 as published by Cancer Research UK in
2014 (http://www.cancerresearchuk.org/home/). These data show the impact of the
selected examples, although for the UK population only.
However, we cannot discuss all forms of cancer here. In the selected examples, we will
look at genetic risk factors in three categories: novel genetic risk factors i.e. those that have
[Figure 9.1 image: charts for men and women; categories shown: breast cancer, prostate cancer, lung cancer, bowel cancer, uterine cancer, ovarian cancer, malignant melanoma, non-Hodgkin’s lymphoma, brain tumor, pancreatic cancer, kidney cancer, bladder cancer, esophageal cancer, leukemia, and other.]
Figure 9.1: Incidence of the 10 most common cancers in women and men in the UK in 2011. This figure shows
the reported incidence of each form of cancer in 2011 in the UK. Note that some of the cancers do not occur
in both men and women (for example, ovarian and prostate cancer), or are very rare in one sex compared with
the other (for example, breast cancer is rare in men). In addition, some cancers, though common, are not in
the 10 most common for both groups. For example, pancreatic cancer is listed in women but does not make the
top 10 in men for this period. Conversely, ovarian, uterine, and pancreatic cancer are in the top 10 in women, whereas
bladder cancer, esophageal cancer, and leukemia are not. The less common cancers are all included in the group
labeled “other.” Breast cancer accounts for 30% of cancers in women in this figure, while prostate cancer accounts
for 25% of cancers in men, with bowel and lung cancer making up approximately 25% of cancers in both men
(28%) and women (23%). The frequencies for the other listed cancers range from around 5% down to 2%.
(Data from Cancer Research UK.)
been detected only by genome-wide association studies (GWAS), risk factors detected by
GWAS affecting more than one type of cancer, and risk factors previously detected by can-
didate gene studies that have been confirmed by GWAS. In addition, the role of genetics
in understanding response to therapy and advances in treatment of cancers as a result of
genetic research will also be considered in detail in this chapter.
[Figure 9.2 image: bar chart for women and men; x-axis: percentage (0–25); factors shown: occupation, meat consumption, UV radiation, infection, alcohol, low consumption of fruit and vegetables, overweight, and smoking.]
Figure 9.2: Fraction of cancer attributable to selected lifestyle and environmental factors. The figure shows the percentage
of all cancers to which specific risk factors contribute. Men and women are considered separately. The data are for
the UK and are based on information collected for 2010. (Data from Parkin DM, Boyd L & Walker LC [2011] Br J
Cancer 105 (Suppl 2):S77–S81.)
[Figure 9.3 image: histogram of female cases by 5-year age group (0-4 to 85+); left axis: average number of cases per year (0–7000); right axis: 0–500.]
Figure 9.3: Average age at diagnosis of breast cancer. The histogram shows the average number of cases per
year in the UK in 5-year age ranges. The solid line shows the rate per 100,000 women plotted against age. (Data
from Cancer Research UK.)
of these changes in tumor material. However, the emphasis in this chapter is on cancer
susceptibility and individualized treatment of cancer, not the process by which normal
cells become tumor cells.
Table 9.2 Selected examples of identified potential susceptibility loci for breast cancer and the
gene function.
growth factor receptor 2 and in European populations has been shown to have one of the
strongest genetic associations of all the risk variants reported up to now. However, the per
allele odds ratio of approximately 1.26 even for the SNP showing the strongest association
(rs2981578) is relatively modest, though statistically significant. As we may expect, there
are several other SNPs within the gene showing roughly equivalent effects and the molecu-
lar basis for this association is still not completely clear. However, the SNP rs2981578 is in
a site for transcription factor binding that is conserved among species and may affect gene
expression. It is important to note that the SNPs in FGFR2 that were found to be associ-
ated with breast cancer risk were all more strongly related to estrogen receptor-positive
cancer. Differences in associations depending on the estrogen receptor status of the tumor
have been found to extend to other genetic risk factors for breast cancer, though one of
those listed in Table 9.2, TOX3, which encodes a transcription factor, shows a similar risk
for both estrogen receptor-positive and -negative tumors.
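To put a per allele odds ratio of 1.26 in context, the sketch below computes the odds ratio for carriers of one and two risk alleles. It assumes a multiplicative (log-additive) model, which is the usual reading of a per allele odds ratio but is not stated explicitly in the text; the function name is hypothetical.

```python
PER_ALLELE_OR = 1.26  # value quoted in the text for rs2981578


def genotype_odds_ratio(per_allele_or, risk_alleles):
    """Odds ratio relative to zero risk alleles under a multiplicative model.

    Assumption: each additional risk allele multiplies the odds by the
    same factor (log-additive model).
    """
    if risk_alleles not in (0, 1, 2):
        raise ValueError("risk_alleles must be 0, 1 or 2")
    return per_allele_or ** risk_alleles


for n in (0, 1, 2):
    print(n, round(genotype_odds_ratio(PER_ALLELE_OR, n), 2))
```

Under this model, even individuals homozygous for the risk allele have odds of disease less than 1.6 times those of non-carriers, which illustrates why the effect is described as relatively modest.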
Novel insights into lung cancer involving the target for nicotine were detected by GWAS
Prior to the advent of GWAS, genetic susceptibility to lung cancer had been studied quite
widely using the candidate gene case control approach, though there was considerable
inconsistency between reports. The impact of this cancer worldwide and the availability of
existing DNA collections enabled GWAS studies to be performed and data collected from
large numbers of cases. However, the number of loci found to be significantly associated
with the disease has been smaller than for some other cancers, including breast cancer
(summarized in Table 9.3). The most significant association reported for lung cancer is in
a region of chromosome 15 where several genes encoding nicotinic acetylcholine receptors
(CHRNA3 and CHRNA5), the targets for nicotine, are located. It has been proposed that
these genes contribute to an increased risk of becoming addicted to nicotine through the
expression of these receptors in the brain. In addition, expression of these receptors in the
epithelial cells in the lung has been implicated in the development of lung cancer. More
recently, studies of lung cancer in individuals who have never smoked have shown that
this region is not associated with disease development in this group. This observation provides further
evidence that a direct effect of nicotine in the lungs, which is unlikely to
occur in non-smokers, plays a key role in lung cancer development.
A large number of genetic risk factors for prostate cancer have been revealed by GWAS
Prostate cancer is another cancer that has been well studied by GWAS. Up to now, GWAS
studies on this cancer have led to the identification of the largest number of genetic risk
factors for any common cancer, accounting for approximately 30% of total familial risk of
the disease. Some of the most established associations with specific gene loci are listed in
Table 9.4. One of the stronger associations described is with MSMB, which codes for the
prostate-specific protein β-microseminoprotein (PSP94). The gene product is secreted into
seminal plasma after synthesis in the prostate epithelial cells. MSMB appears to function
Table 9.3 Selected potential susceptibility loci for lung cancer and their function.
Table 9.4 Selected potential susceptibility loci for prostate cancer and their function.
as a tumor suppressor gene, and it is believed that the SNP associated with prostate cancer
risk affects transcription, possibly leading to decreased expression. Several other
genes that are expressed only in the prostate are also risk factors. Though the number
of different loci associated with prostate cancer is relatively high, the biological basis of
these associations is not as well understood as for some of the breast cancer risk factors
discussed previously. Several separate loci on chromosome 8q24 have been identified that
may also contribute to prostate cancer risk (see Figure 9.4). It is possible that the genetic
risk is outside the exome as a large amount of genetic variation is encoded within intronic
sequences. It is also possible that epigenetics plays a significant role in this and some other
[Figure 9.4 image: map of the 8q24 region showing the MYC gene and the loci rs16901979 and rs6983267.]
Figure 9.4: Multiple cancer susceptibility loci on 8q24 defined by cancer GWAS. Approximate locations of
GWAS-reported cancer susceptibility loci are indicated by vertical arrows on the linkage disequilibrium pattern of
the 1000 Genomes Project “CEU” (Northern Europeans from Utah) data (November 2010 release, chr8:127,878–
128,880 kb genomic region, reference build 37.1). The arrowheads indicate probable recombination hotspots as per
HapMap 1 and 2. Five distinct regions have been associated with prostate cancer risk (regions 1–5). Region 3 is also
conclusively associated with colorectal cancer. Region 4 also harbors a breast cancer susceptibility locus rs13281615,
and a bladder cancer susceptibility locus rs9642880 is telomeric to region 1 and approximately 30 kb centromeric to
the MYC oncogene. (From Chung CC & Chanock SJ [2011] Hum Genet 130:59–78. With permission from Springer.)
cancers. Finally, some but not all of the associations appear to be with polymorphisms that
act as risk factors in several cancers rather than being risk alleles confined to
one organ or one system (see Section 9.4).
cell differentiation. Abnormal telomere length has also been demonstrated in many cancers.
The biology of CLPTM1L is less well understood, but there are limited data suggesting a
role in apoptosis following genotoxic stress. The cancers that show associations
with the TERT–CLPTM1L region include lung cancer, brain tumors involving glial cells
(glioma), bladder, testicular, and pancreatic cancer, together with basal cell carcinoma and
melanoma, which affect the skin. The relationship between polymorphisms in this region
and susceptibility to particular cancers is quite complex. A single SNP, rs401681, has been
identified as a risk factor for lung cancer and basal cell carcinoma. This SNP has also been
suggested to be protective against another skin cancer, melanoma, and against pancreatic cancer.
For lung cancer, several different SNPs across the entire 100 kb region show associations.
One of these, rs2736100, lies within TERT and appears to be a risk factor for lung cancer in
both smokers and non-smokers, with the effect strongest in adenocarcinoma, a tumor type
that is particularly common in non-smokers. It is also a risk factor for glioma.
Discovery of these common genetic risk factors for various cancers is an interesting devel-
opment and should add to our understanding of cancer biology. It is useful to consider
the situation in research into the genetics of immune-mediated diseases where many of the
risk loci are shared between diseases. This may indicate common pathways in disease and
illustrates how understanding genetics of disease, even when the risk ratios are small, can
help us to understand disease pathogenesis.
Table 9.6 Genetic associations relevant to carcinogen metabolism detected by GWAS and candidate gene approaches.
Gene | Associations
Alcohol dehydrogenases ADH1B and ADH7 | Both contribute to ethanol metabolism; SNPs within these genes have been shown to contribute to risk of upper aerodigestive cancer by both GWAS and candidate gene studies
N-acetyltransferase 2 (NAT2) | Polymorphisms in this gene associated with absence of enzyme activity are risk factors for bladder cancer development in both GWAS and candidate gene studies
Glutathione S-transferase M1 (GSTM1) | Null allele is associated with increased risk of bladder cancer
individuals positive for variant alleles in two alcohol dehydrogenase isoforms, ADH1B and
ADH7, show a decreased risk of upper aerodigestive cancer. Original findings based on
candidate gene case control studies have now been confirmed by GWAS.
NAT2 and bladder cancer
Tobacco smoking is an important risk factor for development of bladder cancer.
N-acetyltransferase 2 (NAT2) is an enzyme that conjugates certain chemicals (including
some prescribed drugs) with acetyl groups. Typical substrates include the aromatic amines
that are found in tobacco smoke. It has been known since the 1950s that NAT2 activity is
absent in approximately 50% of individuals, who are often referred to as slow acetylators.
The genetic basis of this recessive defect is now well understood with a number of poly-
morphisms in the coding gene (NAT2), including several non-synonymous SNPs, shown
to be associated with absence of activity. From the late 1970s onwards, there were isolated
reports that slow acetylators were over-represented among cases of bladder cancer. A recent
large well-controlled study in the USA found that slow acetylators who smoked heavily
were over-represented among bladder cancer cases compared with those with one or two
wild-type alleles, though the overall increase in risk was small. The association between
tagged SNPs predicting the NAT2 slow acetylator phenotype and increased risk of bladder
cancer in smokers has been confirmed by GWAS.
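Over-representation of slow acetylators among cases is usually summarized as an odds ratio from a 2 × 2 table. The sketch below uses hypothetical counts (not data from the studies described) and the Woolf (log) method for an approximate 95% confidence interval; the function and variable names are assumptions.

```python
import math


def odds_ratio_with_ci(exp_cases, unexp_cases, exp_controls, unexp_controls, z=1.96):
    """Odds ratio and approximate 95% CI using Woolf's log method.

    'exp' = slow acetylators, 'unexp' = all other genotypes.
    All four counts must be non-zero.
    """
    odds_ratio = (exp_cases * unexp_controls) / (unexp_cases * exp_controls)
    se_log_or = math.sqrt(
        1 / exp_cases + 1 / unexp_cases + 1 / exp_controls + 1 / unexp_controls
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only.
or_, ci_low, ci_high = odds_ratio_with_ci(
    exp_cases=600, unexp_cases=400, exp_controls=500, unexp_controls=500
)
print(f"OR = {or_:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

An interval that excludes 1 indicates a statistically significant over-representation, but as the text notes, a significant association can still correspond to a small increase in risk.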
In 1982, a small study in the UK based on cases of bladder cancer among former workers
in the dye industry, who were likely to have been exposed to the carcinogen benzidine,
showed that the slow acetylator phenotype was significantly over-represented. However,
slow acetylators without occupational exposure to benzidine also appeared to have a
slightly increased risk of developing bladder cancer. The occupational exposure reported in
the 1982 study dated back to the 1950s. It is likely that the currently lower incidence of bladder
cancer, which is down by 14% in women and 18% in men in the UK, is partly due
to subsequent improvements in regulations on occupational exposure to chemicals. Nevertheless, the slightly increased risk of this tumor among
slow acetylators continues to be reported.
GSTM1 and bladder cancer
There are also reports of a relationship between the GSTM1 genotype and increased risk of
bladder cancer. The GSTM1 gene encodes glutathione S-transferase M1—an enzyme that
conjugates and detoxifies chemicals. Approximately 50% of many populations lack this
enzyme because they are homozygous for a large deletion (null allele) in the GSTM1 gene.
There are a relatively large number of published studies showing that those homozygous
for the null allele are at increased risk of bladder cancer development. This observation was
confirmed in the same GWAS study from the USA as that mentioned above for NAT2,
though the overall effect for GSTM1 was smaller, with a 1.28-fold increased risk for those
predicted to have two null GSTM1 alleles on the basis of genotyping using tagged SNPs
specific for the null variant compared with other genotypes. However, a separate GWAS
has shown that the GSTM1 null genotype is a risk factor for bladder cancer in non-
smokers only. This is a slightly unexpected finding, but there is at least partial agreement
between the early candidate gene studies and more recent GWAS.
In summary, the importance of polymorphisms in genes relevant to carcinogen metabolism
as risk factors for cancer has been overestimated in the past with a small number of
exceptions, including those discussed above. There are a number of possible explanations
for this smaller than anticipated role, one of which is redundancy in biological systems,
whereby the existence of a number of different proteins able to carry out the same
function makes up for the deficiency created by the non-expressed or absent gene. This
compensation acts as a safeguard against extinction. It is possible that the biological effect
(the trait or disease) is only seen when the system is stressed to a point where compensation
alone is not enough to make up for the deficiency.
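[Figure 9.5 image: diagram of the translocation showing the BCR and ABL1 genes and the fusion products BCR–ABL1 and ABL1–BCR.]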
Figure 9.5 The Philadelphia chromosome. The Philadelphia chromosome is created by a translocation
between chromosome 9 and chromosome 22. The transformed chromosomes are labeled as “der (9)” for
chromosome 9 and “Ph1” for chromosome 22. Ph1 is the Philadelphia chromosome. The translocation joins exons
of the breakpoint cluster region (BCR) gene and the ABL1 oncogene. (From Strachan T, Goodship J & Chinnery P
[2014] Genetics and Genomics in Medicine. Garland Science.)
basis of the transcriptional profile of the tumor for expression of a number of genes. Clinical
trials of several different gene expression profiling protocols for breast tumors have been
performed. The number of different genes whose expression is profiled is typically between
20 and 70, and the main use of such a profile is currently to assist with decisions on whether
patients should receive chemotherapy in addition to other treatments, such as endocrine
therapy and radiotherapy. Use of this profiling is currently under investigation in the treat-
ment of several cancers, and in the future, designing individual care plans based on ques-
tions, such as which chemotherapeutic agents to apply, whether or not to use radiotherapy,
and which other drugs to prescribe, could become part of standard clinical practice.
Figure 9.6 Percentage change in cancer frequency in women 2000–2002 versus 2009–2011. The figure
shows how frequencies of some common cancers have changed in the new millennium. There are some
differences compared with men (see Figure 9.7). These are mostly based on anatomy, but some are surprising.
Thus, while bladder and stomach cancer have fallen in both groups, lung cancer has risen in women and fallen in
men. These clearly reflect lifestyle changes, whereas others are more difficult to pinpoint. By comparing sexes we
may be able to identify sex-specific risk factors in common cancers. (Data from Cancer Research UK.)
Figure 9.7 Percentage change in cancer frequency in men 2000–2002 versus 2009–2011. The figure shows
how frequencies of some common cancers have changed in the new millennium. Bladder cancer, stomach
cancer, lung cancer, and cancer of the larynx have fallen, while there have been major increases in liver, thyroid,
skin, prostate, oral and kidney cancers, as well as both Hodgkin’s lymphoma and non-Hodgkin’s lymphoma.
Some of these clearly reflect lifestyle changes, whereas others are more difficult to pinpoint. (Data from Cancer
Research UK.)
are based on the UK data, similar figures are seen throughout the developed world. Figures
are different in less developed countries and can vary quite widely, as we can see looking at
the WHO site (http://apps.who.int/gho/data/node.main.A864). The figures for the UK
that are shown illustrate important shared changes in male and female cancer, and also
some important differences between males and females. In particular, the frequency of lung
cancer has increased in women (up 13%) but has fallen in men (down 14%). It is interesting to
speculate on why this may have occurred; for example, is this due to an increase in smoking
among young women? The data presented also indicate the low
frequency of non-solid tumors, which do not count in the top 10. Changes in the inci-
dence of various cancers in the new millennium reflect the impact of the environment and
societal changes on disease, and not just on cancers. They are unlikely to reflect genetic
changes in our population.
Conclusions
There are several different forms of cancer: some cancers are familial Mendelian cancers,
others are sporadic cancers. Genetics plays a role in both types. However, unlike other
diseases, there are two genomes to consider in cancer—the host genome and the tumor
genome. Looking at both early and more recent studies of cancer indicates that host genes
are important in all forms of cancer. Genetic markers in the form of tagged SNPs and
copy number repeat sequences have been used in GWAS to detect possible cancer genes
and their location across and throughout the genome. Some, though not all, of the genetic
risk factors previously detected by candidate gene analysis have also been confirmed by
GWAS. This research is important because not only does it enable us to identify poten-
tial cancer-causing genes and networks, it may also help us to use genetics to understand
the response to therapy in cancer and to develop better (more personal), novel treatment
options.
The last 10 years have seen considerable progress in our understanding of the genetic basis
of susceptibility and resistance to cancer. The application of GWAS in cancer has resulted
in the identification of some well-replicated risk factors for a number of cancers; however,
slightly disappointingly, the effect sizes remain small. Some may question the use of this
new knowledge. However, there is hope. These findings help to inform the debate on
the genesis of cancer and it is possible that, in the future, developments in the areas of
genome sequencing and epigenomics may enable additional risk factors to be identified
that will help us to develop more advanced screening programs. The future lies in inte-
grating knowledge of both the patient’s biology and the tumor biology at the levels of
DNA, RNA, protein, and other molecules, so that the likely response to a wide range of
treatments can be modeled mathematically. Ultimately, this information may be used to
inform clinical management of the cancer. In addition, the information gathered will help
to make systems biology a reality.
Further advances will require larger, more extensive studies and the application of alternative
technologies, such as exome sequencing and whole-genome sequencing, using a
variety of next-generation technologies. In 2013, the International Cancer Group (iCOGS)
published the largest cancer genetic association study to date, covering breast, prostate, and
ovarian cancer (approximately 200,000 SNP probes in approximately 200,000 case/control
samples). Large as these numbers seem, there are plans underway to repeat the iCOGS study with
the larger OncoArray, which has approximately 600,000 SNP/copy number variation probes, in
approximately 400,000 case/control samples. Lung and colorectal cancer may also be included in
the next study. The iCOGS study represents an example of how the design of association
studies has developed over the years and how there will be increasing reliance on large
international collaborative networks.
Probably the most promising use of genetics in cancer so far has been to inform
the development of personalized drug treatments for the disease. This has been generally
successful and is being used increasingly, especially in the treatment of advanced cancers
that may be refractory to other treatments. It is likely that personalized approaches will
become more common at all stages of cancer in the future.
Further Reading
Books
Spector T (2012) Identically Different: Why You Can Change Your Genes. Weidenfeld & Nicholson.
Maitland-van der Zee A-H & Daly A (2012) Pharmacogenetics and Individualized Therapy. Wiley.
Strachan T, Goodship J & Chinnery P (2014) Genetics and Genomics in Medicine. Garland Science.
Weinberg RA (2007) The Biology of Cancer. Garland Science.
Articles
Byrne HM (2010) Dissecting cancer through mathematics: from the cell to the animal model. Nat Rev Cancer 10:221–230.
Chung CC & Chanock SJ (2011) Current status of genome-wide association studies in cancer. Hum Genet 130:59–78.
Dowsett M & Dunbier AK (2008) Emerging biomarkers and new understanding of traditional markers in personalized therapy for breast cancer. Clin Cancer Res 14:8019–8026.
Figueroa JD, Ye Y, Siddiq A et al. (2014) Genome-wide association study identifies multiple loci associated with bladder cancer risk. Hum Mol Genet 23:1387–1398.
Hanahan D & Weinberg RA (2000) The hallmarks of cancer. Cell 100:57–70.
Heyn H & Esteller M (2012) DNA methylation profiling in the clinic: applications and challenges. Nat Rev Genet 13:679–692.
Hindorff LA, Gillanders EM & Manolio TA (2011) Genetic architecture of cancer and other complex diseases: lessons learned and future directions. Carcinogenesis 32:945–954.
Hosking FJ, Dobbins SE & Houlston RS (2011) Genome-wide association studies for detecting cancer susceptibility. Br Med Bull 97:27–46.
Parkin DM, Boyd L & Walker LC (2011) The fraction of cancer attributable to lifestyle and environmental factors in the UK in 2010. Br J Cancer 105 (Suppl 2):S77–S81.
Rothman N, Garcia-Closas M, Chatterjee N et al. (2010) A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet 42:978–984.
Varghese JS & Easton DF (2010) Genome-wide association studies in common cancer - what have we learnt? Curr Opin Genet Dev 20:201–209.
Wheeler HE, Maitland ML, Dolan ME et al. (2012) Cancer pharmacogenomics: strategies and challenges. Nat Rev Genet 14:23–34.
Online sources
http://apps.who.int/gho/data/node.main.A864
A good source of data on cancer and other diseases worldwide from the World Health Organisation (WHO).
http://www.cancerresearchuk.org/home/
This is an excellent source of data for cancer in the UK.
http://www.cogseu.org
This is the study database quoted in the text section above regarding a Collaborative Oncology Gene-environment Study consortium that has focused specifically on identifying genetic risk factors for breast, ovarian, and prostate cancers.
Diabetes has a clear and demonstrable genetic component. There are two major forms
of diabetes with similar signs and symptoms: type 1 diabetes (T1D) and type 2 diabetes
(T2D). There is a considerable degree of clinical overlap between these two diseases. There
is also some clinical overlap with other diseases. In this chapter, we will look at these two
major forms of diabetes specifically focusing on their genetics. We will also briefly con-
sider the less common form of diabetes referred to as maturity onset diabetes of the young
(MODY). By comparing and contrasting the different forms of diabetes, we will be able to
demonstrate how genotype can be used to illustrate phenotype and to help us understand
disease pathogenesis.
(Bar chart; categories: Cardiovascular, Cancer, Respiratory, Diabetes.)
Figure 10.1: WHO data for deaths due to NCDs in 2008. The figure shows the impact of three major disease
groups on mortality in 2008, reported in 2014. Though the diabetes column appears small compared with
cardiovascular disease and cancer, these are very broad disease categories compared to diabetes, which is more
specific. (Data from WHO.)
(Pie chart segments: T1D, T2D, MODY.)
Figure 10.2: Worldwide incidence of T1D and T2D. Based on WHO data, there are approximately 347 million
cases of diabetes worldwide with the majority being T2D (approximately 85%), as indicated in this pie chart, and
the remainder being mostly T1D (10%) and the minority having MODY (5%).
Islets of Langerhans: pre-pro-insulin → (signal peptide cleavage) → pro-insulin → (further protease cleavage of 35 amino acids) → mature insulin
Figure 10.3: Three phases of insulin production. Insulin produced in the islet cells goes through three phases
before it is functional.
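The three phases in Figure 10.3 can be followed as simple arithmetic on chain length. The sketch below is illustrative only: the figure supplies the 35 amino acids removed from pro-insulin, while the starting length of 110 amino acids and the 24-residue signal peptide are assumptions (the commonly quoted values for human insulin) that do not appear in the text.

```python
# Assumptions (not stated in the text): human pre-pro-insulin is 110 amino
# acids long and its signal peptide is 24 amino acids long.
PRE_PRO_INSULIN_LENGTH = 110
SIGNAL_PEPTIDE_LENGTH = 24
# From Figure 10.3: a further 35 amino acids are removed from pro-insulin.
REMOVED_FROM_PRO_INSULIN = 35


def insulin_processing_lengths() -> dict:
    """Return the chain length (in amino acids) at each phase of processing."""
    pro_insulin = PRE_PRO_INSULIN_LENGTH - SIGNAL_PEPTIDE_LENGTH
    mature_insulin = pro_insulin - REMOVED_FROM_PRO_INSULIN
    return {
        "pre-pro-insulin": PRE_PRO_INSULIN_LENGTH,
        "pro-insulin": pro_insulin,
        "mature insulin": mature_insulin,
    }


lengths = insulin_processing_lengths()
for phase, length in lengths.items():
    print(f"{phase}: {length} amino acids")
```

With these assumed starting values, pro-insulin has 86 amino acids and mature insulin has 51.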
while both T1D and T2D are genetically complex. To date, as with a number of other
genetically complex diseases, including most autoimmune diseases and some cancers,
only a proportion of the overall genetic risk for T1D and T2D has been accounted for.
The amount of progress that has been made varies between the different types of diabetes:
excellent progress has been made with respect to T1D, but progress
has been slower for T2D.
Table 10.1 Sibling relative risk (λ) values for different diseases.
Disease                     Estimated λ
T1D                         15
T2D                         3
Celiac disease              50
Ulcerative colitis          7–17
Crohn’s disease             13–36
Multiple sclerosis          20–50
Primary biliary cirrhosis   10.5 (rising to 58 in daughters of affected mothers)
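The λ values in Table 10.1 can be read as the ratio of the risk to a sibling of an affected person to the risk in the general population. A minimal sketch of that arithmetic follows. The population risk of 0.4% used for T1D is an illustrative assumption, not a figure from the text, and the function name is hypothetical.

```python
def sibling_risk(lambda_s: float, population_risk: float) -> float:
    """Absolute risk to a sibling of an affected person, given the sibling
    relative risk (lambda) and the risk in the general population."""
    if not 0.0 <= population_risk <= 1.0:
        raise ValueError("population_risk must lie between 0 and 1")
    return lambda_s * population_risk


T1D_LAMBDA = 15                  # from Table 10.1
ASSUMED_POPULATION_RISK = 0.004  # assumption for illustration only (0.4%)

t1d_sibling_risk = sibling_risk(T1D_LAMBDA, ASSUMED_POPULATION_RISK)
print(f"Estimated risk to a sibling: {t1d_sibling_risk:.1%}")
```

Under the assumed population risk, a λ of 15 corresponds to a sibling risk of about 6%. A different population risk would scale the result proportionally.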
the first study on HLA found no association. This later proved to be incorrect, and asso-
ciations with B7, B8, B15, B18, Cw3, DR2, DR3, and DR4 were soon identified. Most
of these studies were classical case control association studies, though there were excep-
tions (Figure 10.4). In parallel with this work on the major histocompatibility complex
(MHC), others were looking outside the MHC for genetic linkage using
multi-case families.
HLA class II genotype is the strongest genetic risk factor for T1D
Early studies reported associations with HLA class I antigens, but in the 1980s the focus
shifted to the HLA class II antigens and DR in particular. Studies identified associations
with DR3 and DR4 as major risk factors, and DR2 (later DR15) as a protective factor.
The development of better technology for HLA typing, including the introduction of
both restriction fragment length polymorphism (RFLP) analysis and polymerase chain
reaction (PCR) analysis (Chapter 6) led to the confirmation of the associations with DR2,
(Bar chart; y-axis: Percentage; legend: 0, 1, 2; groups: Sibling pairs, T1D sibling pairs.)
Figure 10.4: Sibling pair analysis in T1D looking at HLA haplotype sharing. The figure shows haplotype
sharing in two groups: a group of non-diabetic siblings and a group of diabetic sibling pairs. This study illustrates
the strong impact of HLA in T1D. This type of study is unusual, but it can be very useful as it does not identify
specific alleles but concentrates on haplotype sharing. (Adapted from Parham P [2009] The Immune System, 3rd
ed. Garland Science.)
DR3, and DR4 at a molecular level (now labeled DRB1*15, DRB1*03, and DRB1*04, respectively). In
addition, the application of PCR enabled the associations with specific DQB1 and DQA1
alleles carried on the DRB1*03, DRB1*04, and DRB1*15 haplotypes to be identified for
the first time. Genotyping studies rapidly confirmed these associations in different cohorts
of T1D patients. In 1987, Todd et al. published a groundbreaking study suggesting that
HLA-encoded susceptibility to T1D is due to a specific amino acid at position 57 of the
DQβ-chain. The amino acid aspartate is found on DQB1 alleles that are associated with a
reduced risk of T1D. In contrast, DQB1 alleles that are associated with an increased risk
encode alanine, valine, or serine at position 57 (see Chapter 6). This study was the first to
consider HLA associations from a functional perspective. The study was published in the
same year that the crystal structure of HLA-A2 was published in Nature and the authors of
the T1D paper were able to show that this subtle structural difference determined whether
a salt bridge would form over the antigen-binding groove of the expressed DQ molecule.
Formation of the salt bridge involved the interaction between position 57 on the DQβ-
chain and arginine at position 79 on the DQα-chain. This interaction may be critical in
determining which antigens are preferentially presented to the T cell receptor (TCR) in
the formation of the immune synapse and the orientation of the bound peptide antigen
in that process.
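The position 57 rule described above can be written down as a small lookup. This is only a restatement of the rule as given in the text, using one-letter amino acid codes. It is not a clinical predictor, and the function name and return labels are hypothetical.

```python
# One-letter amino acid codes for the residues named in the text.
PROTECTIVE_RESIDUES = {"D"}        # aspartate
RISK_RESIDUES = {"A", "V", "S"}    # alanine, valine, serine


def classify_position_57(residue: str) -> str:
    """Classify the residue at position 57 of the DQ beta-chain according to
    the rule described in the text."""
    code = residue.strip().upper()
    if code in PROTECTIVE_RESIDUES:
        return "reduced risk"
    if code in RISK_RESIDUES:
        return "increased risk"
    return "not described in the text"


for residue in ["D", "A", "V", "S", "G"]:
    print(residue, "->", classify_position_57(residue))
```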
This is certainly not the whole story for HLA and T1D. The associations established
in the 1980s and 1990s have mostly been confirmed. In 1994, an early genome-wide
study on affected sibling pairs using microsatellite markers confirmed a strong signal from
the MHC. The 2007 Wellcome Trust Case Control Consortium (WTCCC1) study also
found a very strong signal for the MHC [odds ratio (OR) = 5.49, P = 2.42 × 10⁻¹³⁴]. The
HLA class II haplotypes that are positively associated with T1D include DRB1*03:01–
DQA1*05:01–DQB1*02:01 (the HLA 8.1 ancestral haplotype) and DRB1*04:01–
DQA1*03:01–DQB1*03:02. Approximately 95% of patients of European ancestry with
T1D are positive for one or both of the two haplotypes. In contrast, only 40% of non-
diabetics of European ancestry have one or both of these haplotypes. The HLA class II hap-
lotype DRB1*15:01–DQA1*01:02–DQB1*06:02 is significantly less common in affected
individuals. The estimated contribution of HLA to genetic risk for T1D is between 30%
and 50%. Interestingly, these three haplotypes, which are all common, are also associated
with a wide variety of different diseases, especially immune-mediated and autoimmune
diseases. The risk alleles do not always have the same effect in different diseases—the
DQB1*06:02 haplotype, which is protective in T1D, is associated with an increased risk
of both multiple sclerosis and severe narcolepsy.
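The carrier frequencies quoted above (approximately 95% of patients and 40% of non-diabetics carrying one or both risk haplotypes) can be turned into an odds ratio. The sketch below shows the arithmetic; the function names are illustrative, and the result describes carriage of either haplotype, so it is not directly comparable with the single-marker OR reported by the WTCCC1 study.

```python
def odds(probability: float) -> float:
    """Convert a probability (strictly between 0 and 1) into odds."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return probability / (1.0 - probability)


def odds_ratio(frequency_in_cases: float, frequency_in_controls: float) -> float:
    """Odds of carrying the factor in cases divided by the odds in controls."""
    return odds(frequency_in_cases) / odds(frequency_in_controls)


# Frequencies quoted in the text for patients and non-diabetics of European ancestry.
carrier_odds_ratio = odds_ratio(0.95, 0.40)
print(f"Odds ratio for carrying one or both haplotypes: {carrier_odds_ratio:.1f}")
```

The odds are 19 in patients (0.95/0.05) and about 0.67 in non-diabetics (0.40/0.60), giving an odds ratio of about 28.5.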
Not all of the risk for T1D above may be associated with the DQB
allele or HLA class II
These extended haplotypes carry a number of other potential risk genes. For example,
the HLA 8.1 ancestral haplotype carries MICA*008, which has been associated with an
increased risk of the autoimmune liver disease primary sclerosing cholangitis and is involved
in killer cell activation. The 8.1 haplotype also carries a null allele for C4A, thought to
be the main factor in MHC-encoded susceptibility to sporadic cases of systemic lupus
erythematosus. Therefore, it is possible that some of these haplotypes carry more than one
risk allele. Such a multihit hypothesis would explain the strong effect of the MHC in T1D
and other autoimmune diseases.
Finally, more recent evidence suggests an additional HLA class I-encoded contribution
to susceptibility to T1D that is independent of HLA class II. This was reported in a
study involving more than 5000 cases. Association signals from HLA-A and HLA-B
were seen that were independent of HLA class II and not due to linkage disequilibrium.
Based on this study, HLA-B*39 appears to be the strongest independent class I risk
allele, but additional HLA-A and HLA-B alleles also appear to contribute and may
be important as determinants of age of onset of disease. Interestingly, an association
with HLA-B*39 was not listed in the compendium of early studies of HLA under the
subtitle diabetes.
Other genetic risk factors for T1D include the genotype for the
insulin gene
Evidence that the insulin gene was a predictor for susceptibility to T1D was reported in
the early 1980s in a case control association study that used RFLP analysis. This asso-
ciation was subsequently confirmed using the transmission disequilibrium test (TDT)
in families with at least one affected sibling pair. The 1994 genome-wide study that was
based on the use of variable number tandem repeats (VNTRs or microsatellites as they
are known) in affected sibling pairs also reported a signal in the insulin gene region,
though it was not as significant as that seen for HLA. The signal was given the label
IDDM2. The insulin gene codes for an insulin precursor known as pre-pro-insulin.
This precursor is converted in the endoplasmic reticulum to pro-insulin and pro-
insulin is converted to insulin by the enzymatic removal of a segment that connects
the amino end of the α-chain to the carboxyl end of the β-chain. This produces the
bipeptide chain of mature insulin. The segment that is removed is referred to as the
connecting (C) peptide. Despite ongoing studies, the underlying mechanism for
the association with the insulin gene remains unclear. However, it appears to involve
variation in the 5′-non-coding sequence that results in decreased expression of the
insulin precursor pre-pro-insulin. It is likely that this occurs in the thymus, but not
in the pancreas, and it has been proposed that the decreased expression in the thymus
leads to decreased immune tolerance to insulin resulting in an inappropriate immune
response in the pancreas.
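The transmission disequilibrium test mentioned above compares how often heterozygous parents transmit a given allele to an affected child with how often they do not. A minimal sketch of the usual McNemar-type statistic is given below. The counts are hypothetical and are not taken from any study of the insulin gene.

```python
import math


def tdt_statistic(transmitted: int, untransmitted: int) -> tuple:
    """Return the TDT chi-square statistic (1 degree of freedom) and its P value.

    transmitted / untransmitted: number of times heterozygous parents did or
    did not pass the allele of interest to an affected child.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("at least one informative transmission is required")
    chi_square = (transmitted - untransmitted) ** 2 / total
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical counts for illustration only.
chi_square, p_value = tdt_statistic(transmitted=60, untransmitted=40)
print(f"chi-square = {chi_square:.2f}, P = {p_value:.4f}")
```

Because only transmissions from heterozygous parents are counted, the test is based on within-family comparisons rather than on a separate control group.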
PTPN22
PTPN22 encodes a lymphoid-specific intracellular phosphatase that is a negative
regulator of TCR signaling. This occurs by direct de-phosphorylation of various
cell signaling proteins including the Src family kinases LCK and FYN. Associations
with this gene have been described for a number of different autoimmune diseases
including rheumatoid arthritis.
Table 10.2 Pre-genome era regions identified as potential areas for candidate genes in T1D.
The table provides a summary of potential areas for investigation or known to be associated with T1D prior to
the publication of the Human Genome Mapping Project. Among the candidates, IDDM1, IDDM2, and IDDM7
have all been confirmed.
CTLA4
CTLA4 encodes a co-stimulatory molecule expressed by activated T cells. It binds to
CD80 and CD86 (previously known as B7) on antigen-presenting cells (APCs) and
transmits an inhibitory signal, thereby deactivating the T cells and turning off the
immune response. This process is competitive. In early T cell activation, T cells express
CD28 which binds with CD80 and CD86 on the APCs, sending a positive signal
that promotes the T cell immune response. As the immune response progresses, more
CTLA4 protein is expressed on T cells and the immune response is downregulated
(Figure 10.5). There are many polymorphisms in the CTLA4 gene. Prior to the use of
GWAS, Ueda et al. investigated 108 single nucleotide polymorphisms (SNPs) in and
around the CTLA4 gene in a large cohort of T1D cases and identified a specific SNP,
which they labeled CT60, as the CTLA4 SNP most strongly associated with T1D. Recent studies
indicate that CTLA4 is associated with T1D, but have not pinpointed the specific risk
allele. Associations with this gene have been described for several autoimmune dis-
eases, most notably with the autoimmune thyroid disease Graves disease. Interestingly,
Ueda et al. included Graves disease in their study and also found the strongest asso-
ciation with the so-called CT60 SNP. Finding associations with the same gene in
several diseases is not surprising, especially when the gene in question has a broad (or
non-specific) function, such as CTLA4, which encodes a non-specific immunoregulatory
protein.
IL2RA
IL2RA codes for the α subunit of the IL-2 receptor. The high-affinity form of the IL-2 receptor
is composed of α, β, and γ subunits. IL-2 is a major cytokine involved in T cell
activation. IL2RA is constitutively expressed at high levels in CD4+ T cells, which are
also positive for the FoxP3 protein. These cells are believed to be important in immune
tolerance to self-proteins. The association with T1D is protective and appears to involve a
(Figure labels: APC, CD80/CD86.)
Figure 10.5: Suppression of auto-reactive T cells by CTLA4 signaling. The figure illustrates the process
of CTLA4 signaling by the CD25+ CD4+ T cell. The process is the same as that for T cell activation; however,
in this situation the T cells send out a negative response on activation by the APC. Antigenic peptides are
represented here as stars. Antigenic peptides are presented to the TCR by MHC class II proteins (shown in
gray as an inverted triangle supporting a curved structure), which is mounted on the APC (also shown in gray).
There is co-recognition of this molecular complex by CD4 (shown as a lozenge with a short leg). The first part
of the process is the same for both T cells. However, a second interaction occurs which is different. The T cell
on the right expresses CD28, which interacts with the CD80/CD86 molecule (B7 in some texts; shown here as
dark curves) and this causes T cells to induce immune activation. The T cell on the left expresses CTLA4, which
interacts with CD80/CD86, sending a signal that downregulates the immune response. CD28 and CTLA4 are both
illustrated as gray tubes on the surface of the T cells.
(Two graphs, panels (a) and (b), labeled Genotype A and Genotype B; x-axis: Time; y-axis: Biological activity.)
Figure 10.6: Illustration of the reason why both elements of a system need to be considered when
considering the impact of a polymorphism on a trait. The two graphs illustrate the interaction between
genetic polymorphisms in a biological pathway. If we consider a model where we have only two components,
e.g. IL-2 and its receptor, biological activity can increase to infinity as long as there is adequate production of
both the receptor and its ligand. If genotype A represents those with high ligand expression, activity will increase
for as long as there is sufficient receptor expression. Similarly, if genotype A is associated with high receptor
expression, activity will increase provided there is a continuous supply of ligand. Genotype B, however, indicates
a more reasonable proposition. In the second example activity begins to reach an upper limit because one or
other of the two components reaches saturation. In real biological systems high expression of a ligand or receptor
like IL-2 does not necessarily result in activity. Thus, we need to be careful about interpreting data that indicate
increased expression of a gene or protein associated with a specific polymorphism.
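The contrast drawn in Figure 10.6 can be sketched with a toy two-component model. In the sketch below, activity either rises without limit (as for Genotype A) or is capped once one component is saturated (as for Genotype B). The rate and capacity values are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
def unlimited_activity(times, rate):
    """Activity keeps rising while both ligand and receptor remain plentiful."""
    return [rate * t for t in times]


def limited_activity(times, rate, capacity):
    """Activity rises at the same rate but cannot exceed the capacity of the
    limiting component (ligand or receptor)."""
    return [min(rate * t, capacity) for t in times]


times = [0, 1, 2, 3, 4, 5]
genotype_a = unlimited_activity(times, rate=20)
genotype_b = limited_activity(times, rate=20, capacity=60)

for t, a, b in zip(times, genotype_a, genotype_b):
    print(f"time {t}: genotype A = {a}, genotype B = {b}")
```

In this toy model, raising the rate alone would not change the plateau for Genotype B. Only a change in the limiting component would do so, which is the point made in the caption.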
The 2007 WTCCC1 study was one of the first GWAS in T1D
The WTCCC1 2007 GWAS included 2000 T1D cases. Several regions were identified as
major risk determinants including all of the genes or regions listed above: MHC, CTLA4,
PTPN22, IL2RA/CD25, and IFIH1/MDA5. It is important to note that of these previously
identified regions and genes only two met the stringent statistical threshold of
P < 10⁻⁷ (MHC and PTPN22) and two were in the lower statistical threshold region
Table 10.3 Six major risk genes and regions for T1D in the post-genome era.
Four of the listed genes or regions were identified long before the publication of the Human Genome Mapping
Project and use of GWAS: MHC, INS, CTLA4, and IL2RA. Apart from INS, all of these genes or regions (MHC) have
associations with other diseases (i.e. they are not specific to T1D).
around P < 10⁻⁵ (IL2RA and CTLA4) (Table 10.3). In addition to the associations above,
a number of new regions were identified as showing strong risk (P < 10⁻⁷); these include
12q13, 12q24, and 16p13 together with other regions with similar levels of significance,
including 4q27, 12p13, 18p11, and 10p15 (CD25). Of these, the associations with 12q13,
12q24, 16p13, and 18p11 have all been confirmed in other studies. A number of potential
functional signals can be identified in this group, e.g. 12q13 is close to the ERBB3 gene
that encodes the receptor tyrosine kinase erbB-3 precursor, and 12q24 is close to SH2B3/
LNK (SH2-B adaptor protein 3), TRAFD1 (TRAF-type zinc finger domain containing 1),
and PTPN11 (protein tyrosine phosphatase, non-receptor type 11).
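The two tiers of statistical evidence described above can be expressed as a simple classification of P values. In the sketch below the thresholds are taken as 10⁻⁷ and 10⁻⁵. This is an assumption, since only the powers of ten are taken from the text. The MHC value is the one quoted earlier for the WTCCC1 study, and the other two entries are hypothetical.

```python
STRONG_THRESHOLD = 1e-7     # assumed value for the stringent threshold
MODERATE_THRESHOLD = 1e-5   # assumed value for the lower threshold


def classify_signal(p_value: float) -> str:
    """Place an association P value into one of the evidence tiers."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P value must lie between 0 and 1")
    if p_value < STRONG_THRESHOLD:
        return "strong"
    if p_value < MODERATE_THRESHOLD:
        return "moderate"
    return "below threshold"


signals = {
    "MHC": 2.42e-134,              # quoted in the text
    "hypothetical locus A": 3e-6,  # made-up value for illustration
    "hypothetical locus B": 0.02,  # made-up value for illustration
}
tiers = {name: classify_signal(p) for name, p in signals.items()}
for name, tier in tiers.items():
    print(f"{name}: {tier}")
```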
PTPN11 is a particularly interesting candidate for T1D as it is a member of the same fam-
ily of regulatory phosphatases as PTPN22. As discussed earlier, PTPN22 is associated not
only with T1D but also with Crohn’s disease and rheumatoid arthritis, suggesting overlapping
pathology. The 12q24 region reported above shows a combined signal (measured as the
probability value P) for T1D, Crohn’s disease, and rheumatoid arthritis
of 9.3 × 10⁻¹⁰. Overlapping genetic associations are reported for a number of different
complex diseases (Figure 10.7).
The association with the 10p15 region that contains the CD25 gene is found in Graves
disease, rheumatoid arthritis, and T1D. Graves disease was not included in the WTCCC1
study, but rheumatoid arthritis and T1D were both included. The study reported separate
independent associations for both diseases. CD25 encodes a high-affinity receptor for
IL-2, and this association may highlight the importance of the IL-2 pathway in T1D and
other autoimmune diseases.
(Figure label: Shared susceptibility genes.)
Figure 10.7: Genetic overlap in complex disease—finding the same associations in different diseases.
The figure illustrates some of the shared susceptibility genes for three very common complex diseases: T1D,
rheumatoid arthritis, and Crohn’s disease. The shared susceptibility regions or loci are the MHC (a region not a
locus), the CTLA4 gene, and the PTPN22 gene. The list is not exhaustive as many other genes could be included; it
is only intended to illustrate a point.
(Pie chart segments: MHC 30–40%; INS 10%; other identified genes 20%; unidentified genes 30–40%.)
Figure 10.8: Genetic accounts book for T1D. The figure shows the current estimate of the total genetic impact
of genes identified as risk markers for T1D. The MHC has the strongest impact, with the insulin (INS) gene second.
The other group includes numerous genes involved in immune regulation, but there is still a substantial portion
of the genetic risk to be identified if these figures are accurate.
The addition of replicated novel associations discovered in GWAS to those already well-
established associations from a mix of family studies and candidate gene analysis has further
increased the extent of the genetic account for T1D to approximately 70% (Figure 10.8).
This is much higher than for the majority of complex diseases and is a very different situ-
ation to that seen with T2D (below).
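The genetic account can be checked by adding up the contributions shown in Figure 10.8. The sketch below treats each contribution as a range of percentages. The values are those read from the figure, and the variable and function names are illustrative.

```python
# Contributions to the genetic risk for T1D as (low, high) percentages,
# read from Figure 10.8.
identified = {
    "MHC": (30, 40),
    "INS": (10, 10),
    "Other identified genes": (20, 20),
}


def total_range(components: dict) -> tuple:
    """Sum the lower and upper bounds of all components."""
    low = sum(bounds[0] for bounds in components.values())
    high = sum(bounds[1] for bounds in components.values())
    return low, high


identified_low, identified_high = total_range(identified)
# Whatever is not yet accounted for remains to be identified.
unidentified = (100 - identified_high, 100 - identified_low)

print(f"Identified: {identified_low}-{identified_high}%")
print(f"Unidentified: {unidentified[0]}-{unidentified[1]}%")
```

The identified contributions sum to 60–70%, in line with the figure of approximately 70% given above, leaving 30–40% still to be explained.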
were subsequently confirmed by GWAS. These included significant associations with non-
synonymous SNPs in the PPARG, KCNJ11, and TCF7L2 genes. A number of other asso-
ciations originally described in the same era were not confirmed by GWAS. Confirmation
in later studies is a mark of quality and reflects study design, and especially sample size.
Unlike T1D, there are no associations with HLA or the MHC in T2D. This may indicate
that the disease pathogenesis is very different from that for T1D and does not involve the
same immunological (or other biological) processes.
a number of centers, including the WTCCC1 study, and the number of cases varied from
around 2000 to approximately 8000 cases. Most cases were of European origin. There was
good agreement between the various studies in terms of risk factors detected and the find-
ings were also replicated in additional cases in most studies.
but has not been universally reported. Surprisingly, the FTO association is consider-
ably stronger than that for SLC30A8. However, a number of studies have replicated
the association with SLC30A8, which contrasts with the situation for the FTO gene.
These differences between studies are important. They may reflect genuine evidence of
false positives or they may reflect differences in study design, e.g. the use of different
commercial chips (Illumina or Affymetrix) with different tagged SNPs. One solution to
this would be to use specifically designed chips for studying specific diseases and disease
groups. However, while many companies used to be willing to produce specific chips at
a reasonably low price, this is no longer considered such a viable option and it may not
be possible in the future. Observed differences in associations reported can arise from
differences in the close proximity between tagged SNPs and actual disease alleles for
which they are markers.
CDKAL1
The third strong association signal in the WTCCC1 GWAS was with the CDKAL1
(CDK5 regulatory subunit associated protein 1-like 1) gene. The product of this
gene shares homology at the protein domain level with CDK5 regulatory subunit associated
protein 1 (CDK5RAP1), which is known to inhibit the activation of CDK5. CDK5 is a
cyclin-dependent kinase that has been implicated in normal β cell function. The associa-
tion has been confirmed in several GWAS.
(Figure labels: Genes, T2D, Lifestyle.)
that patterns of DNA methylation may vary considerably between tissues, so using DNA
from accessible tissues such as blood may not be representative of methylation patterns
in disease tissues, e.g. the pancreas. A second issue is that accurate assessment of methyla-
tion requires DNA sequencing following treatment with bisulfite, though it is possible to
narrow down the chromosomal regions differentially methylated prior to this sequencing.
Methylation chips that cover methylation sites across the genome are also now available.
Methylation patterns may also be influenced by a variety of environmental factors and so
can change over time. In addition, studying DNA methylation represents only one mea-
sure of epigenetic regulation, which also involves processes such as histone acetylation.
Despite these caveats, studies based on both methylation of candidate genes and EWAS
for T2D have now been reported. One EWAS used DNA from diabetic pancreatic islets
and detected differential methylation at promoter regions of 254 genes. These methylation
changes were not present in DNA from blood samples. Another recent EWAS reported
lower levels of methylation in the region of genes such as TCF7L2 in DNA from diabetic
blood samples, but did not study methylation patterns in other tissues. However, both
approaches may be valid. Further discussion of epigenomics is beyond the scope of this
chapter, but it seems likely that this area of research may provide novel insights into the
risk of T2D and other diseases in the near future (see Chapter 12, Section 12.5).
Gene Location
MODY1 20q13
MODY2 7p13
MODY3 12q24.3
MODY4 13q12.2
MODY6 2q31.3
MODY7 2p25.1
MODY8 9q34.2
MODY9 7q32.1
MODY10 11p15.5
MODY11 8p23.1
on the specific gene defect. These are monogenic diseases. Two subtypes of monogenic
diabetes are of particular interest in relation to our understanding of genetic susceptibility
to T2D: neonatal diabetes, which usually has an onset during the first 6 months of life, and
young-onset diabetes, where the disease is due to mutations in transcription factor genes
such as HNF1A, which codes for hepatocyte nuclear factor (HNF)-1α. Variation in this gene has
also been found to be associated with an increased risk of T2D.
A significant proportion of neonatal diabetes cases are due to gain-of-function mutations
in either KCNJ11 or ABCC8, which encode separate subunits of the pancreatic β cell
potassium channel. These mutations typically increase the potassium current through
the channel, preventing depolarization in response to glucose metabolism and impairing
insulin secretion. Polymorphisms in these genes are also associated with increased susceptibility
to T2D. The relatively recent finding that many cases of neonatal diabetes
were due to mutations in potassium channel genes was an important advance in terms of
treatment. These patients would previously have been treated with insulin on the grounds
that their disease was insulin-dependent, but they are now normally treated with a high
dose of an orally administered sulfonylurea that targets potassium channels directly.
Interestingly, another potassium channel gene, KCNQ1, has been found to be associated
with an increased risk of T2D.
Mutations in HNF1A are a common cause of young-onset diabetes. HNF1A encodes
HNF-1α, a transcriptional activator that regulates tissue-specific expression of a range
of genes, particularly in the liver and pancreatic islet cells. The mutations reported are
dominant mutations which result in deterioration of pancreatic β cell function including
the ability to secrete insulin. HNF1A is now well established as a genetic risk factor with a
modest effect in T2D. Mutations in a gene encoding a separate transcription factor with
homology to HNF1A, HNF1B, are also associated with young-onset diabetes and T2D.
There is considerable overlap between the genes associated with these monogenic forms
of diabetes and those associated with increased risk of T2D. The individual mutations
involved in the monogenic disease differ from the polymorphisms predicting suscepti-
bility to T2D and the overall phenotypes also differ in a number of respects. However,
this is different to the example of breast cancer considered in Chapter 9, where the genes
contributing to the familial disease do not appear to make a significant contribution to
susceptibility in sporadic disease.
Conclusions
In this chapter we have considered the results of genetic studies in one of the most com-
mon forms of disease in the developed world, i.e. diabetes mellitus (T1D, T2D, and
MODY). There are profound differences between the genes identified in T1D versus T2D,
corresponding to the different phenotypes for these two diseases. T1D is an autoimmune
disease (see Section 10.1) with all of the features that we would expect with autoimmu-
nity. There is a strong association with the MHC and especially, but not only, with HLA
class II DQB1. There is also a strong association with a number of other genes, especially
the insulin gene INS and a number of immune-regulatory genes, including CTLA4 and
IL2RA. Overall, the genetic studies support the hypothesis that this is an autoimmune
disease. There is significant overlap with many other similar autoimmune and immune-
mediated diseases. However, there are also some differences, e.g. the same HLA-DQB1
allele that is associated with protection from T1D is associated with an increased risk of
multiple sclerosis and narcolepsy. The picture presented is not a simple one and yet a high
proportion of genetic heritability (up to 70%) may now be explained by a relatively small
to moderate group of genetic risk factors that have a mix of very strong to moderate effect
sizes.
Compare this to the situation in T2D where the genetic risk factors are very different from
those identified in T1D. There are no associations with HLA in T2D and there are no
major associations with other immune-regulatory genes. This almost certainly reflects the
different pathogenesis of T1D compared with T2D. Furthermore, most of the identified
and confirmed risk alleles have small effects. The risk genes are different from T1D, but
there is some overlap with some of the early-onset monogenic forms of the disease, for
example the potassium channel genes KCNJ11 and KCNQ1.
Compared with T1D, where current estimates indicate a very significant portion of the
total genetic risk has been identified, in T2D only 10% of the genetic risk has been identi-
fied. However, this is not because there has been a lack of effort to identify the risk genes
in T2D. On the contrary, over 60 different candidate genes have been identified so far.
In the absence of strong associations it is going to be difficult to use the current data from
T2D studies in disease diagnosis or in patient management. Studies will continue and it
is likely that at some point, through a better understanding of the genetic components of
this disease, it will be possible to use this knowledge in patient treatment and management,
and to develop new treatments for this disease. When considered on a global scale, the
ability to offer individualized risk assessment for T2D using genetic information remains
an important target. Future studies will use a combination of new and old technologies,
and there will be an increased use of epigenetics to understand diabetes, especially T2D.
Genotyping has led us this far in many diseases, but phenotyping will produce the next
piece in the jigsaw. It is acceptable to be selective in deciding which genes to look at as
long as all of the potential candidates are considered at some time. Otherwise the genotyping
will have been a waste of time and the phenotyping work will fall into the same trap
that applied to early genetic association studies, i.e. rejection of positive associations, this
time not because of numbers or study design (as in the past) but because of an absence of
knowledge or of an obvious (simple) link.
Further Reading
Books
Armstrong L (2014) Epigenetics. Garland Science. This is a useful informative book for students wanting to understand the application of epigenetics to complex disease and epigenetics in general.
Holt RIG, Cockram C, Flyvbjerg A & Goldstein BJ (2010) Textbook of Diabetes, 4th edn. Wiley-Blackwell.
Parham P (2009) The Immune System, 3rd edn. Garland Science. This is an excellent basic immunology book for the student of human immunology including links to immunogenetics and some useful data on diabetes.
Wass JAH, Stewart PM, Amiel SA & Davies MC (2011) Oxford Textbook of Endocrinology and Diabetes, 2nd edn. Oxford University Press.
Articles
Ashcroft FM & Rorsman P (2012) Diabetes mellitus and the β cell: the last ten years. Cell 148:1160–1171.
Billings LK & Florez JC (2010) The genetics of type 2 diabetes: what have we learned from GWAS? Ann NY Acad Sci 1212:59–77.
Davies JL, Kawaguchi Y, Bennett ST et al. (1994) A genome-wide search for human type 1 diabetes susceptibility genes. Nature 371:130–136. This describes one of the first genome-wide genetic studies. This was a pre-genome map study and limited, but nevertheless inspirational.
de Miguel-Yanes JM, Shrader P, Pencina MJ et al. (2011) Genetic risk reclassification for type 2 diabetes by age below or above 50 years using 40 type 2 diabetes risk single nucleotide polymorphisms. Diabetes Care 34:121–125.
Drong AW, Lindgren CM & McCarthy MI (2012) The genetic and epigenetic basis of type 2 diabetes and obesity. Clin Pharmacol Ther 92:707–715.
McCarthy MI, Rorsman P & Gloyn AL (2013) TCF7L2 and diabetes: a tale of two tissues, and of two species. Cell Metab 17:157–159.
Murphy R, Ellard S & Hattersley AT (2008) Clinical implications of a molecular genetic classification of monogenic beta-cell diabetes. Nat Clin Pract Endocrinol Metab 4:200–213.
Pal A & McCarthy MI (2013) The genetics of type 2 diabetes and its clinical relevance. Clin Genet 83:297–306.
Polychronakos C & Li Q (2011) Understanding type 1 diabetes through genetics: advances and prospects. Nat Rev Genet 12:781–792.
Steck AK & Rewers MJ (2011) Genetics of type 1 diabetes. Clin Chem 57:176–185.
Todd JA, Bell JI & McDevitt HO (1987) HLA-DQβ gene contributes to susceptibility and resistance to insulin dependent diabetes mellitus. Nature 329:599–604. This is a landmark paper in understanding T1D and the relationship between HLA and disease.
Toperoff G, Aran D, Kark JD et al. (2012) Genome-wide survey reveals predisposing diabetes type 2-related DNA methylation variations in human peripheral blood. Hum Mol Genet 21:371–383.
Ueda H, Howson JMM, Esposito L et al. (2003) Association of the T-cell regulatory gene CTLA4 with susceptibility to autoimmune disease. Nature 423:506–511. This is a very important study on CTLA4 highlighting the complexity of association studies even when targeted to a specific gene.
van de Bunt M & Gloyn AL (2010) From genetic association to molecular mechanism. Curr Diab Rep 10:452–466.
Volkmar M, Dedeurwaerder S, Cunha DA et al. (2012) DNA methylation profiling identifies epigenetic dysregulation in pancreatic islets from type 2 diabetic patients. EMBO J 31:1405–1426.
Online sources
http://www.who.int/diabetes/facts/en
This is an excellent site for finding information on disease prevalence, incidence, morbidity, and mortality, and especially projections for the future.
In the previous chapters we have considered how, why, and what we hope to achieve by
investigating the genetics of complex disease, and used different example diseases to illus-
trate these points. Each chapter contains several examples of how common genetic variation
increases the risk of common genetically complex (i.e. non-Mendelian) disease. The issue of
common genetic variation being linked to common disease is critical in modern society. One
of the key elements in this consideration is ethics. In this chapter, we will consider different
ethical aspects of enquiry into complex disease with a brief overview of the general philoso-
phy behind ethical enquiry. We will consider lessons from the past, looking at how genetics
has been misused and why it is important to safeguard ourselves against the genetic inquisi-
tion. We will consider the use, and potential misuse, of complex disease genetics in a modern
society and why ethical guidance is so important in genetics. We will also consider the use of
genetic data, what is personal and confidential, and where the boundary of confidentiality
fits into the picture. Finally, we will consider who owns the genome, and the relationship
between commerce and academia. This chapter is designed to open the doors to ideas and
concepts not yet dealt with in other areas of the book; therefore, unlike the other chapters
in this book, this chapter is not crammed full of examples of genetic studies, except where
they are appropriate. Instead, it is designed to stimulate debate and reading for those with
an interest in ethical issues and in the history of genetics. The chapter includes some refer-
ences to the history of genetics that may be sensitive issues for some readers. Nevertheless,
it is thought important to include them. The chapter does not discuss the ethics of work on
animals. This latter omission does not reflect the views of the authors, but simply the fact
that the majority of the book reflects recent genetic research based on human subjects.
Others may say it is a step too far. Opinions will vary and it may be easier for us to ride
this issue out than become entangled within it, but online testing and application is part
of the future.
Figure 11.2: Class division in the UK around 1900. The figure illustrates the
division between the rich (upper classes) and the poor (lower classes) in the UK
around 1900. The members of the Eugenics Movement were convinced that poverty
was an inherited trait. [Figure labels: Upper–rich; Middle; Lower–poor.]
death camps of the holocaust (see below). He talks of how Galton supported the idea of
breeding from the best and sterilizing those individuals whose inheritance did not meet
with his approval and how the Eugenics Movement joined the gentle concern for the
unborn with a brutal rejection of the rights of the living. Eugenics was an idea shared by
many members of society from the political left to the far right. The list includes George
Bernard Shaw, Winston Churchill, and Charles Davenport (Professor of Evolutionary
Biology at Harvard).
New Germania
At the lowest level, isolated communities such as New Germania in Paraguay (which was
set up by Bernhard Förster and his wife, Elisabeth, the sister of the philosopher Friedrich
Nietzsche), were founded based on selected groups of individuals who set up home in
isolation from their brethren to maintain and develop more pure communities and
strengthen the stock. A glance at the peoples of this area today indicates that the experiment
failed, and the descendants of the original settlers are a poor and sickly population.
Unfortunately, that was not the limit of the application of this pseudo-science. In North
America, 25,000 Americans were sterilized because they might pass feeble-mindedness or
criminality to a future generation.
Figure 11.3: From Darwin and Mendel to eugenics: how science can be misused. The figure illustrates the
period of eugenics, and the main characters and events involved in the misinterpretation and application of
genetics. [Figure labels: Darwin; Mendel; Genetics; Galton; Haeckel; Eugenics Movement; Monist League;
New Germania.]
of male inmates with an extra Y chromosome (XYY). In fact, this condition occurs in
1:1000 males and is not associated with hyper-aggressiveness as originally suggested,
but is associated with a mildly reduced level of intelligence. The idea was seen once
again in 1993 with a report on the MAOA gene (chromosome X) which encodes monoamine
oxidase A. Monoamine oxidase A is an enzyme that plays a key role in the catabolism
of a range of neurotransmitters. The study reported a link between this gene and
criminal activity in a large Dutch family. Unlike the XYY story, this report has stood
the test of time with the caveat that aggressive behavior only occurs in those males car-
rying the polymorphism who have also been maltreated during childhood. Such was
public and scientific ethical concern about promoting the idea of a link between crimi-
nal behavior and genetics that a conference planned to discuss the idea was canceled to
avoid raising controversy that may have had public and societal consequences.
Eugenics has left a scar across the heart of genetics as a science—one that has been hard to
shake off. Given this, it is perhaps surprising that some of these screwball ideas have
lasted into the new millennium, but even today they occasionally surface, as Lone
Frank reports in her book My Beautiful Genome (2011). The message from the past is clear.
We must not allow this ideology to rear its ugly head in the future.
We need to consider several other ethical issues as more and more samples undergo
genome-wide scanning for clinical or personal enquiries:
Figure 11.4: Promises of the HGP. The figure shows three of the major promises of the HGP with respect to
complex disease. In each case, there is a need for expert counseling as the impact of risk alleles and haplotypes is
very different from that seen in Mendelian disease. [Figure labels: Human Genome Project; Improve knowledge of
disease pathology; Aid to diagnosis; Aid to patient care and management.]
identified a large number of risk alleles and this information is helping to advance our
understanding of the pathology of this disease.
The ethical issues concerning these promises
This is all very well, but what are the ethical issues arising from this? One of the major
problems with using genetic polymorphisms in disease diagnosis in complex disease is that
the polymorphisms are often common in the healthy population and have weak associa-
tions with the disease. Therefore, the value of the polymorphisms as diagnostic tools is
limited. Where they have larger effects, the presence of an allele may be a useful adjunct
in diagnosis and particularly in making a differential diagnosis between two diseases, e.g.
HLA-B*27 in ankylosing spondylitis (see above). However, we are faced with a dilemma.
If the allele is common we have to be careful not to over-promote the concept that the
allele is important in disease pathogenesis, because in most complex diseases possession
of the risk allele is neither necessary nor sufficient for the disease to occur. In some cases
there are strong genetic associations with very common polymorphisms, e.g. some human
leukocyte antigens (HLA), where up to 20% or more of the healthy population can carry
the risk allele or one member of the allele family. For example, the DRB1*04 family is the
most common group of the HLA-DRB1 alleles in the UK. This group is associated with
susceptibility and resistance to a range of autoimmune diseases, including rheumatoid
arthritis and autoimmune hepatitis (increased risk), and primary sclerosing cholangitis
(reduced risk). We cannot announce to the population at large that they carry a four- to
five-fold greater risk of a disease without a proper explanation of what this means. The
outcome of simply broadcasting risk values could create global (and unnecessary) panic.
The statistics need to be put into context.
When it comes to therapy there are some clear links between outcome measures and genetic
polymorphism. It is very helpful in some cases to perform genetic testing before a patient
is treated with a specific drug. In the science of pharmacogenetics, where the associations
are strong and the differences (i.e. in response to treatment) are clear, genetic testing will
be increasingly used prior to treatment. Here, the reason for using genetic information and
testing is obvious, e.g. consider the non-response to codeine (for pain relief) or to β-blockers
(for cardiovascular problems), or the adverse reaction to abacavir used to treat HIV-1 infec-
tion. However, where the strength of the association is weaker, but the outcome poten-
tially lethal (e.g. in the case of some adverse drug reactions), we find ourselves in an ethical
dilemma. To genotype or not to genotype, that is the question. If the incidence of a severe
reaction is 1:100,000 new cases per annum, do we genotype everyone before we prescribe
the drug or do we monitor all new cases for signs and symptoms of adverse reactions? It is a
societal question and one for the ethics committee rather than one that we can answer.
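The scale of the trade-off described above can be sketched with simple arithmetic. The short Python example below estimates how many patients would need to be genotyped to prevent one severe reaction. It reads the 1:100,000 figure from the text as the chance of a severe reaction per patient treated; the fraction of reactions attributable to the tested allele and the cost per test are hypothetical values chosen only for illustration, and the function name is our own.

```python
def number_needed_to_genotype(incidence, fraction_attributable):
    """Patients to genotype in order to prevent one severe reaction.

    incidence: probability that a treated patient has the severe reaction.
    fraction_attributable: share of reactions that occur in carriers of the
        tested allele and would be avoided by withholding the drug
        (hypothetical value, not taken from the text).
    """
    if not 0 < incidence <= 1 or not 0 < fraction_attributable <= 1:
        raise ValueError("both arguments must lie in the interval (0, 1]")
    return 1.0 / (incidence * fraction_attributable)


incidence = 1 / 100_000        # figure quoted in the text
fraction_attributable = 0.5    # assumption for illustration only
cost_per_test = 50.0           # assumption, arbitrary currency units

nng = number_needed_to_genotype(incidence, fraction_attributable)
cost_per_reaction_avoided = nng * cost_per_test
print(f"Genotype about {nng:,.0f} patients to prevent one reaction")
print(f"Testing cost per reaction avoided: {cost_per_reaction_avoided:,.0f}")
```

Changing the assumed values shows how quickly the balance shifts: a rarer reaction or a less informative marker increases the number of tests needed per reaction avoided, which is why the decision is one of policy rather than of genetics alone.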
Considering ethical issues is easier when it comes to understanding the disease pathogen-
esis. It is our human instinct that makes us question the world around us and our human-
ity that makes us want to solve the problems of disease. There is unlikely to be anyone who
would feel that we should not perform genetic enquiries to inform the debate on disease
pathogenesis. The question that we must be concerned with is how we handle the data and
use the outcomes from such research. Early publication of data can cause fear amongst the
“at risk” public.
Figure 11.5: The personal impact of knowing your own genotype. The figure illustrates five of the factors
that may be impacted by genetic testing. Individual responses to genetic testing can have negative effects on
self-confidence and on family members as well as those tested. In addition, there is potential for testing to have
a negative impact on insurance and healthcare costs depending on local rules and regulations. Social status can
also be impacted. However, there can also be positive impacts in all of these areas. [Figure labels: Knowledge of
genome; Family; Confidence and lifestyle; Social status; Access to insurance; Access to healthcare.]
waiting patiently (or not so patiently) for the postman to deliver the results of genetic
tests for common polymorphisms identified in complex disease. Her book gives a personal
view of how we ourselves may feel waiting for such results. She describes joy when low-
risk alleles are reported, but deep concern when higher-risk alleles are reported. Even with
counseling, the author reports being nervous on some occasions. Of course it depends on
what is being tested; consider, for example, BRCA versus the interleukin (IL)-2 receptor
gene IL2R. It depends on the size of effect and on the disease or diseases that are associated
with the polymorphism (or occasionally the mutation). It depends on your knowledge of
your family medical history, and your understanding of medicine and science. For both
the academic elite and the general public, the level of concern depends on how much we
understand the subject. Do we have an adequate grasp of the concepts before us? Do we
understand genetic risk? Even if we do understand the science there is no promise that
we will be better equipped to cope with the news. Therefore, it is also important to note
that when doctors become patients they deserve the same level of consideration and should
receive the same level of counseling as the less informed member of the public receives.
Patients being tested for diagnostic reasons and those making independent enquiries are
one group. But, what do we do about volunteers? Here the ethical dilemma is simple—
when testing samples, should we release data to the volunteers? The question may be sim-
ple, but the solution is far from it. We get around this by making all samples anonymous,
so that back-tracking to the original ID is not possible (Figure 11.6). However, this has
limitations. What if we find a polymorphism in our studies that identifies an allele that is
very strongly associated with a severe disease, but a disease that can be easily treated? If we
find that 2% of the healthy controls are carriers for the risk allele, then 2% of our healthy
control volunteers are at increased risk of the disease. As the DNA samples will be handled
as anonymous samples there is no possibility of alerting these volunteers to the risk and giv-
ing them prophylactic treatment. The same individuals may suffer the disease in later life.
Figure 11.6: Common (good) practice counseling in different groups. The figure illustrates the importance of
consent and counseling in all groups undergoing genetic testing. Those being tested for clinical purposes need to
give consent for testing and receive appropriate counseling prior to consent being given. Those requesting tests
out of self-interest should be treated in the same way. Those who are being tested as part of a research program
(i.e. volunteers) must also give consent and should also be counseled. However, normal practice in research is
to make all samples anonymous and therefore volunteers should not expect to receive information in return
or counseling after testing. In contrast, both clinical cases and those being tested out of self-interest should be
counseled following testing as well as before. [Figure labels: Genetic testing; Post-testing counseling and
information; Anonymous: no post-testing counseling or information.]
In complex disease it is always difficult to know how to handle the data, because the size
of effect is small and some of the genetic relationships are very complicated. Nowhere is
this better illustrated than in the major histocompatibility complex (MHC). The majority
of the genetic associations with this area involve odds ratios (ORs) of less than 10, though
values of between 0.02 and 150 have been quoted for different diseases. The problem is
that most HLA polymorphisms and haplotypes associated with disease are common in the
healthy population (often as high as 10–20% for the major allele families, e.g. DRB1*04).
In contrast, the diseases are not as common. This means that the majority of those with
the risk allele, family of risk alleles or haplotype do not develop the disease. Therefore,
being positive for an allele or haplotype is not useful in a diagnostic setting. However, it is
easy for the public to get the impression that these associations are diagnostic.
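The reasoning in this paragraph can be illustrated with a small calculation based on Bayes' rule. The Python sketch below estimates what proportion of carriers of a common risk allele actually have the disease. The carrier frequency in healthy individuals is taken from the 10–20% range quoted above; the carrier frequency in patients and the disease prevalence are hypothetical values chosen for illustration and do not describe any specific HLA association. The function names are our own.

```python
def carrier_disease_risk(freq_cases, freq_controls, prevalence):
    """Probability that a carrier of the allele has the disease (Bayes' rule)."""
    carriers_with_disease = freq_cases * prevalence
    carriers_without_disease = freq_controls * (1.0 - prevalence)
    return carriers_with_disease / (carriers_with_disease + carriers_without_disease)


def odds_ratio(freq_cases, freq_controls):
    """Odds ratio for carrying the allele in cases versus controls."""
    return (freq_cases / (1.0 - freq_cases)) / (freq_controls / (1.0 - freq_controls))


freq_controls = 0.15   # within the 10-20% range quoted in the text
freq_cases = 0.50      # assumption for illustration only
prevalence = 0.005     # assumption: 1 in 200 people affected

risk = carrier_disease_risk(freq_cases, freq_controls, prevalence)
print(f"Odds ratio: {odds_ratio(freq_cases, freq_controls):.1f}")
print(f"Carriers who have the disease: {risk:.1%}")
print(f"Carriers who do not: {1 - risk:.1%}")
```

Under these assumed values the odds ratio is well above 1, yet the great majority of carriers remain unaffected, which is the sense in which a strong association is not a diagnostic test.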
ask to know their HLA type. On this occasion time passed without any issue until one
day the volunteer rounded on the immunogenetics team declaring their anger that they
had not been told they carried an allele that promotes ankylosing spondylitis. The team’s
defense, that though the association between ankylosing spondylitis and HLA-B*27 is
strong it is not found in all cases, and more importantly only approximately one in 20
HLA-B*27-positives develops the disease, was no comfort. The volunteer had recently
discovered that a younger member of their family, also an HLA-B*27 carrier, had developed
ankylosing spondylitis. The team were not even able to defend their position with
the observation that the volunteer had not indicated a history of ankylosing spondylitis
and was over the normal age at which it would develop. Classical geneticists would
use two words: anonymity and counseling. Anonymity may have helped here, but prob-
ably not. Rigorous refusal to disclose the original data would not have been easy in
these circumstances. Counseling would have helped, but then that means either all those
undergoing genotyping would need to be informed of every HLA association that had
been reported and the risk with diseases before testing, or that all those with any high
risk alleles and haplotypes would require counseling after being typed, or both. If we
restricted counseling to the second group, when we consider HLA genotypes this would
mean almost everyone as nearly all the common HLA allele families have some positive
risk associations with one or more diseases. Considering HLA-B*27-positives alone, out
of 20 cases given counseling, only one would really need it. The effect in terms of stress
for the remaining 19 would be enormous. This would be a huge burden and represents
an impossible task.
Interestingly, Francis Collins reports in his book The Language of Life: DNA and the
Revolution in Personalized Medicine (2010) that he had himself tested by three companies
for a number of well-known risk alleles. He said “there was one test result that I
thought about just not looking at … the one for Alzheimer’s disease risk.” (We should
note Collins does not specify which Alzheimer’s gene.) Given that we are talking about
one of the best-informed members of the scientific community with regard to genetics
and disease, this is perhaps something of a surprise and it poses the question for the community
at large. However, in considering his statement we need to remember that as yet
there is no cure for Alzheimer’s disease and that may be driving the concern expressed
here. Certainly the issue of whether there is a cure for this disease, or any other disease,
is one that is likely to affect an individual’s choice of whether to be tested at all and if so
which genes to be tested for.
The immediate impact of reduced confidence may be stress (as above), but it may also
lead to changes in lifestyle, not all of which may be beneficial. Some individuals when
diagnosed with chronic disease become determined to defeat the illness. They become
champions in the crusade against the disease, getting involved in fundraising events and
even starting new charities. Many will become more self-aware, and diet and exercise
may become higher priorities than previously. However, not everyone reacts positively,
nor are increased activity and dieting the right path for everyone. Though being told you
have an increased risk of a common disease is not the same as being diagnosed with the
illness, some members of society will react to this type of information as though it were.
In extreme cases, some become withdrawn, almost developing a Munchausen’s-like
state. This is why and where we need to consider the ethical side of the genetics of complex
disease. We need to reduce and avoid this problem, and avoid sending out a negative
message. Counseling is essential.
Family issues
When genetics is mentioned most people think of familial disease, but this is not always
the case in complex disease. Very few complex diseases have large numbers of affected
families and even though there is increased risk for family members, it is not on the scale
of risk associated with Mendelian disease. However, knowledge of genetic susceptibility
can have multiple effects in a family. Even though selected individuals may be aware of the
impact of the risk allele, and its biological and personal significance, there is certainly no
possibility that all family members will be aware. Suddenly we may find family members
selling up and cruising round the world, living every day as if it were their last, only to
run out of money and find themselves impoverished with another 30–40 years of life to
go. Others may take more radical action. However, not everyone will react in this way and
radical responses can be prevented with careful counseling.
One other family issue for more immediate consideration may be in the choice of a mate
or the decision to start a family. It is one thing to be aware of your own risk of a complex
disease; it is another thing to know you are likely to pass that risk on to your offspring.
If you are made aware that you carry an increased risk of a particular disease, you may
decide not to have a family. Some people may decide not to marry into families with a high
risk of certain common diseases and others may decide to adopt rather than have chil-
dren themselves. With a growing world population, some might say this is a good thing.
However, when we say this we are ignoring the fact that the reason for the decision may
be completely wrong. The issue is risk and size of risk. In complex disease there is incom-
plete penetrance of the allele in the disease and having the risk allele is neither necessary
nor sufficient for the disease to develop. The size of the risk varies between diseases, and
between risk alleles and groups of risk alleles. For example, the Ueda et al. (2003) paper
on CTLA4 gene polymorphism suggested the maximum OR (risk) in type 1 diabetes was
1.2 and in Graves’ disease was closer to 1.6. Neither of these is particularly high. Would
you consider not marrying your bride-to-be on the basis of having the CTLA4 risk allele?
How high does the risk need to be? As our knowledge grows, we will be better equipped
to offer counseling and help individuals make informed decisions. These decisions should
not be made without careful counseling. It is also important to be aware that the impact
of this information may be different in different societies and ethnic groups.
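To see why odds ratios of this size are modest, it helps to convert them into absolute risks. The Python sketch below does this for the two odds ratios quoted above. The baseline lifetime risk in non-carriers is a hypothetical value chosen only for illustration (the same value is used for both diseases for simplicity), and the function name is our own.

```python
def risk_with_allele(baseline_risk, odds_ratio):
    """Convert a baseline risk and an odds ratio into the carrier's absolute risk."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    carrier_odds = baseline_odds * odds_ratio
    return carrier_odds / (1.0 + carrier_odds)


baseline_risk = 0.004  # assumption: 4 in 1000 non-carriers develop the disease

# Odds ratios as quoted in the text from Ueda et al. (2003)
for disease, quoted_or in [("type 1 diabetes", 1.2), ("Graves' disease", 1.6)]:
    carrier_risk = risk_with_allele(baseline_risk, quoted_or)
    extra_per_1000 = (carrier_risk - baseline_risk) * 1000
    print(f"{disease}: OR {quoted_or} -> carrier risk {carrier_risk:.2%} "
          f"({extra_per_1000:.1f} extra cases per 1000 carriers)")
```

With a low baseline risk, even the larger of the two odds ratios leaves a carrier's absolute risk below 1% under these assumptions, which is the kind of context that counseling needs to convey.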
There are of course broader issues when it comes to genetics and families. Though we are
concerned with complex disease, we cannot ignore the other uses of genetic testing. In
the third millennium tracking our kindred through multiple generations has become a
worldwide business, not just because of links with disease, but also because of links with
our history. However, understanding our ancestry can be used in both positive and nega-
tive ways. Paternity testing can be used to test for fidelity. There are strong financial and
legal reasons why testing is often requested by the mother or father of a child. For example,
fathers who prove to be surrogates as a consequence of such testing may withdraw their
support for the mother’s child or children. Ultimately it is the child who suffers.
On a more positive note, historical searches can reveal interesting details about our ances-
tors. In the USA, more than 50 companies are engaged in genetic genealogy studies.
Companies such as African Ancestry look specifically for ethnic and geographic matches
between mostly American clients and those of racial and tribal groups from various parts
of Africa. However, not all reports are confirmed and in some cases initial findings have
turned out to be incorrect. At least one high-profile case came to light recently where the
wrong link was proposed. Cases such as this illustrate the need for counseling here too.
Overall, we cannot get away from the need to inform the patient, client, or individual
fully about the consequences of testing as well as the strengths, weaknesses, and limita-
tions of any testing.
One of the most difficult issues in genetic testing is that of prenatal testing. Different coun-
tries have different laws regarding prenatal testing and the consequences of testing. In addi-
tion, opinions vary between different social, ethnic, and religious groups within societies.
In complex disease, genetic testing is not currently used to predict the health of the unborn
and indeed this would be an error. Due to low penetrance, the risk of disease is most often
very low and there is no justification for using such testing. However, we need to make
sure that this type of idea does not sneak in through the back door. Prenatal testing
for male status is an example of bad practice, highlighted in Steve Jones’s
book The Language of the Genes. Termination of pregnancy based on testing for the sex
chromosomes is now less likely than it was 10 years ago, but undoubtedly it still occurs in
some countries and there is much written about this on the web and in papers cited there.
Social status
The societal impact of genetics in complex disease should not be underestimated. Knowledge
of our genomes and how they may determine our lifestyle, health, even our wealth and
happiness can have a major effect both on our own view of our social standing and on
how others view us. One recent report questioned the value of knowing our genome, with
a gloomy prediction that the cost to our collective mental health is incalculable. Others
disagree, saying most people would not be burdened. Opinions are divided.
Just as the Victorian and Edwardian Eugenics Movement was set against the survival of
the unfit, so too can modern society suddenly turn upon those who are seen to be less able.
Anyone with a disability will tell us that for them life is made more difficult through the
actions and attitudes of some members of society. In the context of this book we are not
considering the illness per se. We are considering just the potential impact of carrying the
knowledge of an increased risk of a common trait.
Some people when given bad news will develop a coping strategy based on ignoring the
information (or denial) or on accepting the information and using it to their benefit.
Some court sympathy. Others are activated and become associated with those in similar
circumstances. Other people when given bad news collapse. They go into a mental
meltdown of variable proportions and find it hard to cope with the information. Some
become hypochondriacs. In society as a whole, it is easier to cope with those who have a
positive response to a situation. No one wants to spend time with negative individuals.
However, it is not easy to maintain a positive response.
Of course, with high-risk genes such as BRCA1 and BRCA2, which are linked to breast cancer
and also to ovarian cancer, many patients who test positive opt for preventative surgery
rather than risk the illness itself. At its most extreme this can mean having both breasts and
ovaries removed. The personal consequences of this are considerable, with prolonged sur-
gical and medical treatment, post-surgical medication, and potentially early menopause.
The personal and physical impact of all of this together can be quite considerable and the
individual’s societal status can change. There is a question about whether preventative
surgery is appropriate. This question is one that only the individual is equipped to answer.
It can be very positive or it can be very negative. On one hand, an individual may take the
view that they can deal with the problem in a positive way through surgery; on the other
hand, a different individual may take the view that surgery is the only option (this may be
seen by some as a negative response).
How we handle our genetic portfolio is going to be one of the most interesting challenges
of the post-genome era. We need to understand the implications of risk and accept our
situation. We cannot change our genome. Everyone will carry some alleles that protect
against common traits and others that predispose to them. The genome information
itself is not the same as having the trait. Society can relax and we can relax within
it. There is no great societal threat and no threat to our social status unless we allow it
to happen. We need to avoid creating a genetic underclass as in Huxley’s Brave New
World, whereby the human race is divided into subgroups, with the upper class being the
alphas and the lower classes being the betas, etc., all the way down to the epsilons at the
bottom of the (genetic) caste system. To those who scoff at this idea, look at the past. A
mathematician will tell us that by definition, “if anything is possible” then “anything”
includes this possibility.
Access to healthcare and health insurance
One of the downstream consequences of knowing our genomes is that we could face a
problem with healthcare planning and cost. In the UK, the majority of individuals cur-
rently rely on the National Health Service (NHS), which provides healthcare free at the
point of entry for all members of the population. The NHS is paid for through Tax and
National Insurance contributions. Though there is some debate about this provision and
who has rights to automatic NHS treatment, at present no one is suggesting that access
be based on genetic testing. In addition, no one is suggesting that genetic tests be used to
determine how much each individual pays in Tax or National Insurance. That
does not mean this will always be the case, although such a change is unlikely. Other countries
have different systems of healthcare, many based on a personal subscription and opt-in
insurance schemes. It is easy to see why a privately funded system is more likely to be
interested in an individual’s genetic portfolio.
Of more concern is the potential application of genetic testing for health and life insur-
ance. Currently in the UK, genetic information does not have to be given to insurance
providers and they are prohibited from asking customers to provide such information. The
situation in the USA is different as the health system is not based on the same plan as the
NHS in the UK. Instead, the vast majority of health provision is through private health
companies. Practice also varies elsewhere and as most insurance companies are global or at
least international, access to genetic data needs to be carefully monitored. In the future it
may be seen as reasonable (by some) for genetic data to be used to set insurance premiums.
Insurance companies understand risk and work with risk models. They are well equipped
to assess the likelihood of adverse circumstances on a personal basis given the appropriate
information. Information on drinking, smoking, and other health issues is routinely gath-
ered when setting up policies. Extending the list to include genes would be a simple step.
Overall, this is an ethical issue. Currently there are constraints in the UK and in some
other countries, and genotype information does not need to be provided on request. In
the USA, for instance, legislation is in place to outlaw the discriminatory use of predictive
genetic information in health insurance and the workplace. However, that does not mean
that the elevated risk of disability or illness cannot be used against an individual when
setting insurance costs. Privacy is a major issue and defending our individual freedom is
important, i.e. we need to defend our rights. These include the right to equality and free-
dom in the matter of our genome. Allowing this very minor misuse of genetics could be
seen as the first tiny step on the descent into madness.
Other issues
Imagine there was a gene for addiction, fidelity, sexual orientation, intelligence, or
spirituality. In each of these cases there may well be genes that predispose to these
traits. Each gene may carry a bad allele and a good allele. The question is do we want
to know the answer about which alleles we carry. How would we use the data? The
answer lies in each example. These are difficult issues. With regard to sexual orienta-
tion, there are those who wish to be able to say their sexual orientation is all (or partly)
in their genes, and others who want to be able to say it is a matter of choice and genes
have nothing to do with it. Underlying this are often different political agendas and
there are implications whichever explanation one applies. Spirituality is another diffi-
cult issue; again, there are some for and against a genetic explanation for this. Personal
agendas apply. None of the above are illnesses but all of them can influence behavior
and for that reason alone all are worthy of consideration in this chapter. These are both
individual and societal issues.
However, more recently the parent company deCODE (the major research base for the
company) has been purchased by Amgen. The service offered by deCODEme may not be
offered in future. The problem for commercial companies like deCODE and deCODEme
is that they need to generate payback from their investments. From a pure academic point
of view this can be a negative relationship, though individuals have different opinions
(Figure 11.7).
Many academics would rather see genomes mapped by university research laboratories
than in commercial companies. Unfortunately, there is no longer any option to exclude
commerce. Commerce has provided the technology for the great advances we have made.
Commerce has been involved frequently at the center of all we have done. We do not
exist in ivory towers in isolation any more. Our materials, apart from the DNA samples
we have collected, are all purchased. The equipment we use has been invented and devel-
oped by commerce, and today it is cheaper to contract out genotyping for all but a few
genes than it is to genotype in-house. As a consequence, we are inevitably tied to com-
merce as commerce is tied to us. Outsourcing or contracting out is going to be a major
part of the future.
Once more there are ethical issues, but there is no problem provided the anonymity of
the material sent out is maintained. This needs to be very strictly governed and assessed to
ensure compliance with ethical standards.
Do I own my genome?
Accepting that commerce is part of the current package, we need to know if our genomes
are our own possession or belong to another. If we allow others to genotype or sequence
our genomes at no cost it is possible that the sequence/genotype produced will belong
to the group who did the work. This could be a commercial company or it could be a
research group. However, just as with some tumor cell lines that have been grown in
laboratories for many years, if the research group or commercial company failed to get
permission or explain the situation clearly and obtain written consent, then this could be
contestable. Different countries have different regulations on this and it could be costly to
contest such an issue.
It is certainly true that a number of the genetics companies would like to develop geno-
typing techniques for the detection of common risk genes with large effects and thereby
be able to offer a commercial genetic testing service. Some companies concentrate on
just a few genes, e.g. Myriad Genetics (Salt Lake City, USA) holds a patent covering
testing of the BRCA genes. Others, such as deCODEme (Iceland) and 23andMe (USA),
will assess 500,000 to 1 million single nucleotide polymorphisms (SNPs) and provide us
with a personal profile, but they have not yet obtained a patent for the genes. Myriad
Genetics not only tests the genes, but in February 2013 it also won a court case
enabling it to patent a number of cancer genes in Australia. Appeals against this ruling
have been dismissed. Protesters in Australia say that this should not be allowed
and that it could have a major effect on future research in cancer. The depth of debate about
patents is reflected in Luigi Palombi’s book Gene Cartels: Biotech Patents in the Age of Free
Trade (2009). The author states that “no matter how important it is to identify a gene
linked to a disease, it’s still not something that Myriad or anyone else has invented;” he
goes on, “Politicians must not change the law to prevent patenting of genetic materials”
(http://www.bloomberg.com/news/2013-02-15/).
The American Association for Molecular Pathology and the American College of Medical
Genetics have been quoted as saying they are worried that the company is trying to get
legal ownership of part of the human body. However, the situation in the USA is quite dif-
ferent. A Supreme Court ruling in June 2013, authored by Justice Clarence Thomas, states
that naturally occurring DNA segments are not patentable. This will apply to a vast range
of testable genes. However, the ruling permits edited forms of genes not found in nature to
be patented. In the USA there are two major implications from this ruling. First, genetic
testing will become cheaper as companies will compete in the testing market. Second, the
cost of whole genome sequencing is likely to fall, because patents on naturally occurring
sequences in the genome no longer restrict it and whole genome sequencing is therefore
unlikely to infringe patents.
One of Myriad’s fact sheets states that under USA patent law:
“No-one can patent anyone’s genes. Genes consist of DNA that is naturally occur-
ring in a person’s body and as products of nature [they] are not patentable. In order
to unravel the mysteries of what genes do, researchers have had to separate them
from the rest of the DNA by producing man-made copies of only that portion of
the gene that provides instructions for making proteins (only about 2% of the total
DNA in your body). These man-made copies, called “isolated DNA,” are unique
chemical compositions not found in nature or the human body. The U.S. Patent
and Trademark Office has been granting patents on isolated DNA to universities,
hospitals, patient advocacy groups and companies for over 30 years.”
Currently this is correct. There have been efforts to overturn some of these laws in
the USA, but so far they have failed.
Some scientists argue that their work has been affected because of the costs of pay-
ing royalties and in some cases they have even received letters demanding they stop
using patented inventions. In response, the biotech companies may reply arguing that
if they cannot protect their inventions, then they cannot compete with others in the
market place.
Conclusions
Looking at the genetics of complex diseases from a social point of view allows us to con-
sider the ethical issues associated with it. Ethics is best defined as the fulfillment of well-
being. Some would suggest this is not part of science and argue for the pursuit of truth
at any price. Others would argue that ethics is part of science, and that we should apply
more rigid rules and laws. When it comes to the genetics of complex disease, ethics refers
to the issues around testing, counseling, data handling and storage, publication, and access
to data.
Genetics has been misused in the past and for that reason there is a higher level of sus-
picion regarding genetics than for many other branches of science. Through the dark
pseudo-science of eugenics, genetics became attached to some of the darkest chapters in
human history. Eugenics illustrates the misuse of science extremely well. It also reminds us
not to allow our science to be misused in the future.
Given the promises of the HGP, it is important to understand how ethical and soci-
etal issues are themselves composite parts of the puzzle. Using genetics in diagnosis is a
key example. Should we, or should we not, test? Using genetics in disease treatment and
patient care also raises the same issues. Testing can be very advantageous for a patient for
whom a new treatment is seen as having a high likelihood of a good response, but it can
be to the disadvantage of a patient who, after testing, finds out that they are genetically
less likely to respond to a newly developed potent drug. No one would argue that using
genetic data to understand disease pathogenesis is not worthwhile. However, if as part of
the testing risk alleles are identified in healthy individuals we are once again faced with an
ethical and societal dilemma. We need to be aware of personal and societal issues relating
to confidence, confidentiality, social status, access to healthcare, and health insurance in
relation to all of these promises.
What is the relationship between the individual, commerce, and public bodies? Who owns
the genome, who owns an individual’s DNA sequence, and who should be able to access
an individual’s personal genetic data? Science and commerce are strongly interlinked in
genetics—more so now than ever before. Separating them is not possible, but protecting
one’s data is. Laws and regulations vary in different countries; however, most countries
currently do have some form of regulation to restrict access, especially the USA and UK.
All in all, the ethical issues are relatively straightforward in complex disease research at
present. There are rules and guidelines in the UK; rules on consent and the storage of
human tissue (e.g. the Human Tissue Act) are designed to prevent the misuse of materials
and enable the best practice to be followed. In the USA, it took 12 years to get a bill passed
that prohibited DNA being used to discriminate against individuals. The bill, the Genetic
Information Nondiscrimination Act (GINA), was signed in the Oval Office on 21 May 2008.
Ethics is a philosophical activity, and therefore views and ideas change, and the debate
about the ethics of enquiry will continue. New challenges will arise with new procedures
and new opportunities. We need to maintain a balanced opinion. We have seen from the
past what happens when that balance is lost and the extremists have free rein. We cannot
allow that to happen again. Nor should we stifle scientific advancement. Simple measures,
such as informed consent, not only for invasive research but also for data collection, are
important. In many ways the ethics of scientific investigation is far more advanced than
other areas of enquiry and altogether this is a very positive sign for the future.
Further Reading
Books
Collins F (2010) The Language of Life: DNA and the Revolution in Personalised Medicine.
Harper Collins. This is an excellent book to delve into to look at the whys and wherefores
of gene hunting in complex disease.
Frank L (2011) My Beautiful Genome: Exposing Our Genetic Future, One Quirk at a Time.
Oneworld. This is an excellent book for the lay public and scientist alike. It is
interesting, controversial, and humorous, and perfect for the student of complex disease
genetics.
Jones S (2000) The Language of the Genes, 3rd ed. Harper Collins. This book has a very good
introduction for those with an interest in the history of genetics.
Maynard-Moody S (1995) The Dilemma of the Fetus: Fetal Research, Medical Progress and
Moral Politics. St Martins.
Palombi L (2009) Gene Cartels: Biotech Patents in the Age of Free Trade. Edward Elgar. This
book by patent law academic Luigi Palombi is a timely contribution to heated global debates
about the ownership of human genes.
Parham P (2009) The Immune System, 3rd ed. Garland Science. Though the general subject of
this book is immunology, it is a good source for those interested in immunogenetics and the
impact of MHC alleles/haplotypes in complex disease.
Spector T (2012) Identically Different: Why You Can Change Your Genes. Weidenfeld &
Nicholson. This book champions the subject of epigenetics in complex traits and provides
a different perspective for the scientist. The study of identical twins represents a gold
standard for geneticists, but it also presents a challenge because they are rare and because
they are genetically identical at birth. Therefore, testing inherited polymorphisms is of
little value, but identifying concordance levels between identical versus non-identical
twins is valuable. Novel sequences (new mutations) are informative and epigenetic changes
are also informative. In the era of sequencing, more use may be made of identical twin sets.
Spector also tackles a series of occasional controversial topics in this book, some of
which are covered briefly in this chapter.
Strachan T, Goodship J & Chinnery P (2014) Genetics and Genomics in Medicine. Garland
Science.
Tollefsen CO (2008) Biomedical Research and Beyond: Expanding the Ethics of Inquiry.
Routledge. This book provides insight into the ethical basis of scientific enquiry from a
philosophical perspective.
Articles
Ueda H, Howson JM, Esposito L et al. (2003) Association of the T-cell regulatory gene CTLA4
with susceptibility to autoimmune disease. Nature 423:506–511. This is a very important
study on CTLA4 highlighting the complexity of association studies even when targeted to a
specific gene.
Online sources
http://23andme.com
This site offers genotype information about health and ancestry for a relatively low cost
with a rapid response. The site enables the reader to see what is offered commercially by
such companies.
http://www.decode.com
The correct current site for deCODEme is that shown above. deCODE is part of the Amgen
company.
http://www.myriad.com
Myriad Genetics is a company from Salt Lake City, USA with interesting patents on a number
of cancer genes. They offer testing for a variety of common cancers and are working on
genetic diagnosis for a number of other conditions. Their patent rights are a matter of some
controversy at the time of writing, and serve as an example of the division between academia
and commerce.
http://www.gtglabs.com
The Genetic Technologies website offers a variety of services for genotyping common diseases
in humans and animals. Another informative site that illustrates the potential for commercial
exploitation of genetic research, whatever our own individual views are.
http://www.hta.gov.uk
This site contains details about the Human Tissue Act (or links to) and is updated regularly.
http://www.patient.co.uk/doctor/Medical-Ethics.htm
This site and the one above are two sites for the UK Human Genetics Commission: one deals
In this chapter, we will discuss the present and future of complex disease genetics, look-
ing at what needs to be done to bring us closer to meeting the promises of the Human
Genome Project (HGP). Among the many promises of the HGP were the future use of
genetics in disease diagnosis, disease treatment, patient management, the development of
novel therapies (including new personalized therapies), and a better understanding of the
pathogenesis of the genetics of common complex non-Mendelian disease. Throughout
this book we have consistently focused on these points and the extent to which these out-
comes have or have not been achieved.
Many of the techniques and technologies discussed in this chapter are also discussed in
previous chapters of this book, for example genome-wide association studies (GWAS).
The purpose of this chapter is to look into the future, and consider how techniques and
technologies such as next-generation sequencing (NGS), genotyping the transcriptome,
the expanding field of epigenetics, the use of imputation analysis, the future for GWAS,
and the value of the HapMap, 1000 Genomes Project, and ENCODE may be applied to
further advance our understanding of complex disease. In addition, we will also consider
metagenomics as applied to the bacteria in our guts.
[Figure 12.1 panels: (a) dNTP, with an OH group at the 3′ position; (b) ddNTP, with H (no OH) at the 3′ position.]
Figure 12.1: Decoding the DNA sequence during DNA replication using ddNTPs. The figure shows the
structure of (a) the normal substrate dNTP, which carries a hydroxyl group, and (b) the abnormal substrate ddNTP,
which does not carry the hydroxyl group. The interaction with the dNTP allows the DNA strand to grow, whereas
the interaction with the ddNTP substrate terminates the growth.
(ddNTP) during the synthesis of a DNA strand. The structures of dNTP and ddNTP
are illustrated in Figure 12.1. Dideoxynucleotide triphosphate molecules are identical
to deoxyribonucleoside triphosphates (dNTPs), except that they lack a hydroxyl group
(-OH) at the 3′ end. Sanger reasoned that if ddNTP is added to a growing DNA strand,
the strand can no longer grow because the dideoxyribonucleotide is missing the 3′ hydroxyl
group. The absence of a hydroxyl group prevents the formation of a phosphodiester bond
between the DNA and the new nucleotide. Consequently, after ddNTP has been incorpo-
rated into the DNA strand, no more nucleotides can be added and thus the incorporation
of ddNTPs terminates DNA synthesis.
In 1986, automated DNA sequencing methods based on the Sanger method were devel-
oped. These machines used fluorescent dyes to label each ddNTP, allowing the sequencing
reactions to be quickly monitored. The mechanics of Sanger sequencing are simple and
they are illustrated in Figure 12.2. First, the target DNA molecule is denatured into two
single strands by heating. A primer or a short fragment of DNA is then used to trigger
the sequencing reaction. The primer anneals adjacent to the sequence of interest and is
extended by a DNA polymerase. During the extension reaction the sequence chain is ter-
minated by the random incorporation of fluorescently labeled dideoxynucleotides, which
have complementary identity to the base on the opposite strand. Each ddNTP is labeled
with a different color fluorescent dye corresponding to one of the four bases (A, T, C, and
G). Once the extension reaction has been terminated, the resulting mixture containing
fluorescently labeled DNA strands of varying length is separated by capillary electropho-
resis. The final results are read as fluorescent peaks and are used to determine the DNA
sequence.
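The chain-termination logic described above can be sketched as a small simulation. This is a toy model under stated assumptions, not a description of any real instrument: the function names, the 5% ddNTP incorporation probability, and the number of template molecules are all hypothetical choices made for illustration, and the template is written in the order the polymerase reads it so that the new strand is simply its base-by-base complement.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_fragments(template, ddntp_fraction=0.05, n_molecules=5000, seed=0):
    """Simulate chain-terminated extension products for one template strand."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_molecules):
        strand = []
        for base in template:
            strand.append(COMPLEMENT[base])
            # With a small probability the incorporated base is a labeled
            # ddNTP, which lacks the 3' OH and stops further extension.
            if rng.random() < ddntp_fraction:
                fragments.append("".join(strand))
                break
    return fragments


def read_sequence(fragments):
    """Order fragments by length (electrophoresis) and read each terminal label."""
    last_base_by_length = {}
    for fragment in fragments:
        # The dye on the terminating ddNTP identifies the final base.
        last_base_by_length[len(fragment)] = fragment[-1]
    return "".join(last_base_by_length[n] for n in sorted(last_base_by_length))


template = "GCTAACCGATCA"
called = read_sequence(sanger_fragments(template))
print(called)
```

Because every fragment length is represented many times over among the terminated molecules, reading the terminal base of each length class in order of size recovers the complement of the template, which is the principle behind the ladder of fragments in Figure 12.2.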
Sanger sequencing was used to map the first human genome; it has a base accuracy of
approximately 99.9% and can sequence DNA fragments up to 1 kb in length. Sanger
sequencing is used clinically to identify mutations in selected Mendelian disease genes.
However, the sensitivity of testing using the Sanger technique (a variant generally needs to
be present in an estimated 10–20% of the sample DNA to be detected) may be insufficient for
the detection of somatic gene mutations in solid
tumors and in acute leukemia, and for the characterization of complex microbiological
specimens. In addition, it is not able to achieve efficient high throughput for analyzing
complex diploid genomes at low cost. Since the late 1990s, researchers in both academia
and industry have revisited DNA sequencing methods, creating a new generation of DNA
sequencing methodologies—the so-called NGS technologies.
[Figure 12.2 panels: (a) terminated fragments of increasing length, from C to CGATTGGCTAGT; (b) separation of the fragments.]
Figure 12.2: Sanger sequencing. The figure illustrates the principles of Sanger sequencing. (a) DNA is denatured
into two strands by heating and is treated with a primer to trigger sequencing. The primer anneals with the
DNA sequence of interest and is extended by a polymerase enzyme. The sequence chain may be terminated at
any time during this extension phase by the addition of random fluorescently labeled ddNTPs. Each ddNTP is
labeled with a different color corresponding to one of the four nucleotide bases. (b) Once the extension reaction
has been terminated, the DNA strands of varying length are separated by capillary electrophoresis and the
fluorescent peaks can be used to interpret the sequence. (From Thomas A [2015] Introducing Genetics, 2nd edn.
Garland Science.)
These new sequencing machines all have the capacity to deliver significantly cheaper
and faster sequencing, and are having a significant impact in bioscience. Most NGS
technologies do sequencing in parallel, which means that hundreds of thousands or
even millions of DNA fragments are simultaneously sequenced, resulting in very high
throughput levels. It is likely that in the future NGS will become more widespread and
better established. A comprehensive technical overview of each platform is beyond the
scope of this book, but we will look briefly at the fundamentals of a number of them.
However, as this field is rapidly advancing, readers are advised to refer to the different
manufacturers’ websites and other sources referenced in this chapter for the most
up-to-date information on these technologies.
In each of the cases described below, the output of genome sequencing is not the whole
DNA sequence as a single item. Instead a series of sequences of short DNA fragments
called reads is created. These need to be assembled using bioinformatics software in order
to determine the complete sequence of the whole DNA sample. The bioinformatics step
is challenging and it needs to be performed with powerful computers able to handle the
very large amounts of data generated by the sequencing platforms. A comparison of some
of the different platforms can be found in Table 12.1.
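The idea of assembling reads into a longer sequence can be illustrated with a greedy overlap merge. This is a minimal sketch, not how production assemblers work: real software builds overlap or de Bruijn graphs and must cope with sequencing errors, repeats, and reads from both strands. The function names, the example reads, and the minimum overlap of three bases are assumptions made for illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left; the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["CGATTGG", "TTGGCTA", "GCTAGT"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these three error-free reads the merge yields a single contig. On real data the pairwise comparison shown here would be far too slow and too error-prone, which is one reason the bioinformatics step needs specialized software and substantial computing power.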
| | Roche/454 | Life Technologies SOLiD | Illumina HiSeq 2000 |
|---|---|---|---|
| Library amplification method | Emulsion PCR on bead surface | Emulsion PCR on bead surface | Enzymatic amplification on glass surface |
| Sequencing method | Polymerase-mediated incorporation of unlabeled nucleotides | Ligase-mediated addition of two-base-encoded fluorescent oligonucleotides | Polymerase-mediated incorporation of end-blocked fluorescent nucleotides |
| Detection method | Light emitted from secondary reactions initiated by release of pyrophosphate | Fluorescent emission from ligated dye-labeled oligonucleotides | Fluorescent emission from incorporated dye-labeled nucleotides |
| Post-incorporation method | Not applicable (unlabeled nucleotides are added in a base-specific fashion, followed by detection) | Chemical cleavage removes fluorescent dye and 3′ end of oligonucleotide | Chemical cleavage of fluorescent dye and 3′ blocking group |
| Error model | Substitution errors rare, insertion/deletion errors at homopolymers | End of read substitution errors | End of read substitution errors |
| Read length (fragment/paired end) | 400 bp/variable length mate pairs | 75 bp/50 + 25 bp | 150 bp/100 + 100 bp |

From Mardis ER (2011) Nature 470:198–203. With permission from Macmillan Publishers Ltd.
[Figure 12.3 panel labels: (a); (b) clonal amplification occurs inside microreactors; break microreactors and enrich for DNA-positive beads; load enzyme beads and centrifuge.]
Figure 12.3: An overview of the Roche/454 sequencing technique: 454 GS FLX sequencer workflow.
(a) Isolated genomic DNA is fragmented, ligated to adapters, and denatured into single strands. (b) Each fragment
is bound to beads in a proportion of 1:1 and each fragment–bead complex is isolated in droplets of a water/oil
mixture. PCR amplification is then performed within each droplet, resulting in beads carrying millions of copies
of a unique DNA template. The emulsion is broken down and the DNA strands are denatured; beads carrying
single-stranded DNA templates are enriched and are randomly deposited into wells of a fiber-optic slide. Clonally
amplified beads generated by emulsion PCR serve as sequencing features where the pyrosequencing is carried
out. (Adapted from Margulies M, Egholm M, Altman WE et al. [2005] Nature 437:376–380. With permission from
Macmillan Publishers Ltd.)
The 454 technology (http://www.454.com) was launched in 2005 by 454 Life Sciences when Margulies et al. published the entire genome of the bacterium Mycoplasma genitalium with 96% coverage and 99.96% accuracy in a single GS 20 run. In 2007, Roche Applied Science acquired 454 Life Sciences and introduced the second version of the 454 sequencer, the GS FLX, which generates more sequence per run with longer read lengths. The 454 GS FLX sequencer workflow is illustrated in Figures 12.3 and 12.4.
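The flow-by-flow logic of pyrosequencing (Figure 12.4) can be sketched in a few lines of Python. This is a minimal illustration and not the 454 base-calling software: the flow order TACG, the function names, and the idealized signal (one unit of light per incorporated nucleotide) are assumptions made for the example.

```python
def flowgram(read, flow_order="TACG", max_flows=400):
    """Simulate idealized pyrosequencing signals for a synthesized strand.

    One nucleotide type is flowed per round; the signal equals the number of
    consecutive bases incorporated in that round (0 if none).
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        # A homopolymer run is incorporated in a single flow.
        while pos < len(read) and read[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, run))
        flow += 1
    return signals


def call_bases(signals):
    """Convert (nucleotide, signal) pairs back into a base sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flowgram("TTAGGGC")
print(signals)
print(call_bases(signals))
```

Because a run of identical bases is read as one stronger light signal rather than as separate events, long runs are harder to count exactly, which is consistent with the insertion/deletion errors at homopolymers listed for Roche/454 in Table 12.1.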
[Figure 12.4 labels: capture bead; single-stranded DNA template; polymerase; dGTP; PPi; apyrase; sulfurylase + APS; ATP; luciferase; light; nucleotide added.]
Figure 12.4: The principles of pyrosequencing technology. DNA templates are linked to a capture bead and then exposed to multiple sequencing rounds. Only one type of nucleotide is added in each round. As each nucleotide (dGTP in this example) is incorporated into the DNA sequence, inorganic pyrophosphate (PPi) is released; sulfurylase uses the PPi and adenosine 5′-phosphosulfate (APS) to generate ATP. ATP is then used as a substrate for luciferase to generate light, which can be detected and quantified. (From Armougom F & Raoult D [2009] J Comp Sci Syst Biol 2:74–92.)
An alternative platform is the Applied Biosystems SOLiD (Supported Oligonucleotide Ligation and Detection) platform (http://www.appliedbiosystems.com). This uses short-read sequencing technology based on ligation. This approach was applied in 2005 to re-sequence an evolved strain of Escherichia coli and is similar to the 454 approach (above) in that the library is amplified by emulsion PCR on beads (Table 12.1). One of the advantages of the SOLiD technique is that it uses an offset sequencing primer strategy, so that each nucleotide in the sequence is interrogated twice. A given nucleotide in the template sequence will generate two different fluorescent signals based on the identity of the neighboring base. As a result, the false-positive rate for mutation detection is reduced, as a SNP will generate two color changes when compared with the reference sequence. At the end of a 6-day run, the SOLiD instrument is capable of generating 4 Gb of sequencing data. The SOLiD method is illustrated in Figure 12.5.
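The statement that a SNP produces two color changes under SOLiD two-base encoding can be checked with a short Python sketch. The numeric base codes and the XOR rule below follow the commonly described form of the two-base color scheme; the example sequences, the function names, and the use of the numbers 0 to 3 in place of dye names are assumptions for illustration.

```python
# Each dinucleotide maps to one of four colors (0-3); identical bases give color 0.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode a base sequence as the list of colors of its overlapping dinucleotides."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def color_differences(reference, read):
    """Return the color positions at which two equal-length sequences differ."""
    pairs = zip(to_colors(reference), to_colors(read))
    return [i for i, (x, y) in enumerate(pairs) if x != y]


reference = "ATGGCTA"
with_snp = "ATGACTA"  # hypothetical G>A substitution at the fourth base

print(to_colors(reference))
print(to_colors(with_snp))
print(color_differences(reference, with_snp))
```

A true substitution inside the read changes both dinucleotides that contain the altered base, so two adjacent colors differ from the reference. A single isolated color difference is therefore more likely to be a measurement error than a SNP.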
The Illumina Genome Analyzer produced by the Illumina Corporation (http://www.illumina.com) is a short-read sequencing platform. At the time of writing, this sequencing-by-synthesis process on an Illumina GA IIx appears to be the most widely used. The method is reported to generate read lengths of 35 bp with greater than 99% raw base accuracy and an overall throughput of approximately 5 Gb over a 3-day run (Figure 12.6).
[Figure 12.5 labels: (b) emulsion PCR with primers, polymerase, and mate-paired library coupled to beads; 3′ modification; (c) bead deposition; (d) sequencing by ligation on 1 μm beads: 1. prime and ligate; 2. image; 4. cleave off fluor; 5. repeat steps 1-4 to extend sequence (n ligation cycles); 6. primer reset; 8. repeat reset with n-2, n-3, n-4 primers; read positions 0 to 35 by primer round.]
The third generation of sequencing technology is now emerging and is expected to bring new insights into DNA sequences. Third-generation sequencing has two main advantages compared with second-generation technology (above). The first is that the PCR step is no longer needed before sequencing. Though PCR amplification increases the number of copies of each DNA fragment, it can introduce errors into the sequence. The ideal DNA sequencing platform, which is what the third-generation platforms aim to be, would combine high-throughput, rapid sequencing with the capability to sequence long stretches of DNA directly from a single DNA molecule, without any pre-sequencing PCR step, and would therefore avoid PCR-introduced errors. The second advantage is that the signal is captured in real time.
Though this technology is still developing, several prototypes exist, such as nanopore DNA sequencing, single-molecule real-time (RNA polymerase) sequencing, and the HeliScope sequencer, which pioneered high-throughput DNA sequencing starting from a single molecule. The HeliScope sequencer relies on the principle of true single-molecule sequencing, which does not require a clonal amplification step beforehand. The absence of an amplification step during sample preparation circumvents the problems of sequencing errors attributable to PCR artifacts, as stated above. This system is capable of sequencing up to 28 Gb in a single run of about 8 days, but it generates only short reads with a maximal length of 55 bases. The Helicos platform was first used in 2008 to sequence the 6407-base genome of the bacteriophage M13 and a year later to sequence an individual human genome (Figure 12.7). There is currently major competition in the sequencing arena and, as of 2015, not all of the original companies are still trading. Some have been absorbed into other companies and some have closed down completely. Consequently, the reader needs to be mindful that the availability of information from websites changes over time. Nevertheless, websites that remain open offer a good source of information.
There are a number of companies with an interest in nanopore (a tiny hole) sequenc-
ing and nanopore technologies, amongst which Oxford Nanopore Technologies
(https://www.nanoporetech.com) have declared an interest in DNA sequencing and
other activities. The US government recently released a number of grants for the fur-
ther development and use of this technology with five new grants in September 2013
alone. Though there is great interest, there have been a few problems. The speed at
which DNA is tracked through the nanopore makes the process of sequencing espe-
cially tricky. However, it appears that altering the wavelength of light that the system
employs can effectively slow down the flow and this may solve the problem. The poten-
tial for this technology in DNA sequencing is great.
Figure 12.5: The Applied Biosystems SOLiD sequencing-by-ligation method. Fragments of DNA are ligated
with oligonucleotide adaptors at each end and hybridized to complementary oligonucleotides attached to
magnetic beads (a). Beads are then placed in a water/oil emulsion where DNA amplification is performed (b).
Once amplification is finished, the beads are placed on a glass surface and entered into the sequencer (c). In
the sequencer a universal sequencing primer, complementary to the adaptor sequence, is used to trigger the
sequencing reaction, in which ligation cycles with fluorescently labeled degenerate probes are performed. Once the
probe anneals to the DNA template, a DNA ligase covalently binds the probe to the sequencing primer and the
fluorescence is recorded. The probe is then cleaved and another cycle starts. After seven rounds of sequencing,
the extended universal primer is removed and a new universal primer is added that is offset by one base (d). The
sequence of the read is inferred by interpreting the ligation results for the 16 possible dinucleotide interrogation
probes. (Courtesy of Applied Biosystems.) (See also color plate.)
[Figure 12.6 labels: (a); denature, cleave; (c) template strand; POL; fluor; incorporation; cleavage; block removal.]
Figure 12.6: The principles of the Illumina sequencing technique. DNA fragments are first ligated onto
oligonucleotide adaptors to form double strands (a). Adapter-modified, single-stranded DNA is then added to
a flow cell and immobilized by hybridization. Bridge amplification generates clonally amplified clusters (b). The
cluster fragments are denatured, annealed with a sequencing primer, and subjected to sequencing-by-synthesis
employing DNA polymerase and four reversible dye terminators. Once incorporated, the terminator stops the sequencing reaction; fluorescence is recorded, and the reaction restarts after cleavage of the incorporated dye terminator (c). (See also color plate.)
[Figure 12.7 labels: templates captured on poly-T oligomers; sequencing cycle: 1. synthesize (next base); 2. wash; 3. image; 4. cleave.]
Figure 12.7: The principles of the Helicos sequencing technique. DNA fragments are captured by poly-T oligomers linked to an array. During each sequencing cycle, a DNA polymerase and a single fluorescently labeled nucleotide are added to the array. The array is imaged, the fluorescent label is then removed, and the cycle is repeated.
[Figure 12.8 labels: vertical axis ticks 0, 10^2, 10^4, 10^6, 10^8; time axis 2001 to 2010. Platforms: ABI 3730xl capillary sequencer; 454 GS 20 pyrosequencer; Solexa/Illumina sequence analyzer; ABI SOLiD sequencer; Roche/454 Titanium, Illumina GA II; Illumina GA IIx, SOLiD 3.0; Illumina HiSeq 2000. Projects: draft human genome; HapMap Project begins; ENCODE Project begins; ENCODE Project pilot publications; 1000 Genomes and Human Microbiome projects begin; first tumor:normal genome publication; Watson genome publication; 1000 Genomes pilot and HapMap3 publications; human genetics syndromes publications.]
Figure 12.8: Changes in instrument capacity over the past decade and the timing of major sequencing
projects. The figure provides a perspective on a decade of advances in technologies applied to complex
disease and a time-line of the major research projects during the same period. The figure illustrates how
technology and science are linked. (From Mardis ER [2011] Nature 470:198–203. With permission from
Macmillan Publishers Ltd.)
[Figure 12.9 labels: construct shotgun library; hybridization; wash; pulldown; captured DNA; DNA sequencing; mapping, alignment, variant calling.]
Figure 12.9: Diagram showing workflow for whole-exome sequencing. The figure shows the sequence of events in whole-exome sequencing. The DNA is broken up into fragments that are hybridized, washed, and captured on beads before undergoing sequencing. (From Bamshad MJ, Ng SB, Bigham AW et al. [2011] Nat Rev Genet 12:745–755. With permission from Macmillan Publishers Ltd.)
to sequencing the whole genome. Restricting analysis to the protein-coding regions, that is exome sequencing, is more cost-effective than sequencing the entire genome. Whole-exome sequencing has been applied in the identification of germ-line mutations underlying some Mendelian disorders, somatic mutations in a variety of cancers, and de novo mutations in some neurodevelopmental disorders. In the future the greatest application may be in cancer therapy, where it may be useful to screen genes encoding protein targets before applying chemotherapy. These so-called enrichment strategies that target specific genomic regions save time and money: because the data sets are smaller than in whole-genome analysis, which includes non-coding regions, whole-exome analysis can be performed more rapidly. Use of exome as opposed to genome sequencing may speed up the transfer and application of sequencing to clinical practice.
Restricting analysis to the exome does introduce some limitations. This analysis does not look at mutations and polymorphisms within non-coding regions, and it is not able to detect most structural variants, such as chromosomal translocations with break points within introns. In contrast, whole-genome sequencing will identify all of the variation in the genome, not just that found in protein-coding regions (Figure 12.9).
[Figure 12.10 labels: panels (a), (b), (c); vertical axis –log10P on both plots; haplotypes numbered 1 to 6; right-hand plot labeled "After imputation".]
Figure 12.10: Imputation analysis. The two graphs illustrate a genotyped region before (on the left) and after (on the right) imputation has been applied. The central section of the figure illustrates in detail six different haplotypes, numbered 1 to 6, for a combination of sixteen bi-allelic genes illustrated as double rings. Each gene is polymorphic with two alleles, and eight of the genes have been successfully genotyped. In the genotyped genes the alleles are illustrated by gray shading (A allele) or no shading (white, B allele). The eight genes illustrated in black have not been genotyped. Imputation analysis can be applied to estimate the genotypes of the black genes on a haplotype from the known pattern of linkage disequilibrium with the surrounding genes that have been successfully genotyped. To do this, data are extracted from large genotype databases to fill in the “missing” genotypes. For example, haplotype 1 runs, top to bottom: white, unknown, unknown, unknown, white, unknown, gray, unknown, white, gray, gray, unknown, unknown, gray, unknown, white (or B-?-?-?-B-?-A-?-B-A-A-?-?-A-?-B). If we know that persons with this haplotype carry the white (B) alleles at positions 1, 5, 9, and 16 and the gray (A) alleles at positions 7, 10, 11, and 14, then the full sequence for this haplotype can be assigned. In the same way, using the known genotypes at positions 1, 5, 7, 9, 10, 11, 14, and 16 we can identify the unknown genotypes for the other eight genes on the remaining five haplotypes (haplotypes 2 to 6). These imputed data can be included in the analysis. Because linkage disequilibrium is so strong in certain areas of the human genome, imputation analysis from initial GWAS data can massively extend the studies. Thus an initial GWAS based on 300,000 to 400,000 SNPs can generate data for up to one million or more markers. (From Marchini J & Howie B [2010] Nat Rev Genet 11:499–511. With permission from Macmillan Publishers Ltd.) (See also color plate.)
substituted with predicted values based on the known levels of linkage disequilibrium.
When all missing values have been imputed, the resulting data set can then be analyzed
using standard techniques. The sort of databases used for this exercise include those pro-
duced by the 1000 Genomes Project and HapMap.
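The idea of filling in untyped positions from a reference panel can be sketched as a toy Python example. This is a deliberately simplified illustration and not the algorithm used by any real imputation program: it assumes phased haplotypes written as strings of A and B alleles, exact matching at typed positions, and a tiny hypothetical reference panel. Real methods work probabilistically and allow for recombination and genotyping error.

```python
from collections import Counter


def impute(study_haplotype, reference_panel, missing="?"):
    """Fill missing alleles using reference haplotypes that agree at all typed positions."""
    typed = [i for i, allele in enumerate(study_haplotype) if allele != missing]
    matches = [
        ref for ref in reference_panel
        if all(ref[i] == study_haplotype[i] for i in typed)
    ]
    if not matches:
        return study_haplotype  # no compatible reference haplotype; leave the gaps

    filled = []
    for i, allele in enumerate(study_haplotype):
        if allele != missing:
            filled.append(allele)
        else:
            # Take the most common allele at this position among the matches.
            filled.append(Counter(ref[i] for ref in matches).most_common(1)[0][0])
    return "".join(filled)


# Haplotype 1 from Figure 12.10, with untyped positions shown as "?".
study = "B???B?A?BAA??A?B"
# Hypothetical reference haplotypes (invented for this example).
panel = ["BABABAABBAABAABB", "AABABBBBAABBABAA"]

print(impute(study, panel))
```

Only the first panel haplotype agrees with the study haplotype at all eight typed positions, so its alleles are copied into the eight gaps.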
Table 12.2 Some Mendelian and a few complex diseases identified up to 2011 via exome sequencing.
Disorder | PubMed ID | Mode of inheritance | N(a) | Strategy | Gene(s)
Comparison of unrelated cases
Kabuki | 20711175 | AD | 10 | 10 cases/10 kindred | MLL2
Schinzel–Giedion | 20436468 | AD | 4 | 4 cases/4 kindred | SETBP1
Fowler | 20518025 | AR | 2 | 2 cases/2 kindred | FLVCR2
Sensenbrenner | 20817137 | AR | 2 | 2 cases/2 kindred | WDR35
Comparison of related cases
Miller | 19915526 | AR | 4 | 4 cases/3 kindred | DHODH
Retinitis pigmentosa | 21295283 | AR | 3 | 3 cases/1 kindred | DHDDS
Spinocerebellar ataxia | 21106500 | AD | 4 | Linkage + 4 cases/1 kindred | TGM6
Primary failure tooth eruption | 21404329 | AD | 4 | Linkage + 4 cases/1 kindred | PTH1R
TARP (talipes equinovarus, atrial septal defect, Robin sequence, and persistent left superior vena cava) | 20451169 | XLR | 2 | Linkage + 2 cases/2 kindred | RBM10
X-linked leukoencephalopathy | 21415082 | XLR | 2 | Linkage + 1 case/1 kindred | MCT8
Homozygosity mapping
Autoimmune lymphoproliferative syndrome | 21109225 | AR | 1 | 1 case/1 kindred | FADD
Complex I deficiency | 21057504 | AR | 1 | 1 case/1 kindred | ACAD9
Non-syndromic mental retardation | 21212097 | AR | 2 | 2 obligate carrier parents | TECR
Identification of de novo mutations
Sporadic mental retardation | 21076407 | Complex | 30 | 10 parent–child trios | Multiple
Autism | | Complex | 60 | 20 parent–child trios | Multiple
The table lists a number of diseases, the traits associated with them, and the genes thought to carry the responsible mutations.
(a) Number of exome-sequenced individuals.
healthy controls cannot be causative. Though this strategy can be exceptionally powerful for rare Mendelian disorders, it is less useful in complex disease. The assumption that a causative allele must be absent from the healthy population does not hold in complex disease, where many potentially causative alleles have been found in both diseased and healthy individuals. In addition, not all cases carry the causative allele in complex diseases.
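The filtering logic described above (keep variants shared by the cases, discard anything seen in healthy controls) can be sketched with Python sets. The variant identifiers are invented placeholders and the function name is hypothetical; real pipelines also filter on allele frequency, predicted effect, and call quality.

```python
def candidate_variants(cases, controls):
    """Return variants present in every case and absent from every control.

    cases, controls: lists of sets of variant identifiers, one set per individual.
    """
    shared = set.intersection(*cases) if cases else set()
    seen_in_controls = set().union(*controls)
    return shared - seen_in_controls


# Hypothetical variant calls for two affected individuals and one healthy control.
cases = [{"v1", "v2", "v3"}, {"v2", "v3", "v4"}]
controls = [{"v3", "v5"}]

print(candidate_variants(cases, controls))

# In complex disease not every case carries the risk allele, so the
# intersection across cases can collapse to nothing.
print(candidate_variants(cases + [{"v9"}], controls))
```

The second call illustrates why the approach works poorly for complex disease: a single case without the variant removes it from the candidate list, and a risk allele that is also present in a control is filtered out.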
Exome sequencing reduces the number of candidate genes tested. However, as with GWAS,
these studies still require large sample pools to provide strong statistical evidence for asso-
ciation and even though costs are currently falling, they are still too high. Therefore, two
different strategies have been proposed to reduce these costs: family-based sequencing and
basic extreme-trait design.
Family-based sequencing
Family-based studies are impractical for most complex diseases. Even when familial cases occur, we cannot assume that these rare cases represent the whole disease population, so family-based testing is rarely an option in complex disease. However, if families are tested, the strategy is to sequence the index case (i.e. the first identified case) and any co-affected family member(s) to identify overlapping variants. The more closely related the affected relative is to the index case, the greater the number of candidate genes that will need to be tested, because closer family members share more genes. In contrast, the more distantly related the cases are, the lower the level of genetic similarity and the smaller the number of candidate alleles that will need to be genotyped.
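The effect of relatedness on the size of the candidate list can be approximated with a one-line calculation. The sketch below assumes that a rare heterozygous variant in the index case is shared with a relative of degree d with probability (1/2)^d (one half for first-degree relatives, one eighth for first cousins); the starting count of 400 rare variants is a hypothetical figure chosen only for illustration.

```python
def expected_shared(n_rare_variants, degree):
    """Expected number of the index case's rare variants shared with a relative.

    degree: 1 for first-degree relatives (e.g. siblings), 2 for second-degree,
    3 for third-degree (e.g. first cousins).
    """
    return n_rare_variants * 0.5 ** degree


n = 400  # hypothetical number of rare variants in the index case
for degree, label in [(1, "sibling"), (2, "uncle or aunt"), (3, "first cousin")]:
    print(label, expected_shared(n, degree))
```

Under these assumptions, sequencing an affected first cousin rather than an affected sibling cuts the list of shared candidates to roughly a quarter of its size.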
Basic extreme-trait design
Basic extreme-trait design offers more potential. In this situation cases are carefully
selected from one or both ends of a phenotype distribution and then sequenced. This
strategy is based on the studies of Cohen et al., who in 2004 demonstrated the effec-
tiveness of sequencing candidate genes at the extreme ends of a phenotype distribution
to find rare alleles involved in the risk for a complex trait. An obvious example of this
extreme phenotype study is seen in individuals who are highly exposed to HIV-1, but
remain uninfected. As there is survival advantage in those who carry genetic variants that
contribute to the exposed/uninfected trait, these alleles will be enriched in the popula-
tion. This means that studies of small sample size can be performed with a higher degree
of confidence than if the alleles were less common. Identified candidate alleles can then
be genotyped for confirmation in much larger samples. In HIV-1, where genetic variants
contribute to protection against HIV-1 infection and HIV-1/AIDS, the high impact of
the protective variants/alleles means that whole-exome (or whole-genome) sequences of
a few individuals may be powerful enough to identify candidate variants/alleles associ-
ated with the disease trait. Once identified, the genotype of the identified candidate
can be re-investigated in a much larger group to confirm or refute the findings of the
sequencing report.
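Selecting individuals from the ends of a phenotype distribution can be sketched as follows. The function name, the 5% cut-off, and the synthetic trait values are assumptions for illustration; in a real study the thresholds would be set by the study design.

```python
def select_extremes(phenotypes, fraction=0.05):
    """Return (low_tail, high_tail) lists of individual IDs.

    phenotypes: dict mapping individual ID to a quantitative trait value.
    fraction: proportion of the sample taken from each end of the distribution.
    """
    ranked = sorted(phenotypes, key=phenotypes.get)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Synthetic trait values for 100 hypothetical individuals.
phenotypes = {f"id{i}": float(i) for i in range(100)}
low, high = select_extremes(phenotypes, fraction=0.05)
print(low)
print(high)
```

Only the individuals in the two tails would be sequenced, and any candidate alleles found would then be genotyped in a much larger sample for confirmation.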
The use of NGS technology applied to the new wave of GWAS has the potential to dis-
cover the entire spectrum of sequence variations in a sample of well-phenotyped indi-
viduals. However, the use of NGS platforms in GWAS studies has raised some concerns.
The error rate of the NGS platforms is higher than the Sanger sequencing methods and,
because not all errors are random, they may obscure true associations or generate false-
positive associations if they are frequent enough. In addition, NGS technologies can also
produce data pools with high missing rates. In practice, however, high missing rates are
less of a problem because imputation methods may be used to recover (or back-fill) the
missing data. Authors using imputation should make sure that any data generated by this
means are clearly acknowledged in their publications.
[Figure labels: nucleus; cross-link and fractionate chromatin; ChIP: enriched DNA binding sites; sequence.]
production. In a typical metagenomic study, the genomic content of all the microbes in a sample is sequenced using NGS technology. This results in millions of DNA fragment sequences. Computational methods are then employed to predict the taxonomic affiliation of these fragments by comparing them against a reference database such as those of the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov), where the sequences of microorganisms are collected. Sequences are assigned to the corresponding microorganism and compared with those of others to identify the level of biodiversity present in the sample. This approach has been used to identify novel viral pathogens, detect viral drug resistance mutations, and diagnose bacterial infections. However, given the relatively high cost of high-throughput sequencing, these techniques are unlikely to replace traditional microbiological techniques for routine pathogen identification in the immediate future, though it may not be long before metagenomic screening is standard. Metagenomic technology is at present the main approach employed by the Human Microbiome Project (http://commonfund.nih.gov/hmp/), an initiative launched in 2008 that aims to characterize the microbial communities associated with both health and disease in different parts of the human body, such as nasal passages, oral cavities, skin, and gastrointestinal and urogenital tracts.
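The assignment of sequence fragments to organisms can be illustrated with a toy k-mer matching sketch in Python. This is one simple way of comparing reads with reference sequences, not the method of any particular metagenomics tool; the reference sequences, organism names, and the choice of k = 4 are invented for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(read, references, k=4):
    """Return the reference sharing the most k-mers with the read, or None."""
    read_kmers = kmers(read, k)
    scores = {
        name: len(read_kmers & kmers(genome, k))
        for name, genome in references.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical reference sequences standing in for a database.
references = {
    "organism_A": "ATGCGTACGTTAGC",
    "organism_B": "GGCCTTAAGGCCTTAA",
}
reads = ["CGTACGTT", "CCTTAAGG", "TTTTTTTT"]

# Tally assignments to summarize the composition of the sample.
print(Counter(classify(read, references) for read in reads))
```

Reads that share no k-mer with any reference are left unassigned (None), which is how sequences from organisms missing from the database would appear in this simplified scheme.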
[Figure 12.12 labels: genetic variation; lifestyle; diet; epigenetics; bacteria; disease.]
Figure 12.12: Genetic and non-genetic factors impacting on disease risk. The figure illustrates the idea
that multiple factors influence susceptibility to complex disease and not all are genetic. The factors interact in
complex ways. In order to understand the genetics of complex disease we need to apply a holistic approach that
takes account of these other factors.
[Figure 12.13 labels: happy mouse; no Lactobacillus in yogurt; no increase in GABA neurotransmitter.]
pathogens. The close interaction suggests there may be a level of symbiosis. Once again, this is not a new concept. For example, millions of people consume food supplemented with Lactobacillus, which is thought to be beneficial in maintaining homeostasis between the host and the bacterial flora in the gut. Looking at this from a genetic perspective provides a different focus. Simply altering the diet of mice by giving Lactobacillus-enriched yogurt can modify the levels of expression of the γ-aminobutyric acid (GABA) neurotransmitter receptor. The mice with the yogurt supplement appear to become happier mice (Figure 12.13). The Lactobacillus must gain something from this relationship, which appears to be a true symbiotic relationship. The balance of such relationships will determine whether a bacterium is beneficial to the host or not and may lead to some traits being promoted, while others may be restrained. Once again we are talking about susceptibility and resistance, i.e. risk, the one word that defines complex disease.
pinpoint the genes behind common diseases. Phase 1 typed 270 individuals from four geographically diverse populations (Table 12.3), identified approximately 1.3 million SNPs, and was published in 2005 (http://www.hapmap.org). Phase 2 of the project aimed to increase the resolution of the haplotype map produced in phase 1 by adding a further 2.1 million SNPs to the test pool.
The third phase of the HapMap project (HapMap3) published findings in 2010 based
on 1.6 million common SNP genotypes in 1184 individuals from 11 different popula-
tions (Table 12.3). In addition, 10 regions of 100 kb were sequenced in 692 genomes.
HapMap3 includes both SNPs and copy number polymorphisms, and pinpoints pop-
ulation-specific differences for low-frequency variants. The idea behind phase 3 was to
produce high-resolution haplotype maps (including both the most common and some of
the less common variations from different populations) aimed at helping us to examine
patterns of linkage disequilibrium within different populations. Haplotypes are an essen-
tial piece in the jigsaw for imputation analysis and the design of association studies.
The map produced by the HapMap project describes common patterns of human genetic variation and thus it has been extremely valuable in the design of association studies, accelerating the search for genetic factors in human disease. The value of the HapMap should not be underestimated and it is likely to guide research for years to come. With approximately 38 million SNPs existing in the human population, there is a tremendous potential for diversity (for n biallelic SNPs there are in principle 2^n haplotypes); however, fewer haplotypes than expected are observed in practice due to the extensive linkage disequilibrium across the genome. This has an important consequence for association studies: if a causal variant is not directly tested, it can still be identified indirectly through a tested marker that is in linkage disequilibrium with it. The haplotype map also confirms the idea that the human genome contains recombination hot spots. This means that linkage disequilibrium is not continuous (or equal) across the genome, but there are regions where the recombination rate is
Population | Code
Utah residents with ancestry from Northern and Western Europe | CEU
Yoruba in Ibadan, Nigeria | YRI
Japanese in Tokyo, Japan | JPT
Han Chinese in Beijing, China | CHB
African ancestry in South-Western USA | ASW
Chinese in metropolitan Denver, CO, USA | CHD
Gujarati Indians in Houston, TX, USA | GIH
Luhya in Webuye, Kenya | LWK
Maasai in Kinyawa, Kenya | MKK
Mexican ancestry in Los Angeles, CA, USA | MXL
Tuscans in Italy | TSI
higher than expected (hot spots) and others where it is lower than expected (cold spots). The human major histocompatibility complex (MHC) is a noted area for cold spots, with higher than expected levels of conservation of haplotypes.
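The contrast between the number of possible haplotypes and the number actually observed can be made concrete with the standard two-locus measures of linkage disequilibrium, D and r^2. The haplotype counts below are invented for illustration, and the function name is hypothetical.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) for two biallelic SNPs from haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B  # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared


# Number of haplotypes possible in principle for n biallelic SNPs.
n = 10
print(2 ** n)

# Complete LD: only two of the four possible two-SNP haplotypes are observed.
print(ld_stats(50, 0, 0, 50))
# No LD: all four haplotypes occur at the frequencies expected by chance.
print(ld_stats(25, 25, 25, 25))
```

When r^2 is close to 1, genotyping one SNP effectively reports the genotype of the other, which is the basis for detecting an untested causal variant indirectly and for imputation.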
All of the data produced by the HapMap project are freely available for unrestricted public use at the HapMap website (http://www.hapmap.org). This site offers bulk downloads of the data set, as well as interactive data browsing and analysis tools for data mining and visualization. In 2015, the HapMap remains a key resource for researchers designing genetic association studies aimed at identifying genetic variants associated with common disease or investigating responses to common therapeutic drugs and other environmental factors.
[Figure 12.14 map labels. Europe: CEU (A. 85; B. all LCL; C. 45m/40f; D. 78t/3d/4s); IBS (A. 14; B. all LCL; C. 7m/7f; D. 14t); GBR (A. 89; B. all LCL; C. 41m/48f; D. 3d/86s); FIN (A. 93; B. all LCL; C. 35m/58f; D. 93s); TSI (A. 98; B. all LCL; C. 50m/48f; D. 98s). Americas: MXL (A. 66; B. all LCL; C. 31m/35f; D. 59t/3d/4s). East Asia: JPT (A. 89; B. all LCL; C. 50m/39f; D. 89s). Other labels: Africa; HapMap 3; sampling locations Finland, Great Britain, Spain, Italy, Utah, Southwest and Los Angeles (USA), Tokyo (Japan), Beijing and Hu Nan and Fu Jian Provinces (China).]
Figure 12.14: Populations collected within the HapMap and 1000 Genomes Projects. Populations collected
as part of the HapMap Project are shown in blue and for the 1000 Genomes Project in green. The populations
involved include European: IBS (Iberian Populations in Spain), GBR (British from England and Scotland), CEU
(Utah residents with ancestry from Northern and Western Europe), FIN (Finnish in Finland), TSI (Tuscans in
Italy); East Asian: JPT (Japanese from Tokyo), CHB (Han Chinese in Beijing, China), CHS (Han Chinese South);
African: ASW (African ancestry in Southwestern USA), YRI (Yoruba in Ibadan, Nigeria), LWK (Luhya in Webuye,
Kenya); and Americans: MXL (Mexican ancestry in Los Angeles, CA, USA), PUR (Puerto Ricans in Puerto Rico),
CLM (Colombians in Medellín, Colombia). (A) Total number of samples sequenced. (B) Source of DNA [blood
or lymphoblastoid cell lines (LCL)]. (C) Gender composition (male/female). (D) Number that are part of mother–
father–child trios (t), parent–child duos (d), or singletons (s); for trios and duos, only parent samples were
sequenced. (From The 1000 Genomes Project Consortium [2012] Nature 491:56–65. With permission from
Macmillan Publishers Ltd.)
has cataloged a large number of human genomes, it is not yet complete. The extended
project will only be complete when the scientists have sequenced 2500 individuals.
project is to bring together genotype and phenotype across the human genome. The proj-
ect is likely to continue for several years. The data generated so far can be viewed at
https://www.encodeproject.org/publications. Some scientists are critical of the ENCODE
project and not all believe that it will achieve all of its aims. However, time will tell.
[Figure 12.15 labels: chromosome 2q12–q22; IL1R2; IL1F; other genes in IL-1 cluster; scale markers 10 kb, 10 Mb, 450 kb.]
Figure 12.15: A map of some of the members of the IL-1 family located on chromosome 2q22. Only a
selected few of these genes are discussed in the example of a complex genetic system. The figure illustrates that
there are many more family members to consider, though not all are likely to be functional.
both low-level expression genotypes (2/2 in each case), then the potential activity of IL-1β will remain high. However, where the genotypes for IL1B and IL1R1 are both high expression genotypes (1/1 and 1/1), IL-1β activity may be lower even when the genotype is that associated with high IL-1β production (IL1B 1/1). In this model, high levels of the decoy receptor will effectively reduce IL-1β binding to the active receptor and higher levels of the receptor antagonist will interfere with the activity of the functional receptor. Overall, IL-1β activity will be determined by the complex interaction of these four elements. This interactive genetic minefield is illustrated in Figure 12.16.
Of course this is an oversimplified model for illustration only (though the genes do exist).
There are further levels of complexity in this system, e.g. IL-1β is produced in the cell in
the form of pre-IL-1β and needs to be converted by the enzyme caspase 1 [also known as
IL-1-converting enzyme (ICE)] before it can be exported from the cell. ICE may also be
polymorphic and so conversion of the pre-IL-1β is likely to be affected by this.
[Figure 12.16 labels: agonist IL-1β; functional receptor IL-1R1; receptor antagonist IL-1RN; decoy receptor IL-1R2.]
Figure 12.16: Simplified illustration of interaction in the IL-1 system. The figure illustrates the activity of the products of four IL-1 genes: the agonist IL-1β, the functional receptor IL-1R1, the receptor antagonist IL-1RN, and the decoy receptor IL-1R2. IL-1β interacts with the functional receptor IL-1R1, but this can be blocked by either IL-1RN (the receptor antagonist) or the decoy receptor (IL-1R2). This simple figure indicates how systems may interact and, coupled with the knowledge that the genes that encode these proteins may be polymorphic, allows us to build a simple mental model of the potential complexity of this system. The gene family as illustrated in Figure 12.15 is more complex.
Even in this grossly simplified model it is possible to understand how systems biology
is important in helping us to understand the genetics of complex disease. It helps us to
understand why not all polymorphisms have functional significance and why in many
cases the risk associated with specific polymorphisms is small. Multiple interacting poly-
morphisms and extended haplotypes are likely to be important. Systems often also include
a level of redundancy whereby the effects of a dysfunctional gene or genotype can be com-
pensated for by another gene operating in the same, or in some cases another, pathway.
When either of these situations exists it is possible that the effects of the susceptibility alleles are
only seen when the system is stressed by other factors, such as diet, illness, and infection.
Thus, most individuals carrying the risk alleles can cope with normal life and only when
stressed does the system collapse, leading to an increased risk of illness.
Conclusions
The past is part of the future, and the present is also part of the past and the future. Science
goes through phases that illustrate this concept well. Understanding the genetics of complex
disease began with old-fashioned technologies testing single genes and haplotypes in
case-control association studies in the 1970s, 1980s, and early 1990s. Then came RFLP
and PCR analysis, which were faster and more efficient, but equally challenging. Just as we
were settling down, the new millennium was upon us with the HGM, HapMap, and SNP
Map. From HapMap and SNP Map, rapid genome-wide case-control and linkage studies
(GWAS and GWLS) evolved. We are now entering a new phase: the beginning of
the third age in complex disease genetics, one in which techniques like whole-genome
sequencing will be commonplace. These new technologies will produce data that will
increasingly be applied in clinical practice. However, as with all new developments, there
are pros and cons. The choice in the future
will be whether to use whole-genome or whole-exome sequencing. Choices between low- and
high-resolution genotyping will be made in initial studies, but are unlikely to be on the
menu for large-scale studies. Missing data will be handled by imputation programs such as
IMPUTE, which fill in the blanks.
There is clearly more to do in the study of the genetics of complex disease. The past 10
years have revolutionized the way we investigate the genetics of complex disease, but not
the reasons why we do it. We are beginning to understand just how complex the genetics
of common non-Mendelian disease may be. It is clear in 2015 that scientific isolationism
cannot be tolerated in genetics. If we are to understand how our genome contributes to
disease and use this information to the patient’s advantage, we must also understand how
factors other than our genes impact on human disease. Future strategies will also need to
include epigenetics and the environment, especially diet (discussed earlier in this book),
bacteria, and metagenomics.
We need to think of systems in biology and we need to interact more with our fellow bio-
scientists in order to identify the complex relationships that make the human body work.
A systems approach to genetics is required so that we can understand how genetic varia-
tions can predispose to, or protect from, complex traits. It is not going to be simple, but
we have made a start, and with hard work and perseverance great advances in biomedicine
are possible.
Studies will be able to take advantage of ongoing projects such as the 1000 Genomes
Project and ENCODE (and modENCODE), which were designed to inform studies
of human genetics, including those involving disease. Of course we cannot ignore the
cornucopia of information provided by the HGP and the disease studies performed so far.
Better, faster, and more precise technologies will also contribute.
Ultimately, progress will be measured in terms of the extent to which our research fulfils
the three major promises of the Human Genome Mapping Project, i.e. that genetics will
be used in disease diagnosis, that genetics will be used in patient management (especially
in the development of individualized and novel therapies), and that genetics will be used
to inform the debate on disease pathology.
These three promises are linked. Understanding disease pathogenesis goes hand in hand
with better patient management and development of individualized and novel therapies.
The use of complex disease genetics in diagnosis is perhaps questionable in all but a few
cases, but there is certainly evidence that studies are helping us to understand disease
pathology and there are some examples where allelic variation may determine the selection
of specific therapeutic options. Some would call this translational medicine, but it is not
a new idea; scientists in medicine have always sought to relate basic findings to medical
practice. “Bench-to-bedside” is a community goal and not just the dream of a few.
Overall, great advances have been made in understanding complex disease. This is espe-
cially true in some areas, such as pharmacogenetics and in immune inflammatory diseases
such as Crohn’s disease. However, we should not be misled by wishful science, and we must
remember the words of the satirist and philosopher H. L. Mencken, who stated in 1917 “there
is always an easy solution for every human problem (that is) neat, plausible and wrong”.
The future lies with unpicking and reassembling these complex human problems and in
doing so reaching our final goal, i.e. to advance medical practice through patient diagnosis
and treatment, with a hope that some diseases (at least) may one day be eradicated.
Further Reading
Books the first complete genome sequence from an
Armstrong L (2014) Epigenetics. Garland African DNA sample; see also Rasmussen et al.
Sciences. This is an excellent book on epi- (2010), Wang et al. (2008), and Wheeler et al.
genetics for the graduate and final-year under- (2008).
graduate scientist with an interest or need to Blencowe BJ, Ahmad S & Lee LJ (2009)
study this subject. Current-generation high-throughput sequenc-
Collins F (2010) The Language of Life: DNA ing: deepening insights into mammalian tran-
and the Revolution in Personalised Medicine. scriptomes. Genes Dev 23:1379–1386. This
Harper Collins. This is an excellent book to delve paper reports on sequencing the transcriptome;
into to look at the whys and wherefores of gene see also Wang et al. (2009).
hunting in complex disease. Branton D, Deamer DW, Marziali A et al.
Frank L (2011) My Beautiful Genome: (2008) The potential and challenges of nano-
Exposing Our Genetic Future, One Quirk at a pore sequencing. Nat Biotechnol 26:1146–1153.
Time. Oneworld. This is an excellent book for Braslavsky I, Hebert B, Kartalov E & Quake SR
the lay public and scientist alike. It is interest- (2003) Sequence information can be obtained
ing, controversial, and humorous, and perfect from single DNA molecules. Proc Natl Acad Sci
for the student of complex disease genetics. USA 100:3960–3964.
Maitland-van der Zee A-H & Daly AK (2012) Campbell PJ, Stephens PJ, Pleasance ED et al.
Pharmacogenetics and Individualized Therapy. (2008) Identification of somatically acquired
Wiley. rearrangements in cancer using genome-wide
Spector T (2012) Identically Different: Why massively parallel paired-end sequencing. Nat
You Can Change Your Genes. Weidenfeld & Genet 40:722–729. This is one of three papers
Nicholson. on pair-ended mapping that demonstrate the
Strachan T & Read AP (2011) Human use and application of this technique; see also
Molecular Genetics, 4th edn. Garland Science. Korbel et al. (2007) and Stephens et al. (2009).
Strachan T, Goodship J & Chinnery P (2014) Cohen JC, Kiss RS. Pertsemlidis A et al. (2004)
Genetics and Genomics in Medicine. Garland Multiple rare alleles contribute to low plasma
Science. levels of HDL cholesterol. Science 305:869–
872. This study demonstrates the effectiveness
of sequencing candidate genes at the extreme
Articles ends of the phenotype distribution to find rare
Anderson CA, Soranzo N, Zeggini E & alleles.
Barrett JC (2011) Synthetic associations are Dean M, Carrington M, Winkler C et al.
unlikely to account for many common disease (1996) Genetic restriction of HIV-1 infection
genome-wide association signals. PLoS Biol and progression to AIDS by a deletion allele of
9 (1):e1000580. This paper suggests that the the CKR5 structural gene. Hemophilia Growth
importance of synthetic associations generated and Development Study, Multicenter AIDS
by multiple variants has perhaps been over- Cohort Study, Multicenter Hemophilia Cohort
stated; see also Wray et al. (2011). Study, San Francisco City Cohort, ALIVE
Barski A, Cuddapah S, Cui K et al. (2007) Study. Science 273:1856–1862. This report is
High-resolution profiling of histone methyl- of great importance in understanding the inter-
ations in the human genome. Cell 129:823– play between host genetics and the pathogen.
837. This paper describes the use of the ChIP Here the host is protected from infection and
sequencing method to characterize histone development of the disease by the presence of
and transcription factor binding sites on CD4 the 32-bp deletion CCR5-Δ32; see also Liu
T cells; see also Mikkelsenet et al. (2007) and et al. (1996) and Samson et al. (1996).
Robertson et al. (2007). de Souza N (2012) The ENCODE Project. Nat
Bentley DR, Balasubramanian S, Swerdlow HP Methods 9:1046.
et al. (2008) Accurate whole human genome Dickson SP, Wang K, Krantz I et al. (2010)
sequencing using reversible terminator chem- Rare variants create synthetic genome-wide
istry. Nature 456:53–59. This paper describes associations. PLoS Biol 8 (1):e1000294. This
paper, along with Maher (2008) and Schork use of genotype imputation, which is increas-
et al. (2009), poses the question of whether ingly used in studies of complex disease either
studies so far have identified the true suscep- to fill in the gaps or to extend the study; see also
tibility alleles for complex disease or are they Marchini et al. (2007).
synthetic associations? Liu R, Paxton WA, Choe S et al. (1996)
Eid J, Fehr A, Gray J et al. (2009) Real-time Homozygous defect in HIV-1 co-receptor
DNA sequencing from single polymerase mol- accounts for resistance of some multiply-
ecules. Science 323:133–138. exposed individuals to HIV-1 infection. Cell
Freeman JL, Perry GH, Feuk L et al. (2006) 86:367–377. This report is of great importance
Copy number variation: new insights in in understanding the interplay between host
genome diversity. Genome Res 16:949–961. genetics and the pathogen. Here the host is
This paper illustrates how second-generation protected from infection and development of
sequencing will be used to identify CNV in the disease by the presence of the 32-bp dele-
the genome; see also Redon et al. (2006) and tion CCR5-Δ32; see also Dean et al. (1996) and
Zhang et al. (2009). Samson et al. (1996).
Harris TD, Buzby PR, Babcock H et al. (2008) Lupski JR, Reid JG, Gonzaga-Jauregui C et al.
Single-molecule DNA sequencing of a viral (2010) Whole-genome sequencing in a patient
genome. Science 320:106–109. This paper with Charcot–Marie–Tooth neuropathy. N
describes the use of the Helicos platform (third- Engl J Med 362:1181–1191.
generation sequencing). Maher B (2008). The case of the missing heri-
Hayden E (2008). International genome proj- tability. Nature 456:18–21. This news feature
ect launched. Nature 451:378–379. This paper along with papers by Dickson et al. (2010)
announces the launch of the 1000 Genomes and Schork et al. (2009), poses the question of
Project. whether studies so far have identified the true
susceptibility alleles for complex disease or are
Howie BN, Donnelly P & Marchini J (2009) they synthetic associations?
A flexible and accurate genotype imputation
method for the next generation of genome-wide Maher CA, Kumar-Sinha C, Cao X et al.
association studies. PLoS Genetics 5:e1000529. (2009) Transcriptome sequencing to detect
gene fusions in cancer. Nature 458:97–101.
International Human Genome Sequencing This study demonstrates the potential use of
Consortium (2001) Initial sequencing and transcriptome sequencing in clinical practice.
analysis of the human genome. Nature
409:860–921. This paper, together with that of Marchini J, Howie B, Myers S et al. (2007) A
Venter et al. (2001), describes the preliminary new multipoint method for genome-wide asso-
sequence of the first whole human genome. ciation studies by imputation of genotypes. Nat
Genet 39:906–913. This paper discusses the use
Johnson PL & Slatkin M (2008). Accounting for of genotype imputation, which is increasingly
bias from sequencing error in population genetic used in studies of complex disease either to fill
estimates. Mol Biol Evol 25:199–206. This report in the gaps or to extend the study; see also Li
discusses some of the concerns that have been et al. (2009).
raised about the use of NGS in GWAS.
Margeridon-Thermet S, Shulman NS, Ahmed
Korbel JO, Urban AE, Affourtit JP et al. (2007) A et al. (2009) Ultra-deep pyrosequencing of
Paired-end mapping reveals extensive struc- hepatitis B virus quasispecies from nucleoside
tural variation in the human genome. Science and nucleotide reverse-transcriptase inhibi-
318:420–426. This is one of three papers on pair- tor (NRTI)-treated patients and NRTI-naive
ended mapping that demonstrate the use and patients. J Infect Dis 199:1275–1285. This
application of this technique; see also Campbell paper describes a study of a viral drug resistance
et al. (2008) and Stephens et al. (2009). mutation using a metagenomics approach; see
Le T, Chiarella J, Simen BB et al. (2009) Low- also Nakamura et al. (2008) and Palacios et al.
abundance HIV drug-resistant viral variants in (2008).
treatment-experienced persons correlate with Margulies M, Egholm M, Altman WE et al.
historical antiretroviral use. PLoS One 4:e6079. (2005) Genome sequencing in microfabri-
Li Y, Willer C, Sanna S & Abecasis G (2009) cated high-density picolitre reactors. Nature
Genotype imputation. Ann Rev Genomics Hum 437:376–380. The paper describes the use
Genet 10:387–406. This paper discusses the of a novel system for rapid “shotgun” gene
enabled automated sequencing of DNA to disease. The impact of these maps and studies
become a reality. cannot be overestimated.
Schork NJ, Murray SS, Frazer KA & Topol EJ Tomlinson I, Webb E, Carvajal-Carmona L
(2009) Common versus rare allele hypotheses et al. (2007) A genome-wide association scan
for complex diseases. Curr Opin Genet Dev of tag SNPs identifies a susceptibility variant
19:212–219. This paper, along with Maher for colorectal cancer at 8q24.21. Nat Genet
(2008) and Dickson et al. (2010), poses the 39:984–988. This report indicates that tagged-
question of whether studies so far have identi- SNPs can be located a long way from putative
fied the true susceptibility alleles for complex causative mutations in complex disease, but
disease or are they synthetic associations? establishing a relationship with function can be
Stephens PJ, McBride DJ, Lin ML et al. (2009) difficult; see also Prokunina-Olsson L & Hall
Complex landscapes of somatic rearrange- JL (2009).
ment in human breast cancer genomes. Nature Venter JC, Adams MD, Myers EW et al. (2001)
462:1005–1010. This is one of three papers The sequence of the human genome. Science
on pair-ended mapping that demonstrate 291:1304–1351. This paper, together with
the use and application of this technique; see that of the International Genome Consortium
also Korbel et al. (2007) and Campbell et al. (2001), describes the preliminary sequence of
(2008). the first whole human genome.
Sultan M, Schulz MH, Richard H et al. (2008) Wang J, Wang W, Li R et al. (2008) The dip-
A global view of gene activity and alternative loid genome sequence of an Asian individual.
splicing by deep sequencing of the human tran- Nature 456:60–65. This paper describes the
scriptome. Science 321:956–960. This paper first complete genome sequence from an Asian
reports on the application of transcriptome DNA sample; see also Bentley et al. (2008),
sequencing in human kidney and B cell lines; Rasmussen et al. (2010), and Wheeler et al.
see also Maher et al. (2009). (2008).
The 1000 Genomes Project Consortium Wang Z, Gerstein M & Snyder M. (2009)
(2010) A map of human genome variation RNA-Seq: a revolutionary tool for transcrip-
from population-scale sequencing. Nature tomics. Nat Rev Genet 10:57–63. This paper
467:1061–1073. reports on sequencing the transcriptome; see
The 1000 Genomes Project Consortium (2012) also Blencowe et al.
An integrated map of genetic variation from Wheeler DA, Srinivasan M, Egholm M et al.
1,092 human genomes. Nature 491:56–65. (2008) The complete genome of an individual
This is the most recent report from the 1000 by massively parallel DNA sequencing. Nature
Genomes Project at the time of writing. More 452:872–876. This paper describes the first
up-to-date information will be available on the complete genome sequence from a European
website cited below. DNA sample; see also Bentley et al. (2008),
The ENCODE Project Consortium (2012) An Rasmussen et al. (2010), and Wang et al.
integrated encyclopedia of DNA elements in (2008).
the human genome. Nature 489:57–74. This is
Wray NR, Purcell SM & Visscher PM (2011)
the original report from the ENCODE Project
Synthetic associations created by rare vari-
reporting from phase II, which started in 2007.
ants do not explain most GWAS results. PLoS
The International HapMap Consortium (2005) Biol 9:e1000579. This paper suggests that the
A haplotype map of the human genome. Nature importance of synthetic associations generated
437:1299–1320. by multiple variants has perhaps been over-
The International HapMap Consortium (2007) stated; see also Anderson et al. (2011).
A second generation human haplotype map of Zhang F, Gu W, Hurles ME & Lupski JR (2009)
over 3.1 million SNPs. Nature 449:851–861. Copy number variation in human health, dis-
The International HapMap Consortium (2010) ease and evolution. Annu Rev Genomics Hum
Integrating common and rare genetic variation Genet 10:451–481. This paper illustrates how
in diverse human populations. Nature 467:52– second generation sequencing will be used to
58. The HapMap data can be accessed through identify copy number variation in the genome;
the weblink in the Online sources and provides see also Redon et al. (2006) and Freeman et al.
a very useful database for all studies in complex (2006).
are often referred to as professional antigen-presenting cells. All of these three cell types
express HLA class II molecules and a number of other co-stimulatory molecules on their
surface required for activation of naive T cells.
Antigenic peptides – refers to short antigen protein sequences that are presented to T
cells for the immune response. They can be derived from self-antigens or from non-self-
antigens (e.g. pathogens).
Appreciable frequency – can be any value suggested to be significant for a study. Generally
refers to numbers in studies, size of difference reported, and the level of expected probability
value. Also a term used in the definition of polymorphic genes as opposed to mutations.
Ascertainment bias – distorted recruitment (usually of individuals for a study, either as
controls, disease cases, or both).
Association(s) – tendency of a marker and a trait to occur together more often than
expected by chance. Association is technically a statistical observation and is not a genetic
phenomenon; however, association is used in genetic studies and can also indicate linkage
disequilibrium.
Association analysis (studies) – attempts to find genetic associations between genetic
markers i.e. alleles, SNPs, mutations, or other genetic variation and diseases by comparing
the frequencies of markers in two different groups.
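The comparison of marker frequencies in two groups can be sketched with a simple 2 × 2 table.
The example below is a minimal illustration only: the counts are hypothetical, the function names
are our own, and the χ² statistic is calculated without a continuity correction.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: cases (a with marker, b without),
    controls (c with marker, d without)."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the same 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts: 30 of 100 cases and 15 of 100 controls carry the marker.
cases_with, cases_without = 30, 70
controls_with, controls_without = 15, 85

print(round(odds_ratio(cases_with, cases_without, controls_with, controls_without), 2))      # 2.43
print(round(chi_square_2x2(cases_with, cases_without, controls_with, controls_without), 2))  # 6.45
```

With one degree of freedom, a χ² value above 3.84 corresponds to P < 0.05 before any correction
for multiple testing.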
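Autophagy – a regulated process of cellular self-digestion in which cell components are broken
down and recycled as a means of coping with adverse circumstances or during infection. It is
primarily a survival mechanism rather than a form of programmed cell death.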
Autosomal dominant disorder or disease – a disorder or disease (discord is also used)
showing a Mendelian pattern of inheritance whereby only one parental genome needs to
pass on the disease-causing mutation to their offspring for the trait to occur.
Autosomal recessive disorder or disease – a disorder or disease (discord is also used)
showing a Mendelian pattern of inheritance whereby both parental chromosomes need to
pass on the disease-causing mutation to their offspring for the trait to occur.
Autosome – any chromosome other than the sex chromosomes.
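Balanced structural abnormalities – chromosomal rearrangements (e.g. balanced translocations
or inversions) in which there is no net loss or gain of genetic material, and so usually no
associated loss or gain of function.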
Balancing selection – occurs where mutant alleles confer increased fitness on the popula-
tion, but only in heterozygotes and not in homozygotes.
Basal cell carcinoma – common form of skin cancer; rarely fatal.
Bayesian – a form of mathematical calculation based on Bayes’ theorem.
Bonferroni’s correction – a mathematical correction applied by statisticians to allow for
multiple testing within a study: each uncorrected P value is multiplied by the number of
comparisons made. It is mostly used in case-control studies. It is much
criticized, but it has been widely used in the past.
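As a minimal sketch of the arithmetic, each uncorrected P value is multiplied by the number of
tests performed and capped at 1. The P values and the function name below are hypothetical.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted P values: each P is multiplied by the
    number of tests and capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical uncorrected P values from four comparisons in one study.
raw = [0.001, 0.02, 0.04, 0.30]
adjusted = bonferroni(raw)

# Report which comparisons remain below 0.05 after correction.
for p, pc in zip(raw, adjusted):
    print(p, pc, pc < 0.05)
```

In this example three of the four uncorrected values are below 0.05, but only the first remains
so after correction.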
Bottleneck effect – this occurs when a founder population is reduced in size as a result of
a catastrophic event and it leads to reduced genetic diversity in the population.
Candidate gene-based association study – a genetic study based on investigation of one
or more identified “candidates.” This is a hypothesis-constrained study.
Carcinogenesis – literally the creation of cancer. The process by which normal cells
become cancer cells.
been used to describe sequencing without the need for cloning of the DNA fragment
under investigation.
Directional selection – a situation in natural selection where a particular phenotype is
favored and the allele frequency moves in one direction.
Disease cassette – the concept that a number of risk alleles carried on the same haplotype
may each contribute to disease susceptibility.
Disease-causing mutation – a mutation that shows a direct causal link with disease (see
Mutation).
Disease phenotype – the pattern of a disease, including signs, symptoms, progression,
and response to treatment.
Disease profile – the idea that a disease may arise when we inherit a series of different alleles,
with different patients having different personal profiles of risk alleles in complex disease
(see also personal portfolio).
Dispermy – two sperm fertilizing a single egg. A common cause of triploidy.
Dizygotic twins – twins who arise as a result of independent fertilization of two different
eggs by two different sperm at the same time. Dizygotic twins are no more genetically alike
than other siblings in the same family; they do not carry an identical genome. Also see
Monozygotic twins.
DNA – abbreviation for deoxyribonucleic acid.
DNA methylation – conversion of cytosine in DNA to 5-methylcytosine. This is an
important process in the regulation of gene expression.
Duplications – increase in the number of copies of a gene or a DNA sequence. Where there
are duplications in specific genes, individuals may carry more than one copy of a gene. This
does not always have any influence on the phenotype such as in circumstances where the dupli-
cated copy is not expressed. However, duplications can have interesting effects on phenotype.
Electrostatic charge – the charge on an amino acid: positive, negative, or neutral. Charge
is important in considering protein structures, formation, and function.
ENCODE – Encyclopedia of DNA Elements. An ongoing project to identify the functional
elements in the human genome; the companion project targeting selected model organisms is
called modENCODE.
Endocytosis – uptake of extracellular materials into cells. Endosomes are formed from
segments of the cell membrane and capture extracellular materials. These are then trans-
ported through the cell. This process is particularly important in immunity.
Endoplasmic reticulum (ER) – an intracellular organelle that forms an interconnecting
membrane network continuous with the nuclear envelope. There are two forms of the ER: smooth
(SER) and rough (RER). The RER is involved with protein synthesis, while the SER is involved
with lipids and detoxification.
Enrichment strategies – targeting of specific selected genomic regions in the treatment of
disease by using data from studies of somatic and de novo mutations identified in whole-
exome sequencing.
Epidemiological – in the context of this book this is the study of health and disease in
populations – considering causes, response, and outcomes.
Epigenetic(s) – means above, outside or in addition to genetics and is usually used to refer
to heritable effects not produced by changes in the DNA sequence. DNA methylation is
the most often quoted form of epigenetics. The wider context for this growing sub-branch
of genetics is to be found in all things outside of “genetics.”
Epigenetic regulation – regulation of gene expression by epigenetic mechanisms, DNA methylation in particular.
Epigenome-wide association studies (EWAS) – association studies of epigenetic marks performed
on a genome-wide basis; such studies have started and more will follow in the future.
Epigenomics – the growing interest in the wider consideration of genetics, especially
epigenetics.
Epistasis – means standing above. In this case, we are concerned with the interaction
of genes in pathways. Where there is epistasis, there is a sequence of activation of genes.
Failure in the early stages due to genetic variation would be expected to cause failure fur-
ther down the pathway. This is an important concept in systems biology.
Eugenics (Eugenics Movement) – an idea conceived by Francis Galton (the cousin of
Charles Darwin) based on the principle that social status is an inherited trait and therefore
the upper classes should do everything possible to conserve themselves, whereas the lower
classes are simply the unfortunate consequence of their genes. This misconception had
major historical consequences.
Exome – the part of the genome made up of the exons, i.e. the sequences that encode proteins.
Exome sequencing – restriction of genome sequencing to these coding regions only; this
reduces cost and time.
Exon(s) – DNA sequences in genes that encode the protein/peptide sequences. Also see
intron(s).
Extended major histocompatibility complex (xMHC) – the major histocompatibility
complex (MHC) after 2004, at which point the MHC was extended to include the hemo-
chromatosis gene HFE (telomeric of HLA-A). The extension added approximately 3.5 Mb
to the existing MHC.
Factorial (!) – the product of a whole number multiplied in sequence by every lower whole
number down to 1, e.g. 5! = 5 × 4 × 3 × 2 × 1 = 120. Used in Fisher’s exact
probability test. The symbol “!” is the mathematical sign for factorial.
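The calculation can be written out in a few lines. This is a minimal sketch; the function is our
own and is checked against the factorial function supplied by Python's standard library.

```python
import math


def factorial(n):
    """Product of all whole numbers from n down to 1 (0! is defined as 1)."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


print(factorial(5))                       # 120
print(factorial(5) == math.factorial(5))  # True
```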
False discovery rate – the expected proportion of false positives among all of the positive
(apparently significant) results reported in a study.
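As a minimal sketch with hypothetical counts, the proportion of false discoveries in a single
study can be calculated as below. Strictly, the false discovery rate is the expected value of
this proportion, and in a real study the number of false positives is not known and has to be
estimated; the function name is our own.

```python
def false_discovery_proportion(false_positives, true_positives):
    """Fraction of all positive findings that are false."""
    total_positives = false_positives + true_positives
    if total_positives == 0:
        return 0.0  # no positive findings, so nothing can be falsely discovered
    return false_positives / total_positives


# Hypothetical study: 100 associations declared significant, 5 of them false.
print(false_discovery_proportion(false_positives=5, true_positives=95))  # 0.05
```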
False negative – acceptance of the null hypothesis for an allele that later turns out to be
incorrect. In GWAS, this means dismissing a genetic association that later turns out to be
true.
False positive – the rejection of the null hypothesis for an allele that later turns out to
be true. In GWAS, this means accepting the presence of a genetic association that is later
proven to be incorrect.
Familial (family studies/familial disease) – studies based on the collection of informa-
tive families.
Family-based association test (FBAT) – a form of transmission disequilibrium testing
(TDT).
Fisher’s exact probability test – a statistical method for testing for associations usually
applied when any expected value in the contingency table is less than 5, though not all studies
follow this rule. The χ² test itself was introduced by Karl Pearson; Yates, who worked with
Fisher, later added the continuity correction.
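For a 2 × 2 table the exact probability can be calculated directly from binomial coefficients,
which are themselves built from factorials. The sketch below is illustrative only: the counts
are hypothetical, the function names are our own, and the two-sided P value uses one common
convention (summing every table with the same margins that is no more likely than the observed
table).

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided P value: sum the probabilities of all tables with the same
    margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    observed = table_probability(a, b, c, d)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= observed * (1 + 1e-9):  # tolerance for floating-point ties
            p_value += p
    return p_value


# Hypothetical small study: 3 of 4 cases and 1 of 4 controls carry the allele.
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
```

With so few individuals the difference between 3 of 4 and 1 of 4 is far from significant, which
is why small studies rarely give convincing results on their own.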
Fixation – the situation that occurs when through natural selection in a population there
is a change in the gene pool, such that where there were once two alleles, there is now only
one allele. In this situation the gene is said to be fixed, i.e. invariant.
Fluorescence in situ hybridization (FISH) – in situ hybridization using fluorescently
labeled DNA or RNA.
Founder effect – a high frequency of a particular allele due to small numbers in the origi-
nal population (founder population).
Frameshift – a shift in the reading frame of the remaining DNA when a number of bases that is
not a multiple of three is deleted from (or inserted into) a coding sequence. This
can have consequences for the gene product as, for example, with the CCR5-Δ32 mutation.
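The effect on the reading frame can be illustrated with a toy example. The sequence below is
hypothetical (it is not a real gene) and the function names are our own; the point is simply
that removing three bases leaves the downstream triplets intact, whereas removing two changes
every triplet that follows.

```python
def codons(sequence):
    """Split a DNA sequence into complete triplets, reading from the first base."""
    return [sequence[i:i + 3] for i in range(0, len(sequence) - len(sequence) % 3, 3)]


def delete_bases(sequence, start, length):
    """Remove `length` bases beginning at 0-based position `start`."""
    return sequence[:start] + sequence[start + length:]


# Hypothetical toy sequence (not a real gene).
original = "ATGGCCAAGTTTGGACTGTAA"

print(codons(original))                      # reading frame of the intact sequence
print(codons(delete_bases(original, 3, 3)))  # 3-base deletion: one codon lost, frame kept
print(codons(delete_bases(original, 3, 2)))  # 2-base deletion: downstream codons all change

# A 32-bp deletion such as CCR5-Δ32 is not a multiple of three, so it shifts the frame.
print(32 % 3)  # 2
```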
Gametocytes – precursors to the male and female gametes.
Gene – a basic unit of inheritance. A functional DNA unit that determines the phenotype
of an individual and segregates in pedigrees according to Mendel’s laws.
Gene duplications – see Duplication.
Gene flow – transfer of alleles or genes between populations through migration.
Gene pool – all the genes (or a selection) in a specified population.
Genetic anticipation – the tendency of the severity of a condition to increase over generations,
often due to an increase in the copy number of a repeated DNA sequence from generation to
generation.
Genetic divergence – accumulation of genetic variation from two or more ancestral pop-
ulations over time giving rise to even more genetic variation.
Genetic drift – random changes in gene frequencies over generations.
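Drift can be illustrated with a small simulation. The sketch below assumes a simple
Wright-Fisher-style sampling scheme (each generation is a random draw of gene copies from the
previous one, with no selection, mutation, or migration); the population size, starting
frequency, and function name are hypothetical choices for illustration.

```python
import random


def simulate_drift(start_freq, population_size, generations, seed=None):
    """Track one allele's frequency when each generation is a random sample
    of 2N gene copies drawn from the previous generation."""
    rng = random.Random(seed)
    copies = 2 * population_size  # diploid population: two copies per person
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        carried = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = carried / copies
        history.append(freq)
    return history


# Hypothetical small population of 50 individuals followed for 100 generations.
trajectory = simulate_drift(start_freq=0.5, population_size=50, generations=100, seed=1)
print(trajectory[:5], trajectory[-1])
```

Once the frequency reaches 0 or 1 it stays there, which is the situation described under
Fixation; the smaller the population, the sooner this tends to happen.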
Genetic imprinting – when the expression of a gene is determined by the parental ori-
gin of that gene. Thus, expression can be different if the same allele is inherited from the
paternal or maternal route.
Genetic portfolio – the sum of an individual’s genotype, including the high-, low-, and
medium-risk alleles for all of the genes in the genome or in research reports for all of the
alleles tested.
Genetic protection – alleles that reduce the risk of a disease.
Genetic selection – the concept that survival is determined in part by genetic variation in
the population. This is a central pillar of Darwinism.
Genetic underclass – the distinction of a subgroup as inferior based on their genetic
makeup. Also see Eugenics Movement and Chapter 11.
Genome – the complete set of human DNA stored in our chromosomes (22 pairs plus X
and Y) and our mitochondria.
Genome sequencing – sequencing the entire genome as opposed to sequencing only
selected sections.
Genome project – an international project to sequence the genome of humans and other
species, and to catalogue variation within the human genome.
Genome-wide association analysis/studies (GWAA/GWAS) – analysis of genetic
markers across the whole genome to identify significant differences in allele distribution
between groups (usually healthy controls and disease cases, though not always). GWAS
has been highly successful in recent years.
Morbidity – something affecting the quality of life, usually associated with some tissue
damage or loss of function.
Mosaic mutations – a mosaic mutation is one that does not occur in every cell (see
Mosaicism).
Mosaicism – situation when only some of the cells in the body possess the mutant gene. It
can also apply to a situation when only a proportion of the mitochondria in a cell possess
the mutant gene. This is especially relevant in cancers.
Multicellular eukaryotic organisms – organisms that have nucleated cells (i.e. each
cell has a nucleus – eukaryote) and are composed of more than one cell (multicellular).
Multiple linear regression model – attempts to model the relationship between two or
more explanatory variables and a response variable by fitting a linear equation to observed
data.
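In its usual form the model can be written as follows, where the symbols are our own notation:
y is the response variable for individual i, the x terms are the p explanatory variables, the
β terms are the coefficients estimated from the data, and ε is the residual error.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i
```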
Multihit hypothesis – the idea that a haplotype may carry more than one risk allele; thus
common haplotypes may have several “hits” perhaps accounting for the strong association
with certain haplotypes (e.g. HLA 8.1).
Multipoint mapping – this may involve actual genotyping on a large scale or imputation
(see Impute) on a large scale (i.e. multipoint imputation).
Multiple sclerosis – an inflammatory autoimmune disease in which the myelin sheath
that insulates the nerves is degraded, leading to a variety of symptoms.
Mutation – technically, the majority of genetic variation arises as a result of mutation
irrespective of frequency. However, in clinical genetics some use the term mutation when
genetic variation gives rise to a disease (i.e. it is a disease-causing mutation). Others pre-
fer to use the term for genetic variations that are found at a low frequency in the popula-
tion (usually less than 1%). Genetic variations at a frequency of greater than 1% tend to
be called polymorphisms. Both terms are used interchangeably in places in this book, but
polymorphism is preferred in complex disease studies.
Myelin sheath – a material that insulates the electrically charged neurons and is broken
down in multiple sclerosis.
Myopathy – general term used to describe muscle pain or adverse reactions involving the
muscle.
Myositis – general term for the inflammation of the muscles which can be common in
some autoimmune diseases.
Natural killer (NK) cell – large granular cytotoxic lymphocytes, particularly important
in innate immunity.
Natural selection – Darwin’s major work; the idea (or hypothesis) of evolution driven by
natural selection.
New Germania – a region in Paraguay where a group of German settlers set up a com-
munity, with a view to maintaining race purity and thereby improving the breeding stock.
Originally suggested and supported by Wagner and Elisabeth Nietzsche, the sister of the
famous philosopher.
Next-generation sequencing (NGS) technologies – technologies used to sequence DNA
in ever-shorter periods of time. This does not refer to one technique, but to all of the
recent developments and some of those still to come.
Non-Mendelian complex diseases – diseases that do not have a known pattern of inheri-
tance, but clearly have a heritable (genetic) component.
Non-parametric linkage analysis – linkage analysis in which the pattern of inheritance
is not predetermined or preset. This can be used for linkage analysis in complex disease.
Non-random segregation – the distribution of allelic sequences between daughter cells at
meiosis. Segregation should be random, but it is not due to the presence of hot spots and
cold spots in the genome.
Non-synonymous SNPs – changes in the DNA sequence that result in amino acid differences.
Unlike synonymous SNPs (conservative mutations), these SNPs do have a biological
effect. See Single nucleotide polymorphism.
Novel somatic mutations – new mutations such as those that occur in cancer. It is pre-
dicted that next-generation sequencing (NGS) techniques will have a greater ability to
detect these.
Nuclear genome – the DNA encapsulated in the nucleus (i.e. the 22 pairs of chromo-
somes plus X and Y); not the mitochondrial DNA.
Nucleosome – see the Nuclear genome.
Null hypothesis (H0) – the null hypothesis states that there is no difference (null) between
the tested groups or populations (see Alternative hypothesis).
Neutrophils – granular white cells. They often have segmented nuclei and are the most
abundant of all white cells in the body. They are important in innate immunity, in particular.
Odds ratio (OR) – a crude measure of relative risk, though more commonly used than
relative risk (see Relative risk).
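A minimal sketch of how the odds ratio is usually computed from a 2 × 2 table of counts, with relative risk shown for comparison. The function names and the counts are hypothetical and chosen only for illustration.

```python
def odds_ratio(case_pos, case_neg, control_pos, control_neg):
    """Odds ratio from a 2 x 2 table of counts.

    case_pos / case_neg: cases with / without the risk allele.
    control_pos / control_neg: controls with / without the risk allele.
    """
    if case_neg == 0 or control_pos == 0:
        raise ValueError("odds ratio is undefined when an off-diagonal cell is zero")
    return (case_pos * control_neg) / (case_neg * control_pos)


def relative_risk(exposed_affected, exposed_total, unexposed_affected, unexposed_total):
    """Relative risk: risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = exposed_affected / exposed_total
    risk_unexposed = unexposed_affected / unexposed_total
    if risk_unexposed == 0:
        raise ValueError("relative risk is undefined when the unexposed risk is zero")
    return risk_exposed / risk_unexposed


if __name__ == "__main__":
    # Hypothetical case-control counts: 30/100 cases and 10/100 controls carry the allele.
    print(round(odds_ratio(30, 70, 10, 90), 3))  # (30 * 90) / (70 * 10) = 3.857
```

The odds ratio needs only the four cell counts, which is why it can be used in case-control studies where the true risk in each group is not directly observed.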
Offset sequencing primer strategy (OSPS) – part of the SOLiD short-read sequencing
method. This is illustrated in Figure 12.5.
Oligogenic – a few (oligo) genes (genic). In complex disease this term refers to
a disease or subgroup of patients with a disease where there is more than one susceptibility
allele, but the number of susceptibility alleles is limited. There is no numerical value that
can be applied to the term “oligo” (see also Polygenic).
OMIM (Online Mendelian Inheritance in Man) – a comprehensive and free-to-access
database of human genes and phenotypes.
OncoArray – a new chip array that is designed to carry 500,000 tagged SNPs for genotyp-
ing patients with cancers, including breast, ovarian, prostate, colorectal, and lung cancers
(see also iCOGS Illumina chip).
Over-dominance – heterozygous advantage, where the heterozygote is at greater advan-
tage than the homozygote.
Pair-wise kinship estimates – a measure of the level of genetic similarity within a kinship
group and an indicator of the level of inbreeding in ancestral populations.
Pair-wise probability (of identity by descent) – this test would be performed using
PLINK.
Paired-end mapping – paired-end sequencing or mapping is increasingly used to assess
genome rearrangements and structural variations on a global scale. Using this technique, a
single DNA fragment is sequenced from each end, producing two reads that can overlap.
Paneth cells – major cell type in the epithelium of the small intestine.
Paracentric inversions – a stable outcome after the incorrect repair of two breaks in a
single chromosome. Here, the fragments rejoin in reverse order and there is no loss of
genes, only a change in the sequence – so this is an inversion. Paracentric inversion occurs
when the repair does not involve the centromere.
Parametric linkage (analysis) – linkage analysis where the parameters are preset (or
known), i.e. the pattern of inheritance has been proposed.
Pathology – the science of seeking, observing, and recording normal and damaged tissues
in medicine. In practice, it is the detectable changes in and damage to the tissue caused
by a disease.
Pearson’s χ2 test – the most frequently used form of the χ2 test first applied in 1900, and
later developed by Yates and Fisher. The test is based on the normalization of the sum of
the squared deviations between observed and expected results.
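The calculation described above can be sketched as follows: expected counts are derived from the row and column totals, and each squared deviation is divided by its expected count before summing. This is an uncorrected Pearson statistic (no Yates continuity correction); the function name and example counts are hypothetical.

```python
def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for a table of observed counts.

    table is a list of rows, e.g. [[cases_pos, cases_neg], [controls_pos, controls_neg]].
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                raise ValueError("expected count of zero; the statistic is undefined")
            # Squared deviation, normalized by the expected count.
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


if __name__ == "__main__":
    # Hypothetical counts: expected cells are 20, 80, 20, 80, giving a statistic of 12.5.
    print(pearson_chi2([[30, 70], [10, 90]]))
```

The statistic is then compared with the χ2 distribution for the stated degrees of freedom to obtain a P value.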
Pedigree file – a collection of family data.
Penetrance – a measure of the occurrence of the trait in individuals with the risk allele or,
vice versa, of the risk allele in those with the trait.
Pericentric inversions – occur when two breaks occur at different ends of the same chro-
mosome and the two fragments created rejoin the terminal fragments. The gene sequence
along the chromosome is altered, but the overall content remains the same.
Permutation procedure – permutation procedures provide a computationally intensive
approach to generating significance levels empirically.
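A minimal sketch of a permutation procedure, assuming a simple comparison of mean allele counts between cases and controls. Group labels are shuffled repeatedly and the empirical significance level is the fraction of shuffles that give a difference at least as large as the one observed. The function name, the default number of permutations, and the test statistic are illustrative assumptions.

```python
import random


def permutation_p_value(case_values, control_values, n_permutations=10000, seed=0):
    """Empirical two-sided P value for a difference in group means."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n_cases = len(case_values)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(case_values) - mean(control_values))
    pooled = list(case_values) + list(control_values)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign case/control labels
        diff = abs(mean(pooled[:n_cases]) - mean(pooled[n_cases:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            hits += 1
    # Adding one to numerator and denominator avoids reporting a P value of exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical risk-allele counts (0, 1, or 2) per individual.
    cases = [2, 1, 2, 2, 1, 2, 1, 2]
    controls = [0, 1, 0, 1, 0, 0, 1, 0]
    print(permutation_p_value(cases, controls, n_permutations=2000))
```

The cost grows with the number of permutations, which is why the approach is described as computationally intensive.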
Personal profile – idea that we each inherit a personal profile of risk alleles for complex
disease (see also Disease portfolio).
Phagocytosis – internalization of matter from outside the cell by specialized cells called
phagocytes.
Pharmacogenetics – the study of genetic variation and response to different pharmacological
agents, including adverse drug reactions. This term was first used in 1959 by
Friedrich Vogel.
Pharmacogenomics – the study of the whole genome from a pharmacological perspective
(see Pharmacogenetics).
Phase (phasing) – a method that can be used to determine which SNPs are inherited
together.
Phenotype – the result of genetic variation; this can be seen in the expression of a trait,
character, or disease (disease phenotype).
Phenotyping – the observation and recording of the expression of a trait.
Philadelphia chromosome – this results from an unusual translocation between chromo-
some 9 and chromosome 22. It is found in 95% of patients with chronic myelogenous
leukemia and 25% of patients with acute lymphoblastic leukemia.
Phylogenetic tree – a diagram showing variation between species and populations.
pi-hat – a statistical test that can be performed in PLINK.
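As a hedged illustration that goes slightly beyond the entry above: pi-hat is commonly reported as the estimated proportion of alleles that two individuals share identical by descent (IBD). It is computed from the estimated probabilities of sharing 0, 1, or 2 alleles IBD as P(IBD = 2) + 0.5 × P(IBD = 1). The function and argument names below are hypothetical.

```python
def pi_hat(z0, z1, z2):
    """Proportion of alleles shared IBD, from the probabilities of sharing 0, 1, or 2 alleles."""
    if abs(z0 + z1 + z2 - 1.0) > 1e-6:
        raise ValueError("IBD probabilities should sum to 1")
    return z2 + 0.5 * z1


if __name__ == "__main__":
    # Theoretical sharing probabilities for some standard relationships.
    print(pi_hat(0.0, 0.0, 1.0))    # identical twins or duplicate samples
    print(pi_hat(0.25, 0.5, 0.25))  # full siblings
    print(pi_hat(1.0, 0.0, 0.0))    # unrelated individuals
```

In quality control, pairs with unexpectedly high values are usually inspected as possible duplicates or undeclared relatives.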
PLINK software – open-source GWAS toolkit designed to perform basic large-scale anal-
ysis: http://pngu.mgh.harvard.edu/purcell/plink/.
Selective pressure – any factor that reduces reproductive success within a population.
Environmental factors include infectious agents, e.g. the malaria parasite.
Sequencing-by-synthesis – developed by 454 Life Sciences using what is called a pol-
ony sequencing protocol. This was an early form of next-generation sequencing (NGS)
moving towards rapid methods for DNA sequencing. See also SOLiD.
Semi-dominant – same as co-dominant; heterozygotes have an intermediate phenotype
compared with homozygotes.
Serotyping (serological typing) – HLA typing using antibodies in human serum to iden-
tify specific HLA antigens. Multiple sets of microtiter trays with 60, 72, or 96 wells are
used to determine each phenotype at varying degrees of resolution.
Sex-linked – Mendelian disease involving the two sex chromosomes X or Y.
Shared epitope – said to occur when two or more alleles share a sequence that encodes
amino acids in a region of functional importance within a molecule. It is most frequently
applied to studies of HLA to explain why there are often several different HLA alleles
associated with susceptibility to the same disease.
Sibling relative risk (λ) – a crude calculation of the increased risk of disease that a sib-
ling of an affected case is likely to have. The calculation is based on a comparison of the
incidence or prevalence in cases versus the healthy population (controls). The quality of
the result depends on the size of the population surveyed and the frequency of the trait.
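A minimal sketch of the calculation, assuming the usual form in which the risk to siblings of affected cases is divided by the prevalence in the general population. The function name and the figures are hypothetical.

```python
def sibling_relative_risk(affected_siblings, total_siblings, population_prevalence):
    """Sibling relative risk: risk to siblings of cases divided by population prevalence."""
    if total_siblings <= 0:
        raise ValueError("total_siblings must be positive")
    if not 0 < population_prevalence <= 1:
        raise ValueError("population_prevalence must be in the interval (0, 1]")
    sibling_risk = affected_siblings / total_siblings
    return sibling_risk / population_prevalence


if __name__ == "__main__":
    # Hypothetical survey: 30 of 500 siblings affected, population prevalence 0.4%.
    print(sibling_relative_risk(30, 500, 0.004))
```

With small surveys or rare traits the numerator rests on very few affected siblings, which is why the quality of the estimate depends on the size of the population surveyed and the frequency of the trait.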
Significance threshold – the preset measure at which a result will be considered as significant.
For example, in the WTCCC1 study any numerical value lower than the significance
threshold is significant (i.e. less than 5 × 10⁻⁷) and any value higher is not significant.
Single nucleotide polymorphism (SNP) – self-defined units; variations in single nucleotides,
commonly polymorphisms, used extensively in genetic studies.
Slow acetylators – a group of individuals who inherit genetic variation that causes them
to express this phenotype.
Sample call rate – the number of samples that are successfully genotyped in a study.
SNP – see Single nucleotide polymorphism.
SNP Map – a map of the positions of all SNPs in the human genome.
Social Darwinism – Social Darwinists generally argue that the strong should see their
wealth and power increase, while the weak should see their wealth and power decrease.
The precise definition of who belongs in either group is a matter of debate amongst those
who hold these outdated beliefs.
SOLiD (Supported Oligonucleotide Ligation and Detection) – a next-generation
sequencing (NGS) platform developed by Life Technologies. Performs short-read
sequencing based on “sequencing-by-ligation” (see also Sequencing-by-synthesis). It also
uses the offset sequencing primer strategy.
Somatic chromosomal abnormalities – abnormalities in any chromosome found in a
somatic cell.
Statistical confidence – the level of acceptance of a probability value. Similar to the sta-
tistical threshold, but can be expressed as a range of confidence intervals.
Statistical error (type I and type II) – errors in statistical calculation caused by low
sample size, leading to false positives (type I) and false negatives (type II). These are not errors
in the application of the statistical test, but errors introduced in the planning of the study
that are picked up through statistical analysis (see Statistical power).
Statistical power – in the context of this book, the likelihood of finding true-positive
associations and reporting true-negative associations.
Stratification score analysis – measure of the degree of stratification within a population.
Structural abnormalities – a missing, extra, or irregular fragment of DNA.
Structured association – relates to population stratification.
Sum rule – a complex mathematical calculation used in genetic studies; best left to the
statistical geneticist.
Super-coiled – the multiple rounds of coiling of DNA strands.
Survival of the fittest – concept based on Darwin’s theory of evolution. The term was first
coined by Herbert Spencer.
Synthetic association – low-frequency causal variants. A form of indirect association.
Tagged SNPs – selected single nucleotide polymorphisms (SNPs) used in GWAS in
particular. Each SNP is selected based on the likelihood of amplification during testing
and proximity to a marker gene.
TAP – abbreviation for transporter associated with antigen processing.
T cells – white cells or leukocytes that mature and undergo selection in the thymus.
T cell receptor (TCR) – the receptor for peptide antigens on mature T cells. A critical
element in the formation of the T cell synapse.
Telomerase – an enzyme that adds DNA repeat sequences at the 3′ end of the DNA molecule
in the telomere region.
Telomeric/telomere – the end of the chromosome.
Tetraploidy – possession of four copies of one or more chromosomes. All cases are lethal.
Thalassemias – a group of inherited autosomal recessive blood disorders.
Thrifty gene hypothesis – hypothesis proposed by JV Neel in 1962 to explain the grow-
ing incidence of diabetes in the Western World.
Thymus – area of the body where T cells develop, mature, and go through selection.
Toll-like receptors (TLRs) – receptors important in innate immunity.
Trait – a term used to describe a character or phenotype, which can be a disease, response
to treatment, subgroup within a disease, or a behavioral character in a study population.
Transcriptome – that part of the genome that encodes transcribed genes.
Trans expression – the expression of two different alleles from different loci on the same
chromosome, but from different parents, resulting in the formation of a heterodimer in
trans as opposed to a heterodimer in cis. This can create a different molecular structure
than expected if proteins were only able to interact in cis. See Cis expression. Particularly
relevant in creating diversity of variation in HLA DQ and DP molecules.
Transfer RNA – serves as a physical link between DNA and RNA and the amino acid
sequence by loading amino acids onto the messenger RNA in peptide generation.
[Figure 12.5 diagram, panels (a) to (d). Labels: fragment or mate-paired library with primers,
polymerase, and coupled beads; emulsion PCR; 3′ modification; bead deposition. Ligation cycle
steps shown: 1. prime and ligate; 2. image (excite, fluorescence); 4. cleave off fluor; 5. repeat
steps 1–4 to extend sequence; 6. primer reset; 8. repeat reset with n-2, n-3, n-4 primers. Read
positions 0–35 are plotted against primer round, with bridge probes.]
Figure 12.5: The Applied Biosystems SOLiD sequencing-by-ligation method. Fragments of DNA are ligated
to oligonucleotide adaptors at each end and hybridized to complementary oligonucleotides attached to
magnetic beads (a). The beads are then placed in a water/oil emulsion, where DNA amplification is performed (b).
Once amplification is finished, the beads are deposited on a glass surface and loaded into the sequencer (c). In
the sequencer a universal sequencing primer, complementary to the adaptor sequence, is used to start the
sequencing reaction, in which ligation cycles with fluorescently labeled degenerate probes are performed. Once a
probe anneals to the DNA template, a DNA ligase covalently joins the probe to the sequencing primer and the
fluorescence is recorded. The probe is then cleaved and another cycle starts. After seven rounds of sequencing,
the extended universal primer is removed and a new universal primer is added that is offset by one base (d). The
sequence of the read is inferred by interpreting the ligation results for the 16 possible dinucleotide interrogation
probes. (Courtesy of Applied Biosystems.)
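The final step of the caption, inferring the read from dinucleotide probe results, can be illustrated with a minimal sketch. It assumes the commonly described two-base (color-space) convention in which the 16 dinucleotides share four dye colors and the color equals the XOR of 2-bit base codes (A=0, C=1, G=2, T=3); this mapping and the function names are assumptions for illustration and are not taken from the figure.

```python
# Toy two-base (color-space) encoding and decoding.
# Assumption: color = code(base1) XOR code(base2), with A=0, C=1, G=2, T=3.
BASES = "ACGT"


def encode_colors(sequence):
    """Return one color (0-3) per overlapping dinucleotide in the sequence."""
    return [BASES.index(a) ^ BASES.index(b)
            for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and a color list."""
    sequence = [first_base]
    for color in colors:
        previous = BASES.index(sequence[-1])
        # Each color is shared by four dinucleotides, so the previous base
        # is needed to pick out the next one.
        sequence.append(BASES[previous ^ color])
    return "".join(sequence)


colors = encode_colors("ATGGC")
read = decode_colors("A", colors)
```

Because each color is ambiguous on its own, decoding depends on knowing the first base, which in practice comes from the adaptor sequence.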
[Figure 12.6 image. Panel (c) labels: denature, cleave; incorporation; fluor cleavage; block removal; template strand.]
Figure 12.6: The principles of the Illumina sequencing technique. DNA fragments are first ligated to
oligonucleotide adaptors to form double strands (a). Adaptor-modified, single-stranded DNA is then added to
a flow cell and immobilized by hybridization. Bridge amplification generates clonally amplified clusters (b). The
cluster fragments are denatured, annealed with a sequencing primer, and subjected to sequencing-by-synthesis
employing DNA polymerase and four reversible dye terminators. Once incorporated, the terminator stops
the sequencing reaction and the post-incorporation fluorescence is recorded. The reaction restarts as soon as
the dye and the blocking group are cleaved from the incorporated terminator (c).
[Figure 12.10 image. Panels (a) and (c): plots with –log10P axes, one labeled "After imputation"; central panel: haplotypes numbered 1 to 6.]
Figure 12.10: Imputation analysis. The two graphs illustrate a genotype sequence before (on the right) and
after (on the left) imputation has been applied. The central section of the figure illustrates in detail six different
haplotypes, numbered 1 to 6, for a combination of thirteen bi-allelic genes illustrated as double rings. Each gene
is polymorphic; six of the genes have been successfully genotyped. There are two alleles for all of the genes (both
those genotyped and those not genotyped). The alleles are illustrated by gray shading (A allele) or no shading
(white, B allele) in those genes that have been successfully genotyped. The seven genes illustrated in black
have not been genotyped. Imputation analysis can be applied to enable the genotypes of the black genes on
each haplotype to be estimated based on the known pattern of linkage disequilibrium for the surrounding genes
that have been successfully genotyped. To do this, data are extracted from large genotype databases to fill in
the “missing” genotypes. For example, haplotype 1 runs, top to bottom: white,
unknown, unknown, unknown, white, unknown, gray, unknown, white, gray, gray, unknown, unknown (or
B-?-?-?-B-?-A-?-B-A-A-?-?). If we know that persons with this haplotype carry the white (B) alleles at positions
2, 3, 4, and 6 and the gray (A) alleles at positions 8, 12, and 13 in the sequence, then the full sequence for this
haplotype can be assigned. If we use the same alphabetical system (A and B) to name the alleles of the seven
unknown genes, the sequence for the thirteen genes of haplotype 1 would be: B-B-B-B-B-B-A-A-B-A-A-A-A.
Using the known genotypes from positions 1, 5, 7, 9, 10, and 11 we can identify the unknown genotypes for the
other seven genes on the remaining five haplotypes (haplotypes 2 to 6). These imputed data can be included in
the analysis. Because linkage disequilibrium is so strong in certain areas of the human genome, imputation
analysis of initial GWAS data can massively extend the studies. Thus an initial GWAS based on 300,000 to
400,000 SNPs can generate data for up to one million or more markers. (From Marchini J & Howie B [2010]
Nat Rev Genet 11:499–511. With permission from Macmillan Publishers Ltd.)
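The haplotype 1 example in the caption can be written as a minimal sketch. It is a deliberately simplified, exact-match illustration: real imputation software uses probabilistic models of linkage disequilibrium rather than exact string matching, and the function name and the second reference haplotype below are hypothetical.

```python
# Toy imputation by exact matching against a reference panel.
# Assumption: haplotypes are strings of "A"/"B" alleles, with "?" marking
# positions that were not genotyped. Real tools use probabilistic models.

def impute_haplotype(observed, reference_panel):
    """Fill "?" positions using reference haplotypes that agree at typed sites."""
    typed = [i for i, allele in enumerate(observed) if allele != "?"]
    matches = [ref for ref in reference_panel
               if all(ref[i] == observed[i] for i in typed)]
    if not matches:
        return observed  # nothing in the panel is compatible; leave gaps
    filled = []
    for i, allele in enumerate(observed):
        if allele != "?":
            filled.append(allele)
            continue
        candidates = {ref[i] for ref in matches}
        # Only fill the gap when every matching reference agrees.
        filled.append(candidates.pop() if len(candidates) == 1 else "?")
    return "".join(filled)


# Haplotype 1 from the caption, typed at positions 1, 5, 7, 9, 10, and 11.
observed = "B???B?A?BAA??"
# Hypothetical reference panel; the first entry is the caption's full sequence.
panel = ["BBBBBBAABAAAA", "ABABAABBABBAB"]
imputed = impute_haplotype(observed, panel)
```

Here only the first reference haplotype agrees with the six typed positions, so the seven untyped positions are filled from it.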
[Figure 12.11 image. Workflow labels: cross-link and fractionate chromatin; ChIP: enriched DNA binding sites; sequence; binding site mapping.]
Figure 12.11: Illumina ChIP sequencing experiment workflow. Proteins are first cross-linked to DNA in situ; the
chromatin is then fractionated and immunoprecipitated. The enriched DNA is then sequenced using an Illumina
platform to identify genome-wide sites associated with proteins of interest. (Courtesy of Illumina.)