Embc2016 Yts
1 Department of Engineering Science, University of Oxford, UK
2 Department of Mechanical Engineering, Shanghai Jiao Tong University, 200240, China
3 Nuffield Department of Population Health, University of Oxford, UK

Abstract: We set out to use machine learning techniques to analyse ECG data to improve risk evaluation of cardiovascular disease in a very large cohort study of the Chinese population. We performed this investigation by (i) detecting abnormality using 3 one-class classification methods, and (ii) predicting probabilities of normality, arrhythmia, ischemia, and hypertrophy using a multiclass approach.
For one-class classification, we considered 5 possible definitions for normality and used 10 automatically-extracted ECG features along with 4 blood pressure features. The one-class approach was able to identify abnormality with area-under-curve (AUC) 0.83, and with 75.6% accuracy.
For four-class classification, we used 86 features in total, with 72 additional features extracted from the ECG. Accuracy for this four-class classifier reached 75.1%. The methods demonstrated proof-of-principle that cardiac abnormality can be detected using machine learning in a large cohort study.

I. INTRODUCTION
Cardiovascular diseases (CVD) are the leading causes of mortality worldwide and in China [1]. There are large geographical and economic variations of CVD mortality in China [2], suggesting appropriate measures are needed for prevention and effective treatment of the disease. The World Health Organisation advises people at high CVD risk to access early detection and treatment for prevention of CVD [1]. Identifying risks in individuals in the population could help to provide advice to people to improve their lifestyle and help clinicians to discover appropriate treatments for specific conditions to reduce mortality and healthcare expenditure.
Traditional CVD risk factors include smoking, hypercholesterolaemia, hypertension, diabetes, and obesity [2], among many others. Traditional risk factors do not fully explain the risk of CVD in populations. Personalised medicine requires integrating all risk factors to which a person is exposed and then predicting risks for specific diseases, so as to optimise preventative measures for individuals. Examples of research for this purpose include [3], in which 5-year mortality rate was predicted on 20 risk factors from 498,103 participants in UK Biobank using proportional hazard models and Harrell's C-index. With an increasing number of risk factors being identified, and especially with abundant genetic and lifestyle data now available, it can be expected that such an approach will face difficulty as the healthy range of the newly-identified factors is difficult to quantify.
Machine learning has the advantage of estimating the associations between risk factors and diseases without prior knowledge of accurate reference values of the risk factors. This approach has grown popular in risk evaluation and diagnostics for chronic diseases [4]. For example, in CVD, Knuiman et al. have predicted coronary mortality in the Busselton cohort using logistic regression [5]. The electrocardiogram (ECG), being an important measurement of cardiac function that is relatively easy to obtain, is surprisingly rarely used for risk prediction. The aim of this paper is to address the need for risk metrics that include ECG-derived features by analysing CVD risks associated with abnormal ECG, using the China Kadoorie Biobank dataset. This paper describes two risk evaluation tasks: (i) abnormality detection and (ii) prediction of probabilities of normal, arrhythmia, ischemia, and hypertrophy. Since abnormality is relatively rare in this database, novelty detection (the aim of which is to classify an under-sampled abnormal class) is an appropriate approach to address the first task. To address the second task, we build models of normality, arrhythmia, ischemia, and hypertrophy using a multiclass approach.

II. DATASET DESCRIPTION
The China Kadoorie Biobank (CKB) is a prospective cohort study of over 520,000 adults recruited from 10 areas in China during 2004-2008 [6]. Data were collected using questionnaires, anthropometric and physiological measurements were recorded at baseline, and all participants provided a blood sample. Information on cause of death rates was collected from health insurance data and mortality and disease registries. After five years, approximately 25,000 surviving participants were resurveyed with further questionnaires, measurements, and blood collection. We have institutional ethics approval to use the data. Public access to the CKB data can be found at http://www.ckbiobank.org/site/Data+Access.
The data available for our study include:

A. ECG time series
Standard 12-lead ECG (10-s duration, 500 Hz) was recorded on 24,369 participants using a Mortara ELIx50 device in 2013-2014. Also available is a typical cycle from each lead for each participant, which was generated by the device using a proprietary algorithm.

B. ECG Features
The Mortara device provides 10 main features (age, average RR interval, P wave duration, the time point of QRS offset, PR interval, QRS duration, QT duration, P axis, QRS axis, and T axis) which were automatically extracted from the typical cycles for each participant. A schematic
2) Balancing of the Test Sets: To make a fair comparison between the normal criteria C1-C5, which have different class ratios (i.e. balance between normal and abnormal data), we use the accuracy and AUC in balanced sets for model evaluation.
We therefore created a balanced test set (a subset of the unbalanced test set), containing all abnormal test data and the same number of normal data. The training set remained unbalanced. The balanced test sets under criteria C1-C5 contain 1,824, 5,688, 2,452, 3,586, and 688 data points respectively.
3) Generative Kernel Density Estimator: We adapted the model described in [8]. In brief, the normal probability density function was learned from the training set by placing a multivariate Gaussian distribution on each 14-dimensional data point. For ease of computation, we performed k-means clustering to summarise the normal data with 500 cluster centres in the 14-dimensional space. Only the most normal data (i.e., those labelled Normal ECG) were used in clustering. The data likelihood is calculated via:

$$p(x) = \frac{1}{N (2\pi)^{D/2} \sigma^{D}} \sum_{i=1}^{N} e^{-\frac{|x - x_i|^2}{2\sigma^2}} \tag{1}$$

A novelty score, y, is then calculated using equation 2.

$$y(x) = -\log p(x) \tag{2}$$

$$P(C|y) = \frac{P(y|C)\,P(C)}{P(y)} \tag{3}$$

We propose treating this novelty score as a univariate summary of the 14-dimensional data, which may then

equation 4. Similarly the posterior was thresholded at 0.5 for classification.
5) Discriminative Support Vector Machine for one-class classification: To compare with the results of KDE, we also used an SVM. The coefficients $C$ and $\gamma$ (when using a Gaussian kernel) control the flexibility of the separation boundary, and were optimised by a grid search via 5-fold cross-validation on the training set. The classification score was mapped to probabilities and thus the training set posterior $P(C|y)$ was learned.

B. Four-class classification
1) Constructing the training and test sets: We obtained balanced training and test sets by taking all data from the smallest class (Table I) and randomly selecting the same number of data points from each of the other classes to construct the training and test sets for 5-fold cross-validation. For example, the four-class balanced training-and-test set contains 1868 × 4 = 7472 data points. To illustrate the distinctiveness of each class, three-class and two-class classifications of all combinations of the normal, ischemia, arrhythmia, and hypertrophy classes were also performed for comparison, using the same approach to balance classes.
2) Training the 4-class model with Support Vector Regression: A class probability $P(C|x)$ was estimated for each of the classes using support vector regression in a one-vs-all approach; i.e. the regressor i was learned in a training set in which only the class i was labelled 1 and the other classes were labelled 0. The class probability $P(C|x)$ was calculated from the predicted value of the regressor i according to Equation 5. Finally the data point was classified to the class with the highest probability:
|1yi | e i
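The likelihood and novelty score used by the generative kernel density estimator (equations 1 and 2) can be illustrated with a minimal sketch. It assumes isotropic Gaussian kernels of a single width sigma placed on a list of centres, and takes the novelty score to be the negative log-likelihood, so that larger scores mean less typical of the normal data. The function names, the toy 2-D centres, and the value of sigma are hypothetical and are not taken from the study; in the paper the centres would be the 500 k-means centres in 14 dimensions.

```python
import math

def log_density(x, centres, sigma):
    """Log of the Parzen-window density of equation 1 (isotropic Gaussian kernels)."""
    n = len(centres)
    d = len(x)
    # Exponent of each kernel: -|x - x_i|^2 / (2 sigma^2)
    exponents = [
        -sum((a - b) ** 2 for a, b in zip(x, c)) / (2.0 * sigma ** 2)
        for c in centres
    ]
    # Log-sum-exp keeps the sum stable when every kernel value is tiny.
    m = max(exponents)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    # Normalising constant: N (2 pi)^(D/2) sigma^D
    log_norm = math.log(n) + 0.5 * d * math.log(2.0 * math.pi) + d * math.log(sigma)
    return log_sum - log_norm

def novelty_score(x, centres, sigma):
    """Novelty score y(x) = -log p(x), assumed here to follow equation 2."""
    return -log_density(x, centres, sigma)

# Toy 2-D centres standing in for the k-means centres (hypothetical values).
centres = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
sigma = 0.5  # hypothetical kernel width
near = novelty_score((0.2, 0.1), centres, sigma)
far = novelty_score((5.0, 5.0), centres, sigma)
print(near, far)
```

Working in the log domain avoids numerical underflow when a point lies far from every centre, which is exactly the case a novelty detector needs to score reliably.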
TABLE II
AUC AND ACCURACY OF PREDICTING THE 5 NORMAL CRITERIA BY GENERATIVE KDE, DISCRIMINATIVE KDE, AND DISCRIMINATIVE SVM IN THE BALANCED SETS. RESULTS ARE PRESENTED AS THE MEAN ± STANDARD DEVIATION IN 5-FOLD CROSS-VALIDATION.
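As a rough illustration of how accuracy and AUC on a balanced test set might be computed, the sketch below keeps every abnormal test case, draws an equal number of normal cases at random, thresholds the posterior at 0.5, and estimates AUC by pairwise comparison. It assumes the scores are posterior probabilities of the abnormal class and that abnormal cases are the minority; the function names and toy values are hypothetical and do not reproduce the study's data or results.

```python
import random

def balance_test_set(scores, labels, rng):
    """Keep all abnormal cases (label 1) plus an equal-sized random sample of normal cases (label 0)."""
    abnormal = [i for i, l in enumerate(labels) if l == 1]
    normal = [i for i, l in enumerate(labels) if l == 0]
    keep = abnormal + rng.sample(normal, len(abnormal))
    return [scores[i] for i in keep], [labels[i] for i in keep]

def accuracy(scores, labels, threshold=0.5):
    """Fraction of cases classified correctly when the posterior is thresholded."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def auc(scores, labels):
    """AUC as the probability that an abnormal case outscores a normal one (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy posteriors for 3 abnormal and 5 normal test cases (hypothetical values).
scores = [0.9, 0.8, 0.3, 0.1, 0.2, 0.6, 0.4, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
b_scores, b_labels = balance_test_set(scores, labels, random.Random(0))
print(len(b_labels), accuracy(b_scores, b_labels), auc(b_scores, b_labels))
```

Repeating this for each cross-validation fold and summarising with the mean and standard deviation would give figures in the form the caption describes.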