Transparent Decision Support Using Statistical Evidence
by
Andrew Michael Hamilton-Wright
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Doctor of Philosophy
in
Systems Design Engineering
Waterloo, Ontario, Canada, 2005
© Andrew Hamilton-Wright, 2005
I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any
required final revisions, as accepted by my examiners.
I understand that my thesis may be made electronically available to the public.
Abstract
An automatically trained, statistically based, fuzzy inference system that functions as a classifier is
produced. The hybrid system is designed specifically to be used as a decision support system. This hybrid
system has several features which are of direct and immediate utility in the field of decision support, including a mechanism for the discovery of domain knowledge in the form of explanatory rules through the
examination of training data; the evaluation of such rules using a simple probabilistic weighting mechanism; the incorporation of input uncertainty using the vagueness abstraction of fuzzy systems; and the
provision of a strong confidence measure to predict the probability of system failure.
Analysis of the hybrid fuzzy system and its constituent parts allows commentary on the weighting
scheme and performance of the “Pattern Discovery” system on which it is based.
Comparisons against other well known classifiers provide a benchmark of the performance of the
hybrid system as well as insight into the relative strengths and weaknesses of the compared systems when
functioning within continuous and mixed data domains.
Classifier reliability and confidence in each labelling are examined using a selection of synthetic data sets as well as standard real-world examples.
An implementation of the work-flow of the system when used in a decision support context is presented, and the means by which the user interacts with the system is evaluated.
The final system, when measured as a classifier, performs comparably to or better than other classifiers. This provides a robust basis for making suggestions in the context of decision support.
The adaptation of the underlying statistical reasoning made by casting it into a fuzzy inference context
provides a level of transparency which is difficult to match in decision support. The resulting linguistic
support and decision exploration abilities make the system useful in a variety of decision support contexts.
Included in the analysis are case studies of heart and thyroid disease data, both drawn from the University of California, Irvine Machine Learning repository.
Acknowledgements
I’d like to thank Dan Stashuk for support, patience and many enlightening discussions throughout this
work. Without these, this work would be quite different, and have significantly less immediate application.
Tim Doherty has greatly aided my understanding of the clinical environment for which this work is destined, and has thereby significantly assisted in choosing its goal.
Andrew Wong and Yang Wang deserve significant thanks, not only for their ideas regarding the Pattern
Discovery algorithm, but also for the use of some of the software from Pattern Discovery Inc. which was
used to verify the output of the routines developed here.
Hamid Tizhoosh has provided considerable experience and insight into the functioning of the fuzzy
systems upon which this work is based and has introduced me to several new ideas within this arena.
I would like to thank the other members of my committee for their input and suggestions regarding the
final form of this document and the arguments therein; without these ideas, the work would be significantly
poorer.
I would like to sincerely thank my parents, family and friends for their support and encouragement
while pursuing a goal which I know seemed like a crazy one to many people.
Perhaps most significantly however: thanks to Kirsty, for making all this seem like a good idea.
Contents

1 Introduction
    1.0.1 Rationale
  1.1 Decision Support
  1.2 Decision Support
    1.2.1 History
    1.2.2 Medical Decision Support
  1.3 Decision Support Tools
  1.4 Hybrid Pattern Discovery/Fuzzy Inference System
    1.4.1 Outline of Decision Support and its Evaluation
  1.5 Confidence Estimation and System Reliability
  1.6 Summary

I Preliminaries

2 Pattern Recognition and Decision Support
  2.1 Pattern Discovery Overview
  2.2 Pattern Recognition and Classification Techniques
    2.2.1 Considerations Derived from the Data Domain
    2.2.2 Learning System Architectures
  2.3 Reliability
    2.3.1 Overall Classifier Reliability — ROC Analysis
    2.3.2 Input Specific Reliability
  2.4 Summary

3 (Non-Fuzzy) Pattern Discovery
  3.1 Pattern Discovery Algorithm
  3.2 Definitions for Residual Analysis
  3.3 Pattern Identification Using Residual Analysis
  3.4 Classification
    3.4.1 Weight of Evidence Weighted Patterns
    3.4.2 Pattern Absence
  3.5 Quantization: Analysis of Continuous Values
  3.6 Expectation Limits and Data Quantization
  3.7 Summary

II Hybrid System Design and Validation on Synthetic Data

4 Construction of Fuzzy Pattern Discovery
  4.1 Rationale
  4.2 Fuzzification of Input Space Data Divisions
    4.2.1 Creation of Fuzzy Membership Functions
    4.2.2 Fuzzy Membership Function Attributes
  4.3 Use of Pattern Discovery Based Rules
    4.3.1 Mamdani Inference Using WOE Weighted Rules
    4.3.2 Using WOE Directly With Fuzzy Rules
    4.3.3 Occurrence Weighted Fuzzified PD Rules
    4.3.4 Selection of a Classification Label
  4.4 Selection and Firing of PD Generated Rules
    4.4.1 Fuzzy Rule Firing
    4.4.2 Independent Rule Firing
  4.5 Summary

5 Synthetic Class Distribution Data and Analysis Tools
  5.1 Covaried Class Distributions
  5.2 Covaried Log-Normal Class Distributions
  5.3 Covaried Bimodal Class Distributions
  5.4 Spiral Class Distributions
  5.5 Training Error
  5.6 Jackknife Data Set Construction
  5.7 Classifiers Used in Comparison
    5.7.1 Back-Propagation Classifier Construction
    5.7.2 Minimum Inter-Class Distance Classifier Construction
  5.8 Weighted Performance Measure
  5.9 Summary

6 Synthetic Data Analysis of Pattern Discovery
  6.1 Covaried Class Distribution Results
  6.2 Covaried Log-Normal Class Distribution Results
  6.3 Covaried Bimodal Class Distribution Results
  6.4 Spiral Class Distribution Results
  6.5 Discussion
    6.5.1 Covaried Class Distributions
    6.5.2 Log-Normal Class Distributions
    6.5.3 Covaried Bimodal Class Distributions
    6.5.4 Spiral Class Distributions
    6.5.5 Overall Performance Analysis
  6.6 Conclusion
  6.7 Summary

7 Synthetic Data Analysis of Fuzzy Inference System
  7.1 Covaried Data Results
  7.2 Bimodal Data Results
  7.3 Spiral Data Results
  7.4 Discussion
    7.4.1 Performance Across Class Distributions
    7.4.2 Crisp versus Fuzzy
    7.4.3 Performance Across Weighting Schemes
    7.4.4 Performance Based on Training Set Size
    7.4.5 Cost of Quantization
  7.5 Conclusions
  7.6 Summary

III Real-World Data and Decision Support

8 Analysis of UCI Data
  8.1 Analysis of Thyroid Disease Data
    8.1.1 UCI Thyroid Data Preparation
    8.1.2 Results
    8.1.3 Discussion
  8.2 Analysis of Heart Disease Data
    8.2.1 Choice of Heart Disease Database
    8.2.2 Heart Disease Database Features
    8.2.3 Overall Statistics from PD Analysis
    8.2.4 Pattern Discovery Generated Rules
    8.2.5 Analysis of Performance
    8.2.6 Discussion
  8.3 Conclusions from UCI Data Analysis

9 Confidence and Reliability
  9.1 Confidence and Reliability
  9.2 Implementation of Confidence in Hybrid System
    9.2.1 Construction of τ and δ Measures
    9.2.2 Evaluating Confidence Through δ and τ
    9.2.3 Candidate Confidence Measures
    9.2.4 Confidence Measure Evaluation Methodology
    9.2.5 Confidence Measure Evaluation Results
  9.3 Discussion
  9.4 Conclusion
  9.5 Summary

10 Decision Support and Exploration
  10.1 Design
  10.2 Evaluation
    10.2.1 Unimodal, 2-Feature Covaried Data
    10.2.2 Unimodal, 2-Feature Covaried Classification With High Confidence
    10.2.3 Unimodal, 2-Feature Covaried Classification With Low Confidence
    10.2.4 Heart Disease Data
    10.2.5 Heart Disease Classification With High Confidence
    10.2.6 Heart Disease Classification With Conflict
    10.2.7 Interactive Exploration — Brushing and Selection
    10.2.8 “What If?” Decision Exploration
  10.3 Discussion
    10.3.1 Transparency
    10.3.2 Confidence
  10.4 Conclusions

11 Conclusions
  11.1 Transparency
  11.2 Performance
  11.3 Confidence
  11.4 Decision Exploration
  11.5 Quantization Costs
  11.6 Future Work

IV Appendices

A Mathematical Notation
  A.1 Greek Letter Variables
  A.2 Roman Letter Variables

B Derivation of PD Confidence

C Conditional Probabilities Derived From Synthetic Data
  C.1 Confidence and Conditional Probability
  C.2 Conditional Probability As Calculated Using z-Scores
  C.3 Calculating z-Scores from Data Points

D Further Tables Regarding Reliability Statistics
  D.1 Symmetric Uncertainty
  D.2 Mutual Information
  D.3 Interdependency Redundancy
  D.4 Discussion

E Statistical Measure Performance Under Quantization
  E.1 Performance Tables
    E.1.1 Entropy Statistics Versus Spearmann Ranking
    E.1.2 Choice of Binning Mechanism for Entropy Summary Statistics
    E.1.3 Choice of Q for Entropy Summary Statistics
  E.2 Quantization Figures
List of Tables

2.1 Mushroom Example Data Set
4.1 Mamdani Output Values for WOE
5.1 Covariance Matrices for 4 Classes
6.1 Weighted Performance: Covaried Class Distribution
6.2 Weighted Performance: Log-Normal Class Distribution
6.3 Weighted Performance: Bimodal Class Distribution
6.4 Weighted Performance: Spiral Class Distribution
7.1 Covaried Class Distribution Summary
7.2 Bimodal Class Distribution Summary
7.3 4-Feature Spiral Class Distribution Summary
7.4 Spiral Class Distribution Unclassified Records
8.1 UCI Thyroid Data Record Counts
8.2 6-Feature Thyroid Data Correct Classification
8.3 UCI “Hungarian Heart Disease” Database Feature Names
8.4 Rule Order for Hungarian Heart Disease Data
8.5 Heart Disease Data Correct Classification
9.1 Conditional Probability vs. Reported Confidence Spearmann-Ranked Comparison
10.1 Heart Disease Classification High Confidence Data Record
10.2 Heart Disease Classification Conflict Data Record
D.1 Mutual Information Comparison Summary: Equal Binning, Q=10
D.2 Symmetric Uncertainty Comparison Summary: Equal Binning, Q=10
D.3 Interdependency Redundancy Comparison Summary: Equal Binning, Q=10
D.4 Confidence Correlation Comparison Summary
E.1 Mutual Information of Equally Binned Uniform Data (1000 Points)
E.2 Spearmann Ranking of Equally Binned Uniform Data (1000 Points)
E.3 Symmetric Uncertainty of Equally Binned Uniform Data (1000 Points)
E.4 Mutual Information of MME Binned Uniform Data (1000 Points)
E.5 Spearmann Ranking of MME Binned Uniform Data (1000 Points)
E.6 Symmetric Uncertainty of MME Binned Uniform Data (1000 Points)
List of Figures

1.1 Data Flow Through Hybrid System
2.1 A Three-Dimensional Hyper-Cell
2.2 Pattern Discovery Rule Extraction Based on Uniform Random Null Hypothesis
2.3 A Back-Propagation Neural Network
2.4 Typical ART Neural Network
2.5 Exemplar Matching in ART
2.6 Tree Characterizing the Mushroom Data Set
2.7 Rough Set Example
2.8 Specificity and Sensitivity
2.9 Example ROC Curve
3.1 MME Partitioning
3.2 Example MME Division of 2-Class Unimodal Normal Data
4.1 Fuzzy Mapping of MME Partitioning
5.1 Bimodal Data Points
5.2 Spiral Data Points
6.1 Covaried Data N=1000 Results
6.2 Covaried Data N=10,000 Results
6.3 Log-Normal Data N=1000 Results
6.4 Covaried Bimodal Data N=1000 Results
7.1 Covaried Class Distribution Results
7.2 Bimodal Class Distribution Results
7.3 4-Feature Spiral Class Distribution Results
9.1 δ/τ Histograms from Covaried Data, s=0.125
9.2 δ/τ Histograms from Covaried Data, s=4.0
9.3 C_MICD Plots on 4-Feature Covaried Data
9.4 C_MICD Plots on 4-Feature Bimodal Data
9.5 C_MICD Plots on 4-Feature Spiral Data
9.6 C_δ/τ Probability Plots on 4-Feature Covaried Data
9.7 C_δ/τ Probability Plots on 4-Feature Bimodal Data
9.8 C_δ/τ Probability Plots on 4-Feature Spiral Data
9.9 C_τ[0...1] Plots on 4-Feature Covaried Data
9.10 C_τ[0...1] Plots on 4-Feature Bimodal Data
9.11 C_τ[0...1] Plots on 4-Feature Spiral Data
9.12 C_PD-Probabilistic Plots on 4-Feature Covaried Data
9.13 C_PD-Probabilistic Plots on 4-Feature Bimodal Data
9.14 C_PD-Probabilistic Plots on 4-Feature Spiral Data
10.1 Hierarchical Design Encourages Drill-Down Exploration
10.2 MME Division of 2-Class, 2-Feature Covaried Data
10.3 δ/τ Histograms from Covaried 2-Feature Data
10.4 Covaried Classification With High Confidence Summary
10.5 Covaried Classification With High Confidence Rule Set
10.6 Membership Values For 2-Feature Simple Classification Example
10.7 Covaried Classification With Low Confidence Summary
10.8 Covaried Classification With Low Confidence Rule Set
10.9 Membership Values For 2-Feature Simple Classification Example
10.10 δ/τ Histograms from Heart Disease Data
10.11 Heart Disease Classification High Confidence Summary Display
10.12 Heart Disease Classification High Confidence Rule Set
10.13 Heart Disease Classification High Confidence Membership Functions
10.14 Heart Disease Classification Conflict Summary Display
10.15 Heart Disease Classification Conflict Rule Set
10.16 Covaried Classification With High Confidence Summary - Exploration
C.1 Logically Constructing Conditional Probability from PDFs
C.2 Covaried Conditional Probability Surfaces
C.3 Bimodal Conditional Probability Surfaces
C.4 Spiral Conditional Probability Surfaces
E.1 Raw Uniform Data (1000 Points)
E.2 Q=2 Equally Binned Uniform Data (1000 Points)
E.3 Q=5 Equally Binned Uniform Data (1000 Points)
E.4 Q=8 Equally Binned Uniform Data (1000 Points)
E.5 Q=10 Equally Binned Uniform Data (1000 Points)
E.6 Q=15 Equally Binned Uniform Data (1000 Points)
E.7 Q=20 Equally Binned Uniform Data (1000 Points)
E.8 Q=2 MME Binned Uniform Data (1000 Points)
E.9 Q=5 MME Binned Uniform Data (1000 Points)
E.10 Q=8 MME Binned Uniform Data (1000 Points)
E.11 Q=10 MME Binned Uniform Data (1000 Points)
E.12 Q=15 MME Binned Uniform Data (1000 Points)
E.13 Q=20 MME Binned Uniform Data (1000 Points)
Chapter 1
Introduction
The more I learn, the more I realize I don’t know. The more I realize I don’t know, the more I
want to learn.
— Albert Einstein
The pattern extraction techniques of the “Pattern Discovery” (PD) algorithm developed by Wang and Wong (Wang, 1997; Wong and Wang, 1997) are extended into the fuzzy inference domain to form a decision support system∗ (DSS).
The classification performance of the resulting fuzzy hybrid system is higher than that of PD alone, and a metric characterizing the confidence in each decision can be formed, producing a decision support system that allows a transparent explanation of the decision process.
1.0.1 Rationale
Using a fuzzy inference system (FIS) as the basis of a decision support tool allows transparent decisions to be suggested using a linguistic framework based on sound statistical data. Such a system may be used with a wide spectrum of data types, from finance to resource management; however, the application of immediate interest to the author is clinical diagnostic support. For this reason, real-world data
from two bio-medical domains are used in the enclosed analysis, and the presentation of the final system
centres on the discussion of heart disease data acquired from the well-known machine learning repository
∗
An index is provided at the end of the manuscript linking all key terms to their locations in the text.
at the University of California, Irvine.†
This document describes the production, function and performance of an FIS based DSS which uses
adapted rules extracted using the PD algorithm. The system and its evaluation tools have been written in
the C++ and Python programming languages by the author, with some dependency on the GNU Scientific
Library (Galassi, Davies et al., 2005) and the routines in LAPACK (Anderson, Bai et al., 1999). All design
and implementation efforts outside of these libraries have been the author’s own.
1.1 Decision Support
Decision support systems have been studied for more than twenty years, with application areas as diverse as finance, emergency response, environmental management and many medical applications.
Underlying all work in decision support is an interest in the management and reporting of the estimated
reliability of the decision, measured by the probability of successfully indicating the correct label (see
Larsson, Hayes-Roth et al., 1997; López de Mántaras, 1991; Aha, Kibler and Albert, 1991; Cordella,
Foggia et al., 1999; Levitin, 2002, 2003; Gurov, 2004, 2005). Without a measure of the reliability of
a system, the suggested analysis is useless, as graceful failure cannot be assured in the face of variable
quality data and inference. It is therefore critical that the system presented in this work is evaluated in
terms of the reliability of the decisions. Reliability measurement and system confidence prediction will be
presented in Chapter 2, and the quality of the confidence measure used within the hybrid system described
in this work will be discussed in depth in Chapter 9.
1.2 Decision Support
Decision support is a difficult field to define. In order to avoid spending many pages attempting to produce
a definition, we can content ourselves with that provided by Silver (1991, p. 13):
A decision support system is a computer-based information system that supports people engaged in decision-making activities.
In this context, “support” is intended to mean that
†
The famous UCI machine learning repository contains “real-world” data used for comparing techniques within the machine
learning community. It is available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html.
1. the system assists human decision makers to exercise judgement—that is, the system is
an aid for the person or persons making the decision; and
2. the system does not make the decision—that is, the system helps decision makers exercise judgement but does not replace the human decision makers.
Both of these points are crucial to the design of the hybrid system described in this work.
Silver’s first point indicates that the system design is intended to ease the making of a decision; this
implies that the format of the decision support data must be driven by the need for a human being to easily
comprehend the suggestions, rapidly understand their import and assimilate the data with the possibly many other data elements at their disposal at the time the decision is to be made.
For example, in heart disease, there are a large number of symptoms and markers for disease which
must be taken into account for any single patient. Importantly, these markers change depending on a
patient’s age and other factors. It would be reasonable then for a DSS to correlate and summarize the
patient’s data, highlighting the strongest and most informative markers. Data which is not relevant to the
decision will be shown in only a cursory way, or not at all. By following such a methodology, a decision
making user is given the information they need to make an informed decision without the tedium involved
in a manual collation and exploration. In this way, a software tool can support decision making.
The second point speaks to the important design principle that the decision support system is not
authoritative. At all times it must be kept in mind that any decision support system cannot function as a
black box, but must instead be as transparent and interpretable as possible, assisting in the formation of a
judgement. As part of this transparency, it is therefore important that a user may explore alternate decision
paths in order to come to a comfortable, informed decision of their own making, evaluating the data from
any particular source within the DSS in context with the data available from all other sources.
1.2.1 History
The field of decision support began with the needs of Management Information Systems (MIS) in the
1970s. Several works (Morton, 1971; Sprague and Carlsons, 1982; House, 1983; Schniederjans, 1987)
describe the then growing need for middle management to have access to computer based systems to
interpret the ever larger quantities of data. Previously, data for management decision making was available
in (daily or weekly) printed reports, which were produced and collated with the expectation that the reader
would absorb the presented data with enough depth of understanding to form reasoned and informed
opinions on the contents.
The subsequent expansion of both business and its management along with the now familiar “information explosion” drove the need for a new tool which could condense a great deal of reporting data and
characterize it in terms of possible courses of action. Using such a tool, the data previously available in the
reports would be linked to a course of action in terms of degrees of support for that particular course. A
manager using this type of tool could then construct a business plan, taking into account the recommended
actions suggested by the DSS.
Once a tool of this type was created, it was obvious that its applicability extended far beyond the role of middle-management supply-line and financial decision making in a mid-size company.
Current common areas of application of decision support systems include:
Conflict Resolution and Generalized Decision Making: a field which produces general tools using the
concepts of decision support. These tools are as general in approach and therefore as universal
in application as possible (Hipel, Fang and Kilgour, 1993; Kilgour, Fang and Hipel, 1995; Rajabi,
Kilgour and Hipel, 1998; Sage and Rouse, 1999; Hipel, Kilgour et al., 2001).
Environmental Management: including mapping and management of toxins (Booty, Lam et al., 1997)
and watershed management (Hipel, Yin and Kilgour, 1995; León, Lam et al., 1997; Young, Lam
et al., 1997; Yyrdusev, 1997; Hipel and Ben-Haim, 1999). This area has a great deal of ongoing
research, as shown by recent workshops (Cortés and Sànchez-Marrè, 1999; Cortés, Sànchez-Marrè
and Wotawa, 2003).
Medical Decision Making and Disease Characterization: a field in which common areas of application are laboratory data management (Cowan, 2003), patient monitoring (Gibb, Auslander and Griffin, 1994; de Graaf, van den Eijkel et al., 1997; Abu-Hanna and de Keizer, 2003; Montani, Magni
et al., 2003), public health (O’ Carroll, Yasnoff et al., 2002) and disease characterization or evaluation of prognoses (Shortliffe and Perreault, 1990; Friedland, 1998; Kukar, Kononenko et al., 1999;
Innocent, 2000a,b; Colombet, Dart et al., 2003; Coiera, 2003). A new and growing area in which
this technology is finding application is in modelling outbreaks, providing a recognition tool for
fast response (Penaloza and Welch, 1997; Zhang, Fiedler and Popovich, 2004; Brillman, Burr et al.,
2005; Costa, Dunyak and Mohtashemi, 2005; Devadoss, Pan and Singh, 2005; Guthrie, Stacey and
Calvert, 2005; Majowicz and Stacey, 2005).
Business Management Information Systems: this is still a major application area for decision support,
and many texts provide discussion of this area of application (see any of Scott, Claton and Gibson,
1991; Adelman, 1992; Sage, 1991).
1.2.2 Medical Decision Support
Medical decision support literature ranges from discussions of training physicians in the underlying data tools needed to understand decision support (Harris and Boyd, 1995; Kononenko, 2001; Kukar, 2003; Bennett, Casebeer et al., 2005) through handbooks assisting in the construction and design of decision support systems (Berner, 1988; Keller and Trendelenburg, 1989; López de Mántaras, 1991; Larsson et al., 1997).
The literature on medical decision support devotes more time than the management literature to a discussion of the reliability and desired confidence in the decision. Most of the decisions made by clinicians
are binary (two-outcome) tests for the presence or absence of a particular condition. The discussions regarding these tests are couched in the terminology of Receiver Operating Characteristic (ROC) curves‡
rather than the probability-of-failure measure common in other approaches.
1.3 Decision Support Tools
Any tool meant to aid a decision maker adds information to the decision process. Care must be taken
to decrease the cognitive load while increasing understanding. The human user must remain the final
decision maker, integrating the information supplied from several channels.
The decision making user:
• retains responsibility for any decision — in the medical community (among others) there are ethical,
legal and trust issues at stake;
• always has access to a higher level view of the data, frequently including data from other sources;
• is an expert in their field and wants to correlate their knowledge and hypotheses with analytical
results from this and other systems.
Any system that attempts to “replace” the decision maker or to “take over” any part of the analytical
process will most likely be met with distrust and will not be used. From an ethical standpoint, such a
system is inadmissible in any decision making arena.
The objective of a DSS tool is to augment the capabilities of the decision maker by providing an
automated means of integrating facts and correlating measurements which are otherwise difficult or time
consuming to asses. The results of this automated process are available as a condensed logical suggestion
which can be incorporated into a larger context, along with any other sources of information available.
‡
An overview of the construction and utility of ROC curves is provided in Section 2.3.1 in Chapter 2.
If the results of a decision system are to be combined in a larger scope, the decision support tool
cannot be a “black box” as described in Wiener (1948, 1961, pp. xi).
A useful decision support tool must therefore exhibit all of the following attributes:
transparency: if a decision maker cannot determine on what grounds an automated characterization is
being suggested, they will rightly not trust the conclusions of such a system. It is critical that at all
levels the decisions produced by any DSS may be easily accessed and exposed to analysis.
speed: the suggested characterization must be produced on a reasonable time-scale. If, for example in the
medical domain, the system is too slow, then the results will be irrelevant, as the examination will
be over and the patient will have left, preventing any iterative analysis. Correlation of the suggested
results with other data must happen during the decision process.
graceful degradation: if the system must fail, this must occur with grace, indicating an increasing possibility of decision failure as a greater chance of error is encountered.
conservatism: as the degree of aggressiveness or conservatism of a classifier relates to the balancing of
different types of error (false negative versus false positive), the means of choosing this balance
lies in the mechanism used to integrate the automated suggestion with other data. The system must
therefore support analysis of the decision confidence, providing a means to separate likely errors
from quality decisions.
simplicity of use: all DSS tools function by enriching the decision environment. It is therefore easy to
overwhelm the decision maker, providing so much information that the decision process is made
more difficult. This effect is well described by Shortliffe and Perreault (1990) and by Kononenko
(2001).
Notably missing from the above list is “optimal performance as a classifier,” as all of the above factors
must be taken into account in DSS design. While the frequency at which a DSS suggests the correct course
of action is important, the transparency of the system outweighs the need for an optimal classifier. Any
classifier with suboptimal performance, but “white box” transparency will be superior to a optimal “black
box” classifier, assuming that a quality measure is also provided to give an estimate of the “white box”
system reliability.
What is desired is a simple, transparent system which will allow the decision maker to easily see the
decision confidence, while providing a means to “drill down” through the decision process to inspect each
phase of the decision construction.
1.4 Hybrid Pattern Discovery/Fuzzy Inference System
For this work, a hybrid approach has been selected, using a combination of methods derived from fuzzy
inference systems, and from the “Pattern Discovery” (PD) algorithm.
The work described here uses a rule framework generated using the PD algorithm, adapting this framework to function within the context of fuzzy inference. The results produced by fuzzy inference are then
presented as a means of supporting each of several possible decision outcomes in the context of their supporting data.
The PD algorithm described in Wang (1997); Wong and Wang (1997, 2003) and in Wang and Wong
(2003) is an inherently probabilistic, unsupervised learning algorithm for discrete-valued data capable of
discovering polythetic patterns without exhaustive search.
These patterns are discovered through analysis of labelled training data using a contingency table
to isolate true patterns from background noise, which can then be used to fill in any missing values in
new data samples; treating a single column as a “label” column allows the PD algorithm to function as a
supervised learning classifier. A description of this algorithm is provided in Chapter 3.
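As an illustration of this counting-based view (a minimal sketch over assumed toy data, not the implementation developed in this work), treating one column of a discrete data set as the label reduces classification to a comparison of joint occurrence counts:

```python
# Sketch only: supervised classification from joint occurrence counts when one
# column of a discrete data set is treated as the "label" column.  All names
# and data here are hypothetical; PD additionally tests each count for
# statistical significance (Chapter 3), which this sketch omits.
from collections import Counter

def build_table(records, label_col):
    """Count joint occurrences of (feature-value tuple, label) events."""
    table = Counter()
    for rec in records:
        features = tuple(v for i, v in enumerate(rec) if i != label_col)
        table[(features, rec[label_col])] += 1
    return table

def classify(table, features, labels):
    """Suggest the label with the highest joint count for this feature event."""
    counts = {l: table[(features, l)] for l in labels}
    return max(counts, key=counts.get), counts

data = [(0, 1, 'A'), (0, 1, 'A'), (1, 0, 'B'), (1, 1, 'B'), (0, 1, 'A')]
table = build_table(data, label_col=2)
print(classify(table, (0, 1), labels=['A', 'B']))  # ('A', {'A': 3, 'B': 0})
```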
The PD algorithm was developed to deal with discrete (ordinal and nominal) data; a preliminary investigation in this work evaluates the performance of the PD algorithm when extended, through data quantization, to function in a continuous data domain. Such quantization always comes with a cost, as the
fine-grained detail that may be present in the underlying process is masked by the application of relatively
coarse quantization intervals.
The extension and recasting of the quantized data into an FIS will allow some of the cost of quantization to be reduced, as the bin boundaries can be softened and the artificial nature of the crisp quantization
bound can be diminished. The construction of the FIS is provided in Chapter 4 in which adaptations are
made to the PD rules in order to improve their classification performance.
The resulting PD/FIS system is then evaluated in a decision support context after a discussion of the
production of a confidence measure which estimates the reliability of each suggestion produced.
1.4.1 Outline of Decision Support and its Evaluation
Figure 1.1 indicates the data flow in the system. Training data records are presented to the PD algorithm. Maximum marginal entropy (MME), as described in Gokhale (1999); Chau (2001), is used as a
discretization mechanism, producing a set of crisp events and allowing the basic PD algorithm to function
in a continuous data domain.
Figure 1.1: Data Flow Through Hybrid System
The events based on these crisp quantization intervals are explored by the PD algorithm, generating
a set of “patterns” (or rules). A fuzzified version of the quantization intervals, the rules produced through PD and a new weighting scheme (which improves system performance) are together used as the basis of an FIS. This is described in Chapter 4.
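As a concrete (and purely illustrative) picture of what such fuzzification can look like, the sketch below softens crisp bin edges into overlapping trapezoidal membership functions. The overlap fraction and the trapezoidal shape are assumptions made here for illustration; the membership function construction actually used is developed in Chapter 4.

```python
# Illustrative only: one plausible way of softening crisp quantization bin
# edges into overlapping trapezoidal fuzzy membership functions.  The overlap
# fraction is an assumed parameter, not taken from this work.
def trapezoid(x, a, b, c, d):
    """Standard trapezoidal membership: 0 below a, ramps to 1 on [b, c], 0 above d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def fuzzify_bins(edges, overlap=0.25):
    """Turn crisp bin edges into trapezoids whose shoulders span neighbouring bins."""
    mfs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        w = (hi - lo) * overlap
        mfs.append(lambda x, a=lo - w, b=lo + w, c=hi - w, d=hi + w:
                   trapezoid(x, a, b, c, d))
    return mfs

mfs = fuzzify_bins([0.0, 1.0, 2.0, 3.0])
# A value near a crisp edge now belongs partly to two neighbouring bins:
print([round(mf(0.95), 2) for mf in mfs])  # [0.6, 0.4, 0.0]
```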
The performance of the resulting FIS and PD systems is compared using several synthetic and real
examples.
A number of synthetic data type distributions used for comparison are described in Chapter 5. The
analysis of the performance of the systems on these distributions is presented in Chapters 6 and 7 for the
continuous-valued PD algorithm and the fuzzy system respectively.
Analysis of the system on real-world data is performed using clinical thyroid and heart disease data
found at the UCI (University of California, Irvine) Machine Learning Repository (Newman, Hettich et al.,
1998).§ A description of these data sets and the analysis and discussion of the hybrid classifier performance is provided in Chapter 8.
§
i.e., http://www.ics.uci.edu/∼mlearn/MLRepository.html
1.5 Confidence Estimation and System Reliability
The FIS based classifier provides degrees of support for multiple output classes. Using these degrees
of support, a confidence measure is created that predicts the probability of classifying an input record
correctly, based on a certainty-type measure relating the internal decision consistency or conflict with the
probability of having produced an erroneous suggestion. This confidence scheme provides an indication of the probability of failure, and can therefore be seen as a predictor of system reliability.
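The τ and δ measures themselves are constructed in Chapter 9; as a simplified sketch of the general idea only, a certainty-style confidence can be computed from the margin between the best- and second-best-supported labels:

```python
# Simplified certainty-style confidence sketch, NOT the tau/delta measures
# defined in Chapter 9: confidence grows with the margin between the best-
# and second-best-supported labels, normalized by the best support.
def support_margin_confidence(supports):
    """supports: dict mapping each label to its fuzzy degree of support."""
    ranked = sorted(supports.values(), reverse=True)
    best, runner_up = ranked[0], ranked[1] if len(ranked) > 1 else 0.0
    if best <= 0.0:
        return 0.0          # no rule support at all: no basis for confidence
    return (best - runner_up) / best  # 1.0 = unopposed, 0.0 = full conflict

print(support_margin_confidence({'healthy': 0.80, 'diseased': 0.15}))  # ~0.81
print(support_margin_confidence({'healthy': 0.45, 'diseased': 0.44}))  # ~0.02
```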
A discussion of the calculation and use of reliability measures is presented in Chapter 9, along with a
discussion of classifier and inference reliability and confidence analysis as implemented in the FIS.
Finally, the use of the PD/FIS as a decision support system is analyzed in Chapter 10, and the overall
conclusions and recommendations for future work are found in Chapter 11.
1.6 Summary
The purpose of this work is to introduce and describe a means to extract knowledge from training data
for use in decision support. The primary motivation behind the data analysis and knowledge discovery
process is to use the resulting knowledge base in the context of supporting complex, high risk decisions.
The resulting tool must therefore be a decision support system of the highest quality.
A high quality decision support system must transparently perform two actions:
1. provide a correct labelling (i.e., classify input data) with at least reasonable frequency; and
2. report a confidence in the suggested labelling that truly measures the probability of an inaccurate
suggestion.
This implies that for decision support, a good classifier is required as a basis; this need not, however, be
the “best” possible classifier if the “best” classifier is a “black box”. A suboptimal, “white box” classifier
exhibiting a good confidence measure is significantly more useful than an optimal “black box”. Such a
“white box” system is more trustworthy, and thereby more reliable than an “optimal” classifier whose
function cannot be understood in the context of a human user making a larger decision.
Part I
Preliminaries
Chapter 2
Pattern Recognition and Decision Support
Discovery is to see what everybody else has seen, and think what nobody else has thought.
— Albert von Szent-Gyorgi
The work described here involves the development of a decision support system based on a classifier.
For this reason, it is important to create a context of surrounding work in both decision support and
classification to which this system relates.
The most important relationship is the background and description of the PD system itself. Due to the
importance of this topic, an in-depth discussion will be left to Chapter 3, which will provide the algorithm
and describe its use and relevant theory. To provide a background for the discussion of other methods,
however, this chapter will introduce the general form of the PD system prior to the comparative discussion.
Once this background is provided, the reader is given an overview of various types of classification
systems which can be used for decision support. The relative merits of each system are discussed in
relation to a PD based design. This provides a context for the motivation for using the PD based FIS
described in this work.
The last section of this chapter will outline the past use of reliability metrics within decision support,
and discuss how decision reliability is managed and measured in other systems.
2.1 Pattern Discovery Overview
In general, the PD system introduced by Wang (1997) was designed to locate and describe statistically
significant patterns observed in discrete (integer, ordinal or nominal) training data.
Figure 2.1: A Three-Dimensional Hyper-Cell
In order to function in a continuous domain, this work presents the PD pattern extraction algorithm
with discretized data constructed via independent analysis of the density of the feature values along each
dimension using marginal maximum entropy (MME) partitioning (Gokhale, 1999; Chau, 2001).
The PD algorithm then constructs a contingency table from these discretized training values to create
an event-based (hyper-cell) partitioning of the input space, as shown in Figure 2.1. In this figure, a
three-dimensional hyper-cell is shown as the intersection of the space defined by three cells on their
respective axes. The hyper-cell is the discretely bounded space associated with unique quanta along (in
this case) the axes x, y and z. A four dimensional hyper-cell would simply include a similar value along
the axis w. Hyper-cells, which form first-order events, are related to each other by rules extracted through
mutual occurrence of observed values. Potential rules must pass a test of statistical rigour in order to
be considered, preventing patterns describing correlations due to random noise from being considered as
rules.
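A brief sketch may make the quantization step concrete. Assuming that MME partitioning places bin boundaries at marginal quantiles, so that each interval along an axis holds a roughly equal share of the training observations, continuous records map onto hyper-cell coordinates as follows (the exact partitioning used in this work is described in Chapter 3):

```python
# Sketch of marginal-quantile (MME-style) partitioning: along each feature
# axis, bin boundaries are placed so each interval holds roughly the same
# number of training observations (maximizing the entropy of the marginal
# bin distribution).  Records then map to discrete hyper-cell coordinates.
import numpy as np

def mme_edges(column, q):
    """Quantile-based edges giving q roughly equal-occupancy bins."""
    return np.quantile(column, np.linspace(0.0, 1.0, q + 1))

def to_hyper_cell(record, edges_per_axis):
    """Map one continuous record to its tuple of per-axis bin indices."""
    return tuple(
        int(np.clip(np.searchsorted(edges[1:-1], value, side='right'),
                    0, len(edges) - 2))
        for value, edges in zip(record, edges_per_axis)
    )

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))                       # 3 continuous features
edges = [mme_edges(data[:, i], q=4) for i in range(3)]  # Q=4 bins per axis
print(to_hyper_cell(data[0], edges))                    # e.g. (0, 3, 2)
```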
Classification decisions are made using a nonlinear-weighted, information-theory based estimation of
the relative likelihoods of each possible labelling. These estimates are calculated by using the set of rules
triggered by matching input values.
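As an illustration of the flavour of this weighting (the exact weight-of-evidence formulation used by PD is given in Chapter 3), a pattern's support for a label can be expressed as a log odds ratio estimated from occurrence counts:

```python
# Illustration of a weight-of-evidence style rule weight, assuming (as in the
# weight-of-evidence literature) that a pattern x supports a label l in
# proportion to ln[ P(x | l) / P(x | not-l) ].  The exact weighting used in
# this work is defined in Chapter 3.
import math

def weight_of_evidence(n_x_and_l, n_l, n_x_and_not_l, n_not_l):
    """Log odds ratio of observing pattern x under label l versus any other label."""
    p_x_given_l = n_x_and_l / n_l
    p_x_given_not_l = n_x_and_not_l / n_not_l
    return math.log(p_x_given_l / p_x_given_not_l)

# Pattern seen in 40 of 100 class-l records but only 5 of 200 others:
print(round(weight_of_evidence(40, 100, 5, 200), 3))  # 2.773: supports l
```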
Relative degrees of likelihood are calculated for each label, relating each possible choice within a
spectrum running from total support through complete refutation, incorporating both positive and negative
logic rules to describe the relationships present in the data. The existence of negative logic rules allows the
characterization of known negative relationships, as well as providing context to the degrees of positive relationship.

Figure 2.2: Pattern Discovery Rule Extraction Based on Uniform Random Null Hypothesis
While the theory behind the pattern extraction algorithm rests on analysis of entropy, the rationale behind each decision can be discovered by comparing the observed probabilities of matching data occurring
in each hyper-cell.
An example is shown in Figure 2.2 where a 3 × 2 grid is populated with data from three classes, “A”,
“B” and “C”. There are 6 instances of each of the classes. The null hypothesis therefore states that the
number of occurrences should be close to 1 in each cell (ignoring the effects of variance in this very tiny
example). Each of the cells in this example has been constructed to demonstrate a particular facet of the
PD hypothesis testing scheme.
Beginning in the top-left corner, we see a cell for which the observation (1 occurrence of each class)
matches the hypothesis. For this cell, no patterns will be recorded as the model perfectly predicts this data.
The top-centre cell holds twice the number of predicted occurrences of each class. For this cell,
patterns will be produced for all classes; each pattern will indicate that there is a significant association
of its label with this cell. As each pattern predicts the same number of occurrences, all the weights will be identical. These patterns will therefore record knowledge about the data without being useful for classification purposes, as no class-discriminating information is present. Note that this is quite
different from the first cell, where there was no information present at all; the fact that this cell forms a
data-dense region may be of interest to a user even though there is no decision-specific information.
Moving to the top-right cell, we see a single occurrence of class “C”, again the value predicted by
the model. There will therefore be no pattern referencing class “C” constructed for this cell. Classes “A”
and “B”, however, both differ from the model; class “A” by a negative deviation, and class “B” by a quite
strong positive one. Class “A” will therefore have a negative rule indicating that it will not likely appear
here, while class “B” will have a strongly weighted positive rule indicating it is a likely occurrence.
Continuing across the bottom row of Figure 2.2, the two left-most cells will have positive logic rules
for classes “A” and “C” respectively. In these cells there will be negative logic rules for each class which
is not observed.
The bottom-right cell contains no data, indicating that classes “A”, “B” and “C” will all have a negative
association with this cell. Again, this is not useful for classification purposes, as based on this training
data we have no knowledge of a most-likely labelling for this cell. What is present is the strong knowledge
that data in this cell is rare; this knowledge is valuable to a user, as any data which may appear in this cell after training is of the utmost interest, even if the PD algorithm cannot suggest a possible labelling. If
this strong knowledge is contrasted with the weak knowledge found in the top-left cell, it is apparent that
direct knowledge of event rarity is quite different from the knowledge of random occurrence, as a random
occurrence simply indicates that there is no information present.
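The significance test that separates such observations from noise can be sketched numerically. Assuming the common standardized-residual form r = (o − e)/√e and a normal-distribution threshold near 1.96 (the adjusted-residual definitions actually used appear in Chapter 3), the sketch below also shows why variance must be ignored in this tiny example: counts this small cannot reach real significance.

```python
# Hedged sketch of the kind of residual test used to separate true patterns
# from noise (exact definitions appear in Chapter 3): the standardized
# residual (observed - expected) / sqrt(expected) is compared against a
# normal-distribution threshold such as 1.96 for 95% significance.
import math

def standardized_residual(observed, expected):
    return (observed - expected) / math.sqrt(expected)

expected = 1.0  # 6 instances of a class spread over a 3 x 2 grid of cells
for observed in (0, 1, 2, 4):
    r = standardized_residual(observed, expected)
    verdict = "significant" if abs(r) > 1.96 else "not significant"
    print(f"observed={observed}: residual={r:+.2f} ({verdict})")
```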
The presence of both positive and negative logic rules, and the source of the rules in calculation
of mutual occurrence provides a transparency of inference extending down through the rule base to the
underlying data distribution. This level of transparency makes the use of PD highly advantageous as the
foundation for a DSS, along with the benefits of the statistical basis of both its rules and the MME input
quantization.
Simply put, in a domain such as decision support where the importance of transparency is paramount,
the accessibility of simple, robust, statistical arguments makes a PD based system very attractive.
2.2 Pattern Recognition and Classification Techniques
“Pattern recognition” is the study of techniques for the extraction and matching of a known pattern in
a test data set. This is a subset of the larger field of “machine learning.” To discuss the features of the
PD based FIS system described in this work, a description of the data analysis issues will be presented,
followed by a brief description of various classification algorithms in popular use.
2.2.1 Considerations Derived from the Data Domain
The domain of the input data, and the universe of possible values within it, is a defining characteristic
for classification algorithms. Some classifiers function best on continuous data, some on discrete. This
is largely due to the structure of the data representation within the classifier as it relates to the structure
of the data itself. The largest portion of real-world data is, however, continuous. In order to use discrete
methods on continuous data, some form of quantization must be applied.
Classification of Continuous-Valued Data
Classifiers designed for use with floating-point values generally view the data domain as an n-dimensional
space in which one or more surfaces will be placed, to form boundaries between decision regions. Each
region is unambiguously associated with one of K labels associated with the data set.
If these surfaces happen to form an orthogonal planar division of the input feature space it is quite easy
to explain the mapping between feature values and output labels. This scenario, which is very unlikely
when locating an optimal input space division, is a common characteristic of quantized/discrete analysis
schemes and is largely what makes them so easily explained.
Note that even when a classifier is described as functioning on “continuous” data, the domain of the
input data for a supervised classifier is still not x ∈ R. There are two major reasons for this: the underlying
representational limits of a digital “floating point” representation; and the fact that in any finite amount of
training data, the infinitely large universe of real numbers (i.e., R) cannot be realized. Instead, a relatively
small number of distinct values will be observed, though the universe of values which are possible in a
training set is quite large (and bounded only by the digital representation).
Classification of Discrete Data
“Discrete data” by contrast can be defined as a data universe in which the set of possible values is both
finite and, usually, small. The largest such universe is x ∈ I (the integers); smaller nominal or ordinal
sets are also described as discrete data.
Classifiers of discrete data tend to be relatively simple as there is a known finite universe of possible
cases into which each data element can fit.
It is therefore possible to come up with an exhaustive description (at least for relatively small universes) which can eliminate the need for generalization to large sections of the data universe. A complete
description of the universe can be created using a large contingency table outlining all possible choices —
that is, the data can be divided among a set of orthogonal hyper-cells, where cells are created independently
along each axis, and thereby group input data values.
As the universe of possible combinations grows, the main problem in a discrete classification system
is that it may be impossible to ascertain a probable labelling for some cells in the contingency table, as
some data value combinations may never be observed during training.
One of the major problems with the application of discrete algorithms is the fact that a large fraction
of real-world data is continuous. In order to use a discrete algorithm on continuous data, the data must be
quantized or “discretized”. Adapting quantized data to discrete algorithms is therefore an important topic.
The division of input data into discrete quanta can be seen as analogous to the division of the input
space performed by continuous-data classifiers. A quantization algorithm constructs precisely the orthogonal
input divisions described above as being unlikely to arise from a continuous algorithm. The difference between
the orthogonal division of the discrete quantization and the “least error” division of the continuous algorithms is a significant source of error when using continuous data in a discrete algorithm. The benefit is
that feature independent orthogonal quantization is transparent, and easy to explain. Such quantization
produces an event-based data space in which discrete algorithms can be used.
As will be shown in the discussion of the performance of the fuzzy inference system (FIS), when
discrete analysis is performed on quantized continuous data, some of the error introduced by discretization
can be recovered through rules speaking to data of similar locality. This alleviates some of the cost incurred
by quantization.
2.2.2 Learning System Architectures
A summary will now be presented of several popular classification architectures. In each case, the
strengths and weaknesses relative to the proposed PD/FIS system will be discussed, in terms of the intended application area of decision support systems.
Back-propagation Artificial Neural Networks
A back-propagation network (Rumelhart, Hinton and Williams, 1986; Minsky and Papert, 1988) trains
by using a random initialization of weights describing a set of partitions; an error surface is iteratively
minimized by successively considering the error relative to each input point many times.
This is a gradient-descent method; back-propagation networks are therefore prone to issues with
local minima in the error space. The main obstacle to their application in this domain is the resulting lack
of interpretability of the final stable state.
Figure 2.3: A Back-Propagation Neural Network (input nodes feeding a layer of hidden nodes, which control the degree of complexity, feeding output nodes)
This type of network is constructed by a set of input, output and hidden nodes, as shown in Figure 2.3 (adapted from similar figures found in Rumelhart et al. (1986); Minsky and Papert (1988); Simpson (1991); Hertz, Krogh and Palmer (1991) and others). Each node is fully connected to each subsequent
node; the value in each node is therefore passed on to each subsequent node after being scaled by a weight
value associated with the link (i.e., the lines in Figure 2.3).
The back-propagation algorithm describes how to tune these weights to reduce the observed error on
training data. The weights are usually seeded with randomized values.
Each node in the hidden layer (or layers) of a back-propagation network allows a greater degree of
non-linearity in the final partition space by controlling the location and angle of some high-dimensional
hyper-plane. While the geometry of these planes is accessible through an examination of the weights,
the actual topology of the space is not easily visualizable, and certainly is not explainable to a user not
familiar with the mathematics involved.
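To make this structure concrete, the following is a minimal sketch of a forward pass through such a network (the layer sizes, logistic activation and random seeding are illustrative choices only, not values drawn from this work):

import numpy as np

def forward(x, W1, b1, W2, b2):
    # each hidden node sums its weighted inputs and applies a squashing non-linearity
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))      # hidden layer activations
    # the output nodes repeat the process on the hidden activations
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # output layer activations

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # randomly seeded weights
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(np.array([0.2, -1.0, 0.5]), W1, b1, W2, b2))

Even in this tiny example, the mapping from the twelve first-layer weights to the resulting decision geometry is not obvious by inspection, which is exactly the interpretability problem described above.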
A further complication is that the gradient descent search from a randomly initialized topology is not
likely to produce a division of input features which is logical in anything other than an abstract mathematical sense. In particular, the divisions of the input space will appear arbitrary to a casual user, again
requiring a mathematical explanation to assure the user of their correctness and logic.
For this reason, a simpler division of the input feature space and a more intuitive description of the
resulting decision surfaces will produce a more accessible and transparent classification engine, resulting
in a more understandable (and therefore better) decision support system.
Such a network is sometimes referred to as a multi-layer perceptron (MLP).
Expectation-Maximization (E-M)
The E-M algorithm (Duda, Hart and Stork, 2001) operates on a maximum-likelihood probabilistic (Bayesian)
approach. The main feature of the E-M algorithm is that its design takes into account missing feature values, replacing their values by a maximum-likelihood estimation when required during training.
While this admirable feature makes it robust and useful on many real-world continuous and discrete
data sets when missing values are present, it suffers from many of the same drawbacks as back-propagation
in terms of its transparency and interpretability.
E-M is not a gradient descent algorithm; instead the E-M algorithm maximizes the log-likelihood of
all observed data values, filling in expected values for any missing data points. The global log-likelihood
expectation is maximized by an iterative search based on a simple assumed model Θ which is gradually
fit to match the observed likelihoods of the available data. This algorithm is explained in detail in Duda
et al. (2001, pp. 124–128).
While not a gradient descent algorithm, the resulting fit to a parametric error surface will be just as
opaque as the local minima fit optimizations of back-propagation. Essentially, the E-M algorithm is still an
iterative fit to an error surface; the means by which the fit is produced is not instructive in determining the
value of any final rule. In particular, both the input space divisions and resulting classification geometry
will again require significant mathematical analysis to convey the relationship between the final rules and
the input data to a user.
Support Vector Machines (SVM) and Maximum Margin Classifiers
Support Vector Machines (SVM) (Vapnik, 1995; Joachims, 1998, 2005; Cristianini and Shawe-Taylor,
2000; Duda et al., 2001) and the larger family of “kernel” based classifiers such as maximum margin
classification (Xu, Neufeld et al., 2005) actually increase the dimensionality of the input space (potentially
to an infinite number of dimensions) by projection.
The advantage of doing this is to move the data into a high dimensional space in which the data is well
separated, and classification with low error rates is possible.
Excellent performance results may be had with this technique; from an interpretability point of
view, however, it obscures the inner workings of the algorithm even more than back-propagation does:
instead of working in a space of dimensionality analogous to that of the input space, a user must now
understand the projection by which the kernel methods expand the number of degrees of freedom for the
problem.
Bayesian Decision Theory
The ideas behind Bayesian Decision Theory (Bayes, 1763; Duda et al., 2001) involve the probabilistic
minimization of computed risk.
The main limitation of Bayesian decision theory is the assumption that the distribution from which
input data points are drawn is well understood, and can be characterized in terms of its form and
overall likelihood.
The PD algorithm used in this work does not make any such assumptions, and as will be shown in later
chapters is largely insensitive to the actual distribution. This feature makes PD an interesting candidate
for DSS design.
Dempster-Shafer Theory
The decision technique referred to as Dempster-Shafer (D-S) Theory (Dempster, 1968; Shafer,
1976; Shafer, 1990; Shafer and Pearl, 1990; Yager, Fedrizzi and Kacprzyk, 1994) is the combination of
strict maximum probability based assignment with a belief model. The addition of belief to probability
theory provides a mechanism for the representation of the differing amounts of knowledge available in various situations, and in particular, allows a representation of conflict and uncertainty within the mechanism
of probabilistic inference.
As an example of the function of D-S theory, let us consider a case where two witnesses, Laura and
Monica, report on whether a burglary has just occurred at a store. At the time of the incident, Laura was
across the street from the store, waiting for a friend. Monica, on the other hand, was sitting on a bench
reading a book, and therefore not paying as much attention to her surroundings. Let us therefore represent
our degree of belief in statements from the two witnesses as BL = 0.9 and BM = 0.6, indicating that we
have a high degree of belief in Laura’s testimony, and a lesser degree in Monica’s.
If Laura states that a burglary took place at the store, and if Monica disagrees, then we can represent
our understanding using D-S theory. First, based on only Laura’s statements, we would have a 0.9 degree
of belief that a robbery took place, but a 0 degree of belief that one did not. This is due to the fact that
discounting Laura’s story does not contradict the possibility that a robbery occurred unobserved. Similarly,
Monica’s story gives us a 0.6 degree of belief that a robbery did not occur, but a zero degree that one did.
Treating the witnesses as independent, we can simply multiply values to calculate probabilities, as
described in Dempster (1968). Therefore we can calculate the probability that Laura is reliable but Monica
is not,

$$R_L = B_L \times (1 - B_M) = 0.9 \times 0.4 = 0.36 \tag{2.1}$$

or that Laura is not reliable but Monica is,

$$R_M = (1 - B_L) \times B_M = 0.1 \times 0.6 = 0.06; \tag{2.2}$$

we can also calculate the probability that neither is reliable,

$$R_\emptyset = (1 - B_L) \times (1 - B_M) = 0.1 \times 0.4 = 0.04. \tag{2.3}$$
We must normalize each of these possibilities over the universe of total possibility. Note that in this
case, the universe does not sum to 1.0 because Laura and Monica have taken opposing views, and in
consequence there is no possibility that they are both correct. The size of the universe of possibility in this
case may therefore be expressed as the sum of the three probabilities just discussed, or

$$U = R_L + R_M + R_\emptyset = 0.36 + 0.06 + 0.04 = 0.46. \tag{2.4}$$

Taking each of these possibilities in turn over the universe of possibility, we get:

$$\Pr(R_L) = \frac{R_L}{U} = \frac{0.36}{0.46} = 0.78 \qquad
\Pr(R_M) = \frac{R_M}{U} = \frac{0.06}{0.46} = 0.13 \qquad
\Pr(R_\emptyset) = \frac{R_\emptyset}{U} = \frac{0.04}{0.46} = 0.09 \tag{2.5}$$

This allows us to represent the probability that Laura is correct (and there was a burglary) at 0.78; the
probability that Monica is correct (and there was no burglary) at 0.13; and the additional probability of not
knowing whether a burglary occurred or not as 0.09.
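The arithmetic of this example is compact enough to re-derive in a few lines; the sketch below simply normalizes the three consistent possibilities (the variable names are illustrative only):

B_L, B_M = 0.9, 0.6                       # belief in Laura and in Monica
R_L = B_L * (1 - B_M)                     # Laura reliable, Monica not
R_M = (1 - B_L) * B_M                     # Monica reliable, Laura not
R_0 = (1 - B_L) * (1 - B_M)               # neither witness reliable
U = R_L + R_M + R_0                       # universe of consistent possibilities
print(R_L / U, R_M / U, R_0 / U)          # 0.78, 0.13, 0.09 (rounded)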
As explained in Zadeh’s (1984) review of Shafer (1976), the result of D-S modelling is
an inference system which is based on probabilities involving sets of elements, rather than on a point
probability model. Even though in his review Zadeh states that (in his opinion) “D-S theory does not
capture the human mode of reasoning about possibility,” this model of representation remains interesting,
as attested by the ongoing and vigorous publication activity in this field. A model based on D-S theory to
provide decision support in some manner designed to fit the purpose evaluated in this work would certainly
be possible, and a future work could explore this option.

Figure 2.4: Typical ART Neural Network (an attentional subsystem containing STM fields F1 and F2 with gain controls, and an orienting subsystem that issues a reset according to the vigilance-parameter matching criterion)
Adaptive Resonance Theory (ART)
The Adaptive Resonance Theory (or ART) algorithm of Grossberg (1976) is a neural net based approach
which was originally phrased as a model for true biological learning. While based on the ideas in Grossberg (1976), the technique was fully presented in Carpenter and Grossberg (1987b) and has since been
refined (see Carpenter and Grossberg, 1987a,b, 1990; Carpenter, Grossberg and Reynolds, 1991a; Grossberg, 1995).
This network stores exemplar values observed in a training data set and initializes a new exemplar
when no existing match can be found within a given tolerance. When no new exemplar is created, the
nearest-matching existing exemplar is updated to take into account its resemblance to the newly viewed input
vector.
Figure 2.4 shows the general scheme of an ART-based neural network. This figure is reproduced
from Carpenter and Grossberg (1990), as is Figure 2.5, which shows the sequence of events that occur in
ART pattern matching. In Figure 2.5 a), a new pattern is shown to the system, which does not exactly
match the previously stored exemplar. In Figure 2.5 part b), this is discovered by the “attentional
subsystem” forming the left-hand side of the network. Figure 2.5 c) shows that the patterns are insufficiently
different (determined by a threshold control termed the “vigilance parameter”), and so in Figure 2.5 d),
the exemplar is updated to incorporate the new data.

Figure 2.5: Exemplar matching in ART (four panels, a through d, tracing F1/F2 activity and the reset signal as a new pattern is presented, mismatched, searched and finally incorporated into the stored exemplar)
This striking idea has wonderful visual connotations, and is potentially very expressive in pattern
matching, assuming the user can transparently understand the metric by which a match occurs.
As originally proposed in Grossberg (1976) and as refined through Carpenter and Grossberg (1987a,b,
1990) ART deals strictly with discrete data values. By adapting the feedback state through use of fuzzy
logic techniques, “Fuzzy ART” (Carpenter, Grossberg and Rosen, 1991b) seamlessly deals with continuous-valued data, and provides one of the most elegant refinements of a discrete algorithm through the use of
fuzzy systems.
While ART is interesting, in the context of our decision support goals it is not clear that the matching
of the exemplars in a generic feature space would be explainable, other than in a manner similar to that
used by back-propagation.
Evolutionary Algorithms
Evolutionary algorithms, as a family, take the idea of gradient descent search into a probabilistically-directed search domain.
Within the general family of evolutionary algorithms are such ideas as genetic algorithms and genetic
programming (Goldberg, 1989; Goldberg and Deb, 1991; Syswerda, 1991; Koza, 1992; Mitchell, 1996)
as well as simulated annealing (Kirkpatrick, Gelatt et al., 1983), and other randomized search techniques
such as particle swarm (Kennedy and Eberhart, 1995).
All of these techniques revolve around the central idea that local minima can be avoided by adding a
noise element to the search direction. Coupled with a parallel search of the problem space involving many
(randomly initialized) points, this powerful technique is tractable even when applied to several difficult
problems for which other techniques fail. In particular, the problem space need only be characterized in
terms of a cost or benefit function; geometries which are poorly understood can thus be traversed as long
as two possible solutions can be evaluated in terms of their merit, even if they cannot be evaluated in any
other way.
While quite powerful and requiring very little configuration, the results returned by
an evolutionary solution have no guarantee of interpretability other than as a least-cost/best-fit solution.
While the solution itself may perform well with respect to classification, there is no available explanation of the reasons underlying the selection of a given label.
Decision Tree Classifiers
A decision tree classifier functions by establishing a set of decisions that, when made successively, result
in the assignment of a label to input data.
Decisions and sub-decisions are always made based on the same initial test, so the decisions themselves can be structured in the form of a tree.
Decision trees form a powerful, simple means of rapidly coming to a classification based on applying
a sequence of decisions over an input data vector in order to traverse the tree.
The broad field of decision tree classifiers contains algorithms such as Ross Quinlan’s ID3 (Quinlan,
1986), C4.5 (Catlett, 1991; Quinlan, 1993) and FOIL (Quinlan, 1996), as well as other popular
systems such as WEKA (Witten and Frank, 2000).
The main drawback of a decision tree is the forced direction of traversal. This does not capture how
humans actually think, and may not be well supported by the data if there are missing values present.
In order to support human decision making, it is preferable to support the human mode of thought; this
especially includes exploring the results of “what if?” questions, which may involve constructing answers
based on partial knowledge which is not encoded in the tree.

Table 2.1: Mushroom Example Data Set

    Size    Spots     Colour   Class
    Large   Spotted   Yellow   Poisonous
    Large   Striped   Brown    Edible
    Small   Spotted   Brown    Edible
    Small   Spotted   White    Poisonous
    Small   Plain     Brown    Poisonous

Figure 2.6: Tree Characterizing the Mushroom Data Set (root test on Size: Large branches on Colour, with Yellow giving Poisonous and Not Yellow giving Edible; Not Large branches on Spots, with Not Spotted giving Poisonous and Spotted branching further on Colour, with Brown giving Edible and Not Brown giving Poisonous)
In a decision tree, a single test must be isolated as the root of the tree. The PD algorithm allows the
presentation of the highest-weighted rule triggered by the match of the data vector as the strongest (and
therefore first) source of decision support. In a tree system there is always a fixed first question which one
must ask.
Similarly, it is impossible, while navigating a tree, to ask a question based on an unobserved value –
the best response which can be made is that the creation of the tree did not necessitate observation of that
value, based on traversal from the initial root node. This is completely different both from deciding the
value is irrelevant to any decision and, as is the case in PD, from deciding that the value cannot answer any
questions with statistical confidence.
Consider the example data set shown in Table 2.1, and the decision tree shown in Figure 2.6. While
this tree adequately captures the decisions needed to classify a new input record, it cannot answer the
question:
“Are mushrooms which are ⟨…⟩ and ⟨…⟩ edible?”
A system presenting a full contingency table would allow exploration of such a question. The partial
contingency table of PD supports all questions for which a statistically significant answer is available. To
do this, significantly more rules are stored by PD than are stored by C4.5.
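As an illustration of the forced traversal, the tree of Figure 2.6 can be sketched as nested tests; the dictionary encoding below is a hypothetical representation, not the structure used by ID3 or C4.5:

tree = ("Size", {                          # the fixed root test
    "Large": ("Colour", {"Yellow": "Poisonous", "other": "Edible"}),
    "Small": ("Spots", {
        "Spotted": ("Colour", {"Brown": "Edible", "other": "Poisonous"}),
        "other": "Poisonous"})})

def classify(record, node):
    if isinstance(node, str):              # a leaf: return the class label
        return node
    feature, branches = node
    value = record.get(feature)            # the traversal direction is forced
    return classify(record, branches.get(value, branches.get("other")))

print(classify({"Size": "Small", "Spots": "Spotted", "Colour": "Brown"}, tree))

Note that classify must always begin by asking about Size; there is no way to pose a query that starts from, say, Colour alone, which is exactly the limitation discussed above.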
The structure of the tree yields an efficient means of classifying data (as the maximum number of data
elements examined is equal to the height of the tree).
The efficient classification of a tree algorithm is significantly different from the problem of explanation. In such an algorithm, the “explanation” consists of the presentation of the single rule defining the
traversal from root to leaf. This path defines the series of tests by which the tree determined the outcome,
ranked from the decision with the highest information gain to achieve any labelling down to the decision
which resolved the given label within a small sub-partition of the data space. The decision forming the first
branching of the tree therefore only describes what decision globally gives the most information within
the decision space, whereas in the PD system the user is presented with the test that contains the most
information specific to the current input data elements.
For this reason alone, a PD based implementation merits consideration over a tree-based one, as data-specific analysis reflects the way in which decision makers will approach their data. The presence of both
positive- and negative-rule inference in PD allows relationships between labels and rules to be explored
which are simply not available in a tree based system.
In some other work, such as in Kim, Lee and Min (1999); Boyen and Wehenkel (1999) and Chiang
and Hsu (2002), ID3 and C4.5 have been developed into fuzzy inference systems.
Fuzzy Inference Based Classification
Fuzzy inference is a family of techniques, referring only to the means by which logical values are combined in the form of rules, over a set of input membership functions.
Fuzzy inference in general does not speak to how the rules or input membership functions are created.
The work in this thesis presents a fuzzy inference system, and suggests one possible means of creating both
rules and input membership functions.
Several other papers mention rule generation for fuzzy systems. This literature can be divided into:
• interview of domain experts and construction of rules through human knowledge representation (Mamdani, 1974; Zadeh, 1976, 1978, 1983; Heske and Heske, 1999);
• fuzzy clustering (Krishnapuram and Keller, 1996, 1993; Pal, Bezdek and Hathaway, 1996; Pal, Pal
and Bezdek, 1997; Hathaway and Bezdek, 2002);
• neuro-fuzzy systems (Pedrycz, 1995; Labbi and Gauthier, 1997; Pal and Mitra, 1999; Kruse, Gebhardt and Klawonn, 1994; Nauck, Klawonn and Kruse, 1997; Höppner, Klawonn et al., 1999; Mitra
and Hayashi, 2000; Gabrys, 2004);
• fuzzy systems designed or configured through evolutionary algorithms (Ishibuchi, Nozaki et al.,
1994; Ishibuchi and Murata, 1997; Cordón, Herrera et al., 2001a; Spiegel and Sudkamp, 2003;
Hoffmann, 2004);
• use of an extension matrix (Hong and Lee, 1996; Yager and Filev, 1996; Chong, Gedon et al., 2001;
Wang, Wang et al., 2001; Xing, Huang and Shi, 2003) or tree generation algorithm such as ID3 or
C4.5 (Quinlan, 1986, 1993, 1996);
• approaches using contingency tables based on rough sets (Bean, Kambhampati and Rajasekharan,
2002; Shen and Chouchoulas, 2002; Tsumoto, 2002; Ziarko, 2002); and
• schemes using some kind of statistical clustering technique to create input membership functions,
combined with contingency table generation (Chen, Tokuda et al., 2001; Chen, 2002; Kukolj, 2002).
Possibilistic c-Means and Fuzzy c-Means
Possibilistic c-Means (Krishnapuram and Keller, 1993) and fuzzy c-Means (Pal et al., 1997) each provide
another means of generating rule sets based on examination of the underlying data. Both function by
performing a gradient descent evaluation over some (c) clusters in the feature space.
Showing some similarity to k-nearest neighbour, they both create fuzzy boundaries associating “nearby”
values together.
These systems function by performing the same sort of gradient descent search used by back-propagation
neural networks, which means that they may be trapped in local minima, and are not guaranteed to contain
rules whose construction will be transparent to an end-user.
Evolutionary Fuzzy Classifiers
Evolutionary fuzzy classifiers, such as that of Fukumi and Akamatsu (1999), while also creating linguistically
based rules, use the same type of randomized search used by other evolutionary algorithms. While the
rules themselves may be readable by a user, the means by which the rules were constructed is opaque; in
contrast, the PD system offers a clear statistical basis for rule construction.

Figure 2.7: Rough Set Example (the lower bound $\underline{A}X$, the upper bound $\overline{A}X$, and the true relation X on the attribute universe A)
Neuro-Fuzzy Classifiers
Similarly, neuro-fuzzy classifiers, while more transparent than their non-fuzzy neural network counterparts, have the same problems with respect to the transparency of the rule creation.
These systems function by performing the same sort of gradient descent search used by back-propagation
neural networks, which means that they may be trapped in local minima, and are not guaranteed to produce rules whose construction will be transparent to an end-user, but rather involve rules “that produce a
good result.”
Rough Set Based Classifiers
Rough sets (Pawlak, 1982, 1992; Lin and Cercone, 1997; Polkowski and Skowron, 1998; Øhrn, 1999;
Ziarko, 1999; Grzymala-Busse and Ziarko, 1999; Ziarko, 2000; Grzymala-Busse and Ziarko, 2000; Ziarko,
2001), as with fuzzy logic, are not a technique by themselves, but rather a means of representing uncertainty.
Rough sets deal in discernibility between equivalence classes rather than partial membership in a class.
Figure 2.7 shows a rough set in relation to a set of attributes, A. This figure shows the three regions of
discernment relative to A (the attribute universe) and relative to X, the true relation on A:
• the “lower bound” on the set, or $\underline{A}X$
• the “upper bound” on the set, or $\overline{A}X$
• the region outside of $\overline{A}X$
This is referred to as an “indiscernibility relation”. The boundaries of $\underline{A}X$ and $\overline{A}X$ may be the same;
if this is true for all boundaries then the set is simply a crisp set.
The task of constructing the relations is an open question, exactly comparable to the construction of
input membership functions in fuzzy logic. While techniques to create a rough-set based characterization
scheme could just as easily have been produced using a PD basis, there is no a priori reason to assume
that the performance or reliability would be any better than the FIS which has been implemented.
Classification Comparison
While the PD algorithm must deal with quantization issues, the attractiveness of the robust statistical
explanation for its rules, along with the statistical explanation from MME for quantization, gives a PD
based system such a high degree of transparency that it is a very interesting algorithm in the context of
decision support.
2.3 Reliability
There are several means of discussing the reliability of a test. The two most common are receiver
operating characteristic (ROC) curve analysis and measurement of decision confidence using
probabilistic methods.
2.3.1 Overall Classifier Reliability — ROC Analysis
Consider the situation where there are two candidate classifiers for a given two-class problem. A method
to evaluate the two classifiers and provide a means of ranking their quality at discerning between the
classes would be advantageous in selecting which classifier to use.
For example, if a test is constructed for hypothyroidism, an analysis can be done to determine that this
test is likely to produce an incorrect result 1.25% of the time, and that of these errors, 75% will incorrectly
indicate the presence of a disease when there is none.
This type of analysis is done using a receiver operating characteristic (ROC) curve (Metz, 1978)
analysis, which provides sensitivity and specificity measures for a given two-outcome classifier.
This type of measure is discussed extensively in the medical informatics literature (see any of: Berner,
1988; Keller and Trendelenburg, 1989; Shortliffe and Perreault, 1990; Kononenko and Bratko, 1991;
López de Mántaras, 1991; Gibb et al., 1994; Harris and Boyd, 1995; de Graaf et al., 1997; Larsson et al.,
1997; Friedland, 1998; Kukar et al., 1999; Kononenko, 2001; O’Carroll et al., 2002; Abu-Hanna and
de Keizer, 2003; Colombet et al., 2003; Coiera, 2003; Cowan, 2003; Kukar, 2003; Montani et al., 2003;
Brillman et al., 2005) as well as in much of the discussion regarding reliability in decision making (see, for
example Schniederjans, 1987; Sage, 1991; Sundararajan, 1991; Barlow, Clarotti and Spizzichino, 1993;
Cordella et al., 1999; Gurov, 2004, 2005).

Figure 2.8: Specificity and Sensitivity (two overlapping probability density functions of the test statistic, one for each desired outcome, divided by a decision boundary into true positive, false positive, true negative and false negative regions; moving the boundary increases sensitivity at the expense of specificity, or vice versa)
The general purpose of an ROC test is to generate a statistic globally characterizing the ability of a
two-outcome classifier to separate two distinct outcomes.
Figure 2.8 shows a plot describing typical distributions of desired positive and negative outcomes. The
portion of the graph above the central horizontal line indicates a PDF (probability density function) of the
occurrence of the values for which we want to see identification as a “positive” outcome. Below the line
is a similar PDF describing the occurrences of values for which a “negative” identification is desired. The
vertical line in the centre of the figure indicates a decision threshold, dividing the data values between
“positive” and “negative” decisions; values from either PDF falling to the right of the decision boundary
will be identified as “positive”, and all values (from either PDF) falling to the left will be identified as
“negative”.
The sensitivity of the test is simply calculated as the fraction of the truly positive cases that are
correctly identified; the specificity, conversely, is the corresponding fraction of the truly negative cases.
Figure 2.9: Example ROC Curve (true positive fraction (sensitivity) plotted against false positive fraction (1-specificity) for candidate tests “A” and “B”, with the diagonal x = y shown for reference)
As these distributions frequently overlap (as shown in Figure 2.8), it is usually impossible to select a
test statistic for which a single threshold will avoid making errors in assigning label values.
From ROC analysis, we get the categorization of error types:
Type I Error: a rejection of the null hypothesis when it is in fact true; also a “false positive;”
Type II Error: an acceptance of the null hypothesis when it is in fact false; also a “false negative.”
By plotting the fraction of the errors incurred using each possible threshold value for a given classifier,
a curve (the ROC curve) is produced. The area under such a curve provides a metric by which two
classifiers can be compared. The curve with the greater area has the greater ability to discriminate between
the desired positive and negative outcomes.
The following example is drawn largely from the work by Tape (2005). Two ROC curves are plotted
as shown in Figure 2.9, for two different candidate tests “A” and “B.” The fraction of positive-class results
which are classified correctly (the sensitivity) is plotted against the fraction of positive-class results which
are classified incorrectly (1-specificity).
As can be seen in Figure 2.9, this will produce a curve which is above the line x=y. The better the test,
the more closely this curve will approach the upper-left corner of the graph. For this reason, the accuracy of the test is
measured using a calculation of the area under the ROC curve over the range $[0 \ldots 1]$, which produces an
area bounded in the domain $[\frac{1}{2} \ldots 1]$.
ROC curves whose area is close to $\frac{1}{2}$ are useless; those whose area is 1 are perfect discriminators
between the “positive” and “negative” classes.
As the shape of the distribution of decisions (as shown in Figure 2.8) is dependent upon the test, the
ROC curve provides a means of evaluating different tests for the same decision.
As can be seen in Figure 2.9, this provides a simple means of characterizing the two tests, as it is not
otherwise obvious which of tests “A” and “B” would be superior, as at some points the sensitivity of “A”
exceeds that of “B”, and at some points the reverse is true.
Given the calculations of the areas under the two curves, Area(A) = 0.74 and Area(B) = 0.81, we see
the “B” test is apparently superior by this measure; however, it is important to note that statistical tests for
significance between the curves still apply, incorporating measures such as the number of points, etc.
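The construction of such a curve is mechanical: each observed test statistic is tried as a threshold, the resulting (1-specificity, sensitivity) point is recorded, and the points are integrated. A minimal sketch, with invented scores and labels:

import numpy as np

def roc_auc(scores, labels):
    # sweep every observed score as a threshold; higher scores mean "positive"
    pts = [(0.0, 0.0)]
    P, N = labels.sum(), (labels == 0).sum()
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / P    # sensitivity
        fpr = (pred & (labels == 0)).sum() / N    # 1 - specificity
        pts.append((fpr, tpr))
    xs, ys = zip(*pts)
    return np.trapz(ys, xs)                       # area under the curve

labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])       # invented ground truth
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
print(roc_auc(scores, labels))                    # area for this invented data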
While ROC analysis can provide a measure of performance of a classifier in a two-outcome test, this
measure is a global one, reporting the reliability as a mean chance of failure of the classification system.
What is truly desired in a decision support system is a means of inferring the probability of failure of
a particular decision, not the class of all decisions made by a system. For that reason, we turn to the
discussion of decision confidence within a system when knowledge of particular input values is available.
2.3.2 Input Specific Reliability
System reliability can be measured in terms of the probability of a correct response. Specifically, the
“reliability” of a system is simply the complement of the probability of failure, or

$$R = 1 - \Pr(\text{failure}) \equiv 1 - \Pr(\text{incorrect}), \tag{2.6}$$

with the associated confidence taken as the expectation of this reliability, $C = E(R)$.
Generalized Probabilistic Measurement of Decision Confidence
Frequently, this is calculated as the compound probability of both measurement and inference error, combined using Bayesian logic.
Bayes’s (1763) theorem provides the well-known and convenient mechanism of relating a prior distribution of a particular label, $\Pr(\Psi)$, with the probability of the occurrence of a particular input vector given
the occurrence of the label, $\Pr(x|\Psi)$, providing the probability of occurrence of the label given the input
vector as

$$\Pr(\Psi|x) = \frac{\Pr(\Psi)\Pr(x|\Psi)}{\Pr(x)},$$

using the relation

$$\Pr(\Psi|x)\Pr(x) = \Pr(\Psi, x) = \Pr(x|\Psi)\Pr(\Psi).$$
Using these relations, one can calculate the probability of an incorrect assignment for any suggested
labelling based on the observed probabilities calculated on training data, assuming a direct probabilistic
path of inference exists from input value measurement to output characterization. If this path exists, this
allows a software system to report the probability of incorrect support at the time of a decision presentation.
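In a discrete domain these probabilities reduce to simple counting over the training data. The sketch below estimates Pr(Ψ|x) directly from joint occurrence counts (the four-row training set is invented purely for illustration):

from collections import Counter

def posterior(train, x):
    # train: list of (feature_tuple, label) pairs; x: a feature tuple
    joint = Counter(train)                        # counts approximating Pr(x, label)
    labels = Counter(lbl for _, lbl in train)     # counts approximating Pr(label)
    n = len(train)
    px = sum(c for (feats, _), c in joint.items() if feats == x) / n
    # Bayes: Pr(label | x) = Pr(x, label) / Pr(x)
    return {lbl: (joint[(x, lbl)] / n) / px for lbl in labels} if px > 0 else {}

train = [(("a", 1), "P"), (("a", 1), "P"), (("a", 1), "N"), (("b", 0), "N")]
print(posterior(train, ("a", 1)))                 # {'P': 0.67, 'N': 0.33} (rounded)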
By using Dempster-Shafer theory, this technique can be extended through the postulated limits of
“belief”; however, the underlying assumption in all Bayesian inference is that the mapping of uncertainty
can be done in probabilistic terms.
Such a technique cannot be used when the mechanism of inference is not probabilistic, such as in
possibilistic fuzzy inference systems. A derived probabilistic model is frequently employed in such
contexts.
Derived Probabilistic Methods
Generalized indices of reliability which only approximate probabilistic models can also be used, such as
the certainty factor of the MYCIN system (Shortliffe, 1976; Buchanan and Shortliffe, 1984).
This index simply uses a minimum operation at each logical join, rather than computing true
joint or conditional probabilities.
While not directly based on probabilistic methods, it has been proven by Heckerman (1986) (and
summarized by Ginsberg (1993)) that there is an underlying probabilistic basis for this method.
When tested against a true probabilistic model using the same inference, there was no measurable
difference in MYCIN’s performance (see, for instance, Ginsberg, 1993).
Much of the discussion regarding the “usefulness” of fuzzy systems involves a variation of this discussion (for example Klir, 2005a,b), and much of this work is in the form of a response to the famous paper
by Cox (1946), which claims that any non-probabilistic system is without merit.
The large volume of research continuing in the many-sided field of approximate reasoning refutes the
validity of Cox’s claim, as do specific responses such as those by Klir, above.
Measured Reliability
The overall focus of all reliability modelling is to predict the true probability of failure of a specific
classification or suggestion. It is therefore desirable to determine reliability by calculating the probability
of failure of the labelling process as a function of some system parameter (usually input values or some
internal state variable).
Such a calculation will provide an estimate of the true system reliability. The quality of the estimate
can be assessed by evaluating how well estimated values correlate with measured reliability, as calculated
for a population of test values.
In the work in this thesis, such a reliability metric will be used. This technique will be introduced,
discussed and evaluated in Chapter 9, “Confidence and Reliability.”
2.4 Summary
Several classification schemes have been presented. The common thread of all of the systems discussed is
that they do not have complete transparency in the creation of their decisions, and they do not necessarily
support the exploration of a suggestion presented to the user.
What is desired is a system which has a sound basis for rule generation, provides a simple means
of producing suggestions which will be explainable, provides a good estimate of the reliability of any
suggestion made, and, finally, presents its suggestions in a framework that allows the exploration of both
the suggestion itself and the decision space from which the suggestion was drawn.
A pattern discovery based characterization system has all of these benefits:
• there is a statistical basis for both the input quanta and rules (patterns) generated from them;
• transparency is provided both through framing inference in terms of rules (or patterns) and from the
statistical basis of the system; and
• the adaptation of the PD algorithm to fuzzy analysis provides a linguistic mechanism for explanation
and inference.
What remains is to consider the performance of a PD based FIS. To do this, we must first understand
how the PD algorithm works.
Chapter 3
(Non-Fuzzy) Pattern Discovery
The Pattern Discovery (PD) algorithm was originally described in the doctoral work of Wang (1997) and
later published in the journal literature (see Wong and Wang, 1997, 2003; Wang and Wong, 2003). It
functions as a rule based classifier constructed through statistical inference.
3.1 Pattern Discovery Algorithm
Consider a set of discrete training data presented as an array of N rows of length M+1. Each row or
input vector contains M input feature values and a single class label, $Y = y_k$, from a set of K possible class
labels.∗ Every input vector can be considered to be an event of order M+1 in input space. Each element of
an M-dimensional input feature vector, $x_j$, $j \in \{1, 2, \cdots, M\}$, can have one of $\nu_j$ discrete observed values
drawn from the set of possible values or primary events describing feature j. Each possible combination
of m primary events selected from within a vector can be considered a sub-event of order m, $m \in \mathbb{I}$,
$1 \le m \le M+1$.
Primary (or first order) events are represented as $x^1_l$, while in general an event of order m is represented
as $x^m_l$, with l indicating a particular sub-event within the list of all sub-events of order m occurring in a
particular input event x. Events of interest with respect to classification must be of order 2 or greater and
be an association of at least one input feature value (a primary event) and a specific class label.
PD analysis begins by counting the number of occurrences of all observed events among the N vectors
forming the training data. Statistically significant events (or “patterns”) within this set are then discovered
by using a residual analysis technique.

∗ A summary of the variable definitions used throughout this thesis is supplied in Appendix A at the end of the document. This is intended to form a general reference for the notation throughout the development of the discussion.
3.2 Definitions for Residual Analysis
Definition 3.1 (Standardized Residual):
The standardized residual is defined as the ratio of the simple residual (i.e., $o_{x^m_l} - e_{x^m_l}$) to the square root
of its expectation (Haberman, 1973):

$$z_{x^m_l} = \frac{o_{x^m_l} - e_{x^m_l}}{\sqrt{e_{x^m_l}}} \tag{3.1}$$

where $e_{x^m_l}$ indicates the number of occurrences of $x^m_l$ expected from observation of a training set of known
size under an assumed model of uniform random chance, and $o_{x^m_l}$ is the observed number of occurrences
of $x^m_l$ in a training data set.
In equation (3.1) it is important to note that the expectation value $e_{x^m_l}$ cannot fall to zero under the
assumed model used here. This is because $e_{x^m_l}$ is equivalent to a linear scaling of the number of available
training examples, as it is produced by dividing the available training examples among the number of
quanta. The only way a zero value could therefore be produced is by having no observed values for some
feature. If this were to occur (i.e., if all the values for a given column in the training data were missing)
the adjusted residual would be undefined; however, in this pathological case there is no discovery system
that would be able to proceed. It suffices to proceed under the assumption that there will be a non-zero
number of exemplars available for every input feature.
The standardized residual provides a normalization of the difference $o_{x^m_l} - e_{x^m_l}$ such that $z_{x^m_l}$ has an
asymptotic Normal distribution (for a proof, see Haberman, 1973, 1979). The standardized residual does
not, however, have a unit standard deviation; for this reason we proceed to further scale $z_{x^m_l}$ so that the
distribution is N(0, 1) by considering the adjusted residual.
Definition 3.2 (Adjusted Residual):
The adjusted residual is a normalization of the standardized residual (also Haberman, 1973, 1979) that
achieves a N(0, 1) distribution by adjusting the variance of the previously zero-mean Normally distributed
deviate of equation (3.1):

$$r_{x^m_l} = \frac{z_{x^m_l}}{\sqrt{v_{x^m_l}}} \tag{3.2}$$

where $v_{x^m_l}$ is the maximum likelihood estimate of the variance of the $z_{x^m_l}$ value of (3.1); as given by Wong
and Wang (1997), this is:

$$v_{x^m_l} = \mathrm{var}\left(z_{x^m_l}\right)
           = \mathrm{var}\left(\frac{o_{x^m_l} - e_{x^m_l}}{\sqrt{e_{x^m_l}}}\right)
           = 1 - \prod_{x_{l_i} \in x^m_l} \frac{o_{x_{l_i}}}{N} \tag{3.3}$$

where $o_{x_{l_i}}$ is the number of occurrences of the primary event $x_{l_i} \in x^m_l$ ($x^m_l$ is the current event being
examined) and N is the total number of observations made (i.e., the number of rows in the training data
set).
The benefit of the adjusted residual is that it scales the space of the standardized residual into that of
a Normal deviate with unit standard deviation. Using the resulting N(0, 1) space, we can easily calculate
observed deviation from expectation.
3.3 Pattern Identification Using Residual Analysis
The test performed on each $x^m_l$ event to determine whether it is “significant” simply compares the observed
number of occurrences of the event with the expected number of occurrences under the null hypothesis
that the probability of the occurrence of each component primary event is random and independent.
The observed number of occurrences of $x^m_l$ is represented as $o_{x^m_l}$ and the expected number of occurrences, $e_{x^m_l}$, is

$$e_{x^m_l} = N \prod_{x_{l_i} \in x^m_l} \frac{o_{x_{l_i}}}{N}, \tag{3.4}$$

where $o_{x_{l_i}}$ is the number of occurrences of $x_{l_i}$, itself a primary event drawn from the event $x^m_l$.
To select significant events, the adjusted residual $r_{x^m_l}$ defined in (3.2) is used, as it provides a N(0, 1)
distributed z statistic (i.e., a statistic drawn from a normal distribution with zero mean and unit standard
deviation). The value $r_{x^m_l}$ defines the relative significance of the associated event $x^m_l$. The PD algorithm
deems an event to be significant if $|r_{x^m_l}|$ exceeds 1.96, defeating the null hypothesis with 95% confidence.
Events capturing significant relationships between feature values in the training data are termed “patterns.” Patterns are used to suggest the class labels of new input feature vectors. Patterns containing a
value for the label column are termed “rules.”
Significance is calculated in absolute terms because combinations of events which occur significantly
less frequently than would be expected under the null hypothesis (patterns with a negative $r_{x^m_l}$) are just as
significant and potentially discriminative as those that occur more frequently. Such patterns may be used
to contra-indicate a specific class label.
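Taken together, equations (3.1) through (3.4) reduce to only a few operations. The sketch below evaluates the significance test for a single event, assuming the occurrence counts have already been tallied (all counts are invented for illustration):

import math

def adjusted_residual(o_event, primary_counts, N):
    # expected occurrences under the independence null hypothesis, eq. (3.4)
    e = N * math.prod(o / N for o in primary_counts)
    z = (o_event - e) / math.sqrt(e)                    # standardized residual, eq. (3.1)
    v = 1.0 - math.prod(o / N for o in primary_counts)  # variance estimate, eq. (3.3)
    return z / math.sqrt(v)                             # adjusted residual, eq. (3.2)

# a second-order event observed 40 times; its two primary events occur
# 100 and 120 times respectively in N = 500 training vectors
r = adjusted_residual(40, [100, 120], 500)
print(r, abs(r) > 1.96)                                 # significant at the 95% level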
3.4 Classification
Classification in PD functions by combining the implications of extracted patterns to indicate a class label.
The patterns used are chosen based on their discriminative ability.
3.4.1 Weight of Evidence Weighted Patterns
In order to measure the discriminative power of a pattern, Wang (1997) suggests the use of the “weight of
evidence” statistic or WOE.
Definition 3.3 (Weight of Evidence):
Letting $(Y = y_k)$ represent the label portion of a given pattern $x^m_l$, the remaining portion (consisting of the
input feature values) is referred to as $x^\star_l$. The mutual information between these two components can be
calculated (Wong and Wang, 1997) using:

$$I(Y{=}y_k : x^\star_l) = \ln \frac{\Pr(Y{=}y_k \mid x^\star_l)}{\Pr(Y{=}y_k)} \tag{3.5}$$

A WOE in favour of or against a particular labelling $y_k \in Y$ can be calculated as

$$\mathrm{WOE}\!\left(\frac{Y{=}y_k}{Y{\ne}y_k} \,\middle|\, x^\star_l\right) = I(Y{=}y_k : x^\star_l) - I(Y{\ne}y_k : x^\star_l) \tag{3.6}$$

or

$$\mathrm{WOE} = \ln \frac{\Pr(x^\star_l, Y{=}y_k)\,\Pr(Y{\ne}y_k)}{\Pr(Y{=}y_k)\,\Pr(x^\star_l, Y{\ne}y_k)}. \tag{3.7}$$

WOE thereby provides a measure of how discriminative a pattern $x^\star_l$ is in relation to a label $y_k$ and
gives us a measure of the relative probability of the co-occurrence of $x^\star_l$ and $y_k$ (i.e., the “odds” of labelling
correctly).
The domain of WOE values is $[-\infty \cdots \infty]$, where $-\infty$ indicates those patterns ($x^\star_l$) that never occur in
the training data with the specific class label $y_k$; $\infty$ indicates patterns which only occur with the specific
class label $y_k$. These $\pm\infty$ valued WOE patterns are the most descriptive relationships found in the training
data set, as any non-infinite WOE indicates a pattern for which conflicting labels have been observed.
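Equation (3.7) can be evaluated directly from co-occurrence counts; the sketch below does so for a single pattern/label pair (all counts invented), including the infinite-valued cases just described:

import math

def woe(n_x_and_y, n_x_not_y, n_y, N):
    # eq. (3.7): WOE = ln[ Pr(x*, Y=yk) Pr(Y!=yk) / (Pr(Y=yk) Pr(x*, Y!=yk)) ]
    num = (n_x_and_y / N) * ((N - n_y) / N)
    den = (n_y / N) * (n_x_not_y / N)
    if den == 0.0:
        return math.inf                   # pattern never observed with a conflicting label
    if num == 0.0:
        return -math.inf                  # pattern never observed with label yk
    return math.log(num / den)

print(woe(30, 5, 100, 500))               # a finite, positive weight of evidence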
WOE-based support for each $y_k$ (possible class label) is evaluated in turn by considering the highest-order
pattern with the greatest adjusted residual from the set of all patterns occurring in an input data
vector to be classified, and accumulating the WOE of this pattern in support of the associated label. All
features of the input data vector matching this pattern are then excluded from further consideration for this
$y_k$, and the next-highest order occurring pattern is considered. This continues until no patterns match the
remaining input data vector, or all the features in the input data vector are excluded. This “independent”
method of selecting patterns attempts to accumulate their WOE in a way which estimates the accumulation
of the probabilities of independent random variables. Once this is repeated to consider each $y_k$, the label
with the highest accrued WOE is assumed as the highest likelihood match and this class label is assigned
to the input feature vector.
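A compressed sketch of this accumulation loop appears below; the pattern bookkeeping is simplified, with each pattern assumed to carry the set of features it matches, its order, its adjusted residual and its WOE (all values invented):

def support_for_label(patterns, features):
    # patterns: (feature_set, order, adjusted_residual, woe) tuples supporting one yk
    remaining, total = set(features), 0.0
    while True:
        live = [p for p in patterns if p[0] <= remaining]  # still fully matched
        if not live:
            break
        best = max(live, key=lambda p: (p[1], p[2]))       # highest order, then residual
        total += best[3]                                   # accumulate its WOE
        remaining -= best[0]                               # exclude the matched features
    return total

pats = [(frozenset({0, 1}), 3, 4.2, 2.1), (frozenset({2}), 2, 2.5, 0.7)]
print(support_for_label(pats, {0, 1, 2}))                  # 2.8; repeat per label, take max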
3.4.2 Pattern Absence
It is possible that an input vector may contain only primary events which are not associated with any
pattern. This is most likely to occur when an input event matches a hyper-cell that, in the training set,
occurred with a probability similar to that of the null hypothesis. In this case, no information about the
possible classification of this event is available, and no pattern is matched, as described in Figure 2.1 on
page 12. When this occurs, the classifier leaves an input vector unclassified; this will be a distinct outcome
in addition to the set of assigned labels, through the addition of an extra label. This behaviour
provides a strength not commonly seen in classifiers: the possibility of a graceful “no decision” in scenarios
in which insufficient information is available to make a robust decision.
The alternative action in this case would be to choose the class with the greatest overall probability of
occurrence, irrespective of x, i.e.,

$$Y = y_k \quad \text{such that} \quad k = \operatorname*{argmax}_{k=1}^{K} \Pr(y_k); \tag{3.8}$$

however, for DSS purposes these cases are intentionally flagged separately in the work described here.
3.5 Quantization: Analysis of Continuous Values
Originally, PD was developed to deal with integer or nominal data, terming the observation of each discrete
datum as an event.
In most real-world problems data is continuous-valued and must be quantized to be used by the PD
algorithm. In order to define events over continuous-valued input data, the feature values must first be
discretized. In this work, a marginal maximum entropy (MME) partitioning scheme as described by
Gokhale (1999) and Chau (2001) is used; this divides the continuous data into bins, with Q quantization
intervals per feature. The general idea is that for a specific feature the data values assigned to each bin
have a local similarity, and that each bin contains the same number of assigned feature data values.
This is achieved over the set of observed values by (a code sketch of these steps follows the list):
• sorting all the values for a given feature j, $j \in \{1, 2, \cdots, M\}$;
• dividing the sorted list into $q_j$ bins of $N/q_j$ values each;
• calculating a minimum cover or “core” of each bin;
• covering gaps between the calculated cores of adjacent bins by extending the bin boundaries to the
midpoint of the gaps.
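The listing below sketches these steps for a single feature (the bin count and data are invented; the gap-midpoint boundary placement follows the description above):

import numpy as np

def mme_bins(values, q):
    v = np.sort(values)                      # sort the observed feature values
    cores = np.array_split(v, q)             # q bins of (approximately) N/q values
    # place each inter-bin boundary at the midpoint of the gap between cores
    return np.array([(a[-1] + b[0]) / 2.0 for a, b in zip(cores[:-1], cores[1:])])

values = np.random.default_rng(1).normal(size=30)
edges = mme_bins(values, 5)
print(np.searchsorted(edges, 0.1))           # quantization interval for a new point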
Figure 3.1: MME Partitioning (top row: observed data events along the input dimension domain; second row: the resulting MME quantization intervals, numbered 0 to 4, marking inter-bin divisions, the inter-bin interval, each bin “core”, and the assignment of two new points, x0 and x1)

The creation of MME bounded divisions based on grouping of input data values is described in Figure 3.1, where the first two rows illustrate the construction of MME based input space quantization intervals. In the top row, an input space is shown with training feature data shown as circles. The distribution
of the circles along the axis indicates the values of the features observed; where feature values occur very
close together, the values are “piled up” on top of each other.
Using this data, the second row shows the division of the sorted data points into MME quanta of three
points each, creating five MME “cells” numbered 0 to 4. These MME cells have differing sizes, as the
distribution density of the continuous values is not constant; this has also caused variability in the widths
of the gaps between MME cell cores. There is no discernible gap between quanta 0 and 1 as there is no
gap between the third and fourth data point as counted from the left. In the case of the cores of cells 3 and
4, there is a substantial gap in which no points were recorded in the training set shown here.
These gaps are closed by expanding the bin and moving the boundaries to the midpoints of the gaps,
covering the input domain completely. New data points are then assigned to the bin whose interval includes
their value as shown in Figure 3.1 on the second row with the assignment of x0 and x1 to MME quantization
intervals 2 and 4, respectively. This assignment is not necessarily what is desired, as when there is a large
gap in the discretization of the training data there is considerable vagueness regarding the exact boundary
of the MME cells. Decisions made based on this type of data are more uncertain than decisions made if
the data falls into the core area of a bin.
There are two types of imperfection to be captured: the vagueness based on the location of the bin
boundaries, and uncertainty in the measurement of both the training and testing data points.

Figure 3.2: Example MME Division of 2-class Unimodal Normal Data (points of classes A and B plotted over the five MME bin boundaries constructed along each axis)
Figure 3.2 shows the MME divisions constructed for the two displayed classes, which are unimodal,
normally distributed data with some covariance. The training data used for the bin construction is overplotted on top of the bin boundaries. The data have been divided among five MME bins along each axis.
As can be seen, the MME divisions are constructed without regard to class association — the quantization
is done by counting points in both classes together (i.e., class-independent quantization). In regions of the
graph where the data points are sparse, the MME divisions are wider. Note that in the centre of the y-axis
where the density of the data points is high, a narrower bin will accommodate the same number of points;
along the x-axis, the bins are wider throughout as the separation of the distributions along this axis means
that the density is more constant than is the case in y.
3.6 Expectation Limits and Data Quantization
The relationship between decisions made by the PD algorithm in a continuous-valued domain and the
configuration of the system is dependent upon two major factors: the number of training records available
for PD, and the quantization resolutions $q_j$ for each feature that govern how many different MME bins
each feature is divided into.
Given a fixed number of training data records, it is important to choose the $q_j$ to ensure that enough
records are present to allow a statistically sound decision to be made regarding the discovery of each
possible high order pattern. As the occurrence of a high order event is simply the product of the occurrence
of the primary events (which MME attempts to keep equal across all bins), we can calculate an estimate
of this value by simply calculating a product relating the number of rows of training data available (N),
the quantization resolutions used for MME ($q_j$) and the number of features used to represent the class
distribution (M), which defines the highest order observable event. This relation is

$$E_{x^m_l} = N \prod_{i=1}^{m} \frac{1}{q_j}, \qquad j = \mathrm{index}(i, x^m_l), \tag{3.9}$$

in which the function $\mathrm{index}(i, x^m_l)$ selects the column index of primary event i within the poly-order
event $x^m_l$. This function is required as pattern $x^m_l$ may well be constructed via a subset of the M input
features available in x, and therefore j may not be valid for all values in the interval $\{1, \ldots, M\}$.
In equation (3.9), an increase in any q_j decreases E[x_l^m]; an increase in m (the order of the event) also decreases E[x_l^m]. Essentially, given a fixed amount of training data (or a fixed amount of information), dividing the data into a greater number of subdivisions reduces the amount of information in each. This is the well-known "curse of dimensionality" (Bellman, 1961) which affects any multi-variate data inference technique. To test high-order events for their possible significance as patterns we are therefore obliged to use a lower q_j, given the same N; consolidating our data at a coarser resolution ensures that each division has enough data to support a statistical inference procedure.
As a precaution against recording spurious patterns when E[x_l^m] is low, the PD classifier implemented will not consider as patterns any events for which the expected number of occurrences is less than 5. This prevents the discovery of high-order patterns caused by the occurrence of only one or two instances of a high-order event (instances that were possibly generated by chance). Such a pattern, if accepted, will in all likelihood have an infinite WOE, as the chance of observing the same randomly generated pattern with a different label is quite small.

The existence of such patterns may diminish classifier performance, especially when the total number of features, M, is quite large. An infinitely weighted but spurious pattern would then corrupt the evaluation of the remaining features in the induction of the correct class label.
Referring back to the introduction of PD in Figure 2.2 on page 13, we see that all cells in this figure will be discarded, as the expectation at this quantization places only one element into each cell. While first inspection may suggest that this restriction is unduly harsh, brief consideration shows that in fact there is far too little information to support any sound conclusion; indeed, the data pictured in Figure 2.2 could easily have occurred by random chance.
The purpose of equation (3.9) is to ensure that we do not extend the level of inference discovered
in a data set beyond the amount of information actually present; instead we restrict ourselves to the patterns which can be discovered with confidence, omitting patterns whose rationale cannot be rigorously
defended.
Using equation (3.9) it is therefore possible to choose Q = q_1 = q_2 = · · · = q_M based on knowledge of the maximum-order event expected and the number of training examples available.
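As a minimal illustration (Python; the function names are hypothetical), equation (3.9) and the expectation threshold of 5 introduced above can be used to bound the choice of a uniform resolution Q:

    def expected_occurrences(n_train, q, order):
        # E[x_l^m] of equation (3.9) with a uniform resolution q_j = q.
        return n_train / (q ** order)

    def max_resolution(n_train, max_order, min_expectation=5):
        # Largest uniform Q that keeps the expected count of the
        # highest-order event at or above the acceptance threshold.
        return max(int((n_train / min_expectation) ** (1.0 / max_order)), 1)

For example, max_resolution(1000, 2) gives Q=14: with N=1000 records and second-order events, resolutions above 14 risk accepting patterns supported by fewer than five expected occurrences.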
3.7 Summary
The PD algorithm provides a pattern (rule) selection mechanism based on a simple analysis of residuals, which are easy to calculate and straightforward to explain.

In order to adapt PD to a continuous domain, a class-independent quantization scheme was used. This MME quantization is transparent and simple to explain; however, any relatively coarse quantization has the potential to raise problems when the resolution of the data is high.

The primary advantage of these two techniques is their simple statistical basis, which allows excellent explainability.
Part II

Hybrid System Design and Validation on Synthetic Data
Chapter 4
Construction of Fuzzy Pattern Discovery
Research is what I’m doing when I don’t know what I’m doing.
— Wernher von Braun
A scheme to implement a "fuzzified" version of the PD algorithm is introduced in this chapter. Two main factors within the PD system are considered: the crisp nature of the quantization bins, and the lack of any fuzzy framework for the PD patterns.
4.1 Rationale
One of the primary advantages of a fuzzy inference system (FIS) is the linguistic framework supporting the inference and the resulting transparency of the decisions made. This framework allows the robust exploration of the knowledge structure necessary for use in decision support. A new FIS is introduced here, performing continuous and mixed-mode data classification based on rules recognized as patterns by the PD method of statistically based pattern extraction. The classification performance of this new FIS across a series of different class distributions is examined and discussed.

The rules utilized in this classification-based FIS are created by the PD algorithm, based on an MME quantization of the input feature space.

The creation of an FIS based on the PD framework will ameliorate the high cost of quantization incurred by an MME-based PD, as well as provide a linguistic basis for a DSS.
The main advantage of the rules exported from the PD algorithm is the transparency and statistical
validity provided by the analytical techniques used to create them. The main problem with the use of
these rules is the characterization of the weights associated with the rules by the PD algorithm; these
weights do not fall into the normalized fuzzy membership domain, but rather in the domain of the relative
likelihood of contrary observation, and are bounded only by [−∞ · · · ∞].
What is being proposed in this work, therefore, is a marriage between a probabilistically based rule extraction system and a fuzzy inference methodology for rule evaluation and expression. It may seem that this conjunction is a poor fit, as fuzzy researchers, beginning with Zadeh (1968), have long pointed out that fuzzy inference and probability are different things. What is proposed in this work is not a mixing of the metaphors of fuzzy and probability, but rather a co-operation between the two, using fuzzy sets to represent vagueness where appropriate and using probability theory to explain observation. By drawing on the strengths of the two paradigms together, the hybrid approach suggested here provides a context for decision making superior to either alone.
4.2 Fuzzification of Input Space Data Divisions
One of the advantages of fuzzy inference is the blurring of bin boundaries between adjacent input ranges and the corresponding reduction in quantization costs. The construction of fuzzy input membership functions for this purpose is described here.
Figure 4.1 extends Figure 3.1 from page 39, indicating the production of FIS input membership functions from the vagueness in the construction of the MME quanta. When examining Figure 4.1, we see there are two types of imperfection to be captured: the vagueness based on the location of the bin boundaries, and uncertainty in the measurement of both the training and testing data points.
Consider the boundaries of bin 0. When there is no gap between cells the cell-boundary vagueness is
low. The crisp, well-defined bin bounds of this cell allow the calculation of WOE to be performed with a
higher certainty than is possible when vagueness about the boundary location exists. As bin 1 (which will
define the “core” of a fuzzy set) is crisply defined by its borders with the “cores” of bins 0 and 2, any new
measured values in this feature in the vicinity of the core of bin 1 can be crisply assigned based on these
bounds.
In contrast, when there are significant gaps between the bin cores, such as those bordering bin 3, the
fuzzy concept of “vagueness” captures the imperfection present in the bin-boundary problem. As shown
on the lower-most line of Figure 4.1, this vagueness is captured when fuzzy input membership functions
are constructed with a flat central area (µ=1) corresponding to the defined region of the MME bin core,
and with extension of the support of the set into trapezoidal ramps projecting into the area of vagueness
between the quantized MME bins.
Figure 4.1: Fuzzy Mapping of MME Partitioning
4.2.1 Creation of Fuzzy Membership Functions
Using the following rationale, we create fuzzy membership functions based on the location of the bin core
regions:
• the degree of uncertainty in the boundary of a bin is related to both the width of adjacent bin cores
and the distance between them. Cores which abut have less uncertainty than those with a large
inter-core gap; similarly, the degree to which an inter-bin gap is “large” cannot be evaluated without
a knowledge of the width of the bin cores themselves;
• limiting the extent of the support ensures that the locality of inference regarding MME assertions is maintained (i.e., we maintain the assumption that as a measurement deviates from some fixed constant, assertions made on the measured value will begin to differ from those made on the constant);
• if the point is farther from known data than we can reasonably extend the locality of inference,
the preferred behaviour is to discard decisions altogether rather than make a classification based on
extrapolating behaviour trained using distant exemplars.
The support of a trapezoidal fuzzy set is therefore extended in each direction from the bin core in order
to expand the domain covered by the bin. The length of the extension of the support in each direction (the
length of the trapezoidal ramp) is set to the minimum of:
• the distance from the edge of the current bin core to the midpoint of the neighbouring bin core (to
limit the number of fuzzy membership functions which overlap);
• 1/2 the width of the current bin core (to restrict the maximum distance to which a bin may have influence, based on its width). A sketch of this construction in code follows.
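A minimal Python sketch of the ramp rule above (illustrative names; assuming cores are given as ascending (lo, hi) intervals):

    def trapezoids(cores):
        # Build one trapezoidal set per MME bin core: support [a, b],
        # full membership on the core [lo, hi].
        sets = []
        for i, (lo, hi) in enumerate(cores):
            half = (hi - lo) / 2.0
            if i > 0:
                p_lo, p_hi = cores[i - 1]
                left = min(lo - (p_lo + p_hi) / 2.0, half)
            else:
                left = half            # boundary bin: one-sided choice
            if i < len(cores) - 1:
                n_lo, n_hi = cores[i + 1]
                right = min((n_lo + n_hi) / 2.0 - hi, half)
            else:
                right = half
            sets.append((lo - left, lo, hi, hi + right))
        return sets

    def membership(x, trap):
        # Piecewise-linear membership of x in a trapezoidal set.
        a, lo, hi, b = trap
        if x < a or x > b:
            return 0.0
        if lo <= x <= hi:
            return 1.0
        return (x - a) / (lo - a) if x < lo else (b - x) / (b - hi)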
4.2.2 Fuzzy Membership Function Attributes
The resulting fuzzy membership functions may abut in one of three ways:
1. the overlap may be complete, as in the boundary between bins 0 and 1—in this case the support of
the set projects into the adjacent bin and the membership across the inter-core boundary never drops
below 1.0;
2. the regions of support of adjacent sets may overlap and cover the inter-core region, as in Figure 4.1
at point x0 which will be assigned to both fuzzy input membership sets 2 and 3 with non-zero
membership;
3. the regions of support may not meet, as is the case at point x1 , which is sufficiently far from both
neighbouring core areas that it is outside the extent of the support for both sets.
This third option causes the creation of regions in the input space which would have no assignment in the fuzzy logic system; in order to avoid the instability which this would otherwise cause, crisp membership functions are inserted into these areas, causing the output label "unknown" to be assigned if any input values fall into these regions.
The above scheme allows the point x1 (which was assigned to a distant MME bin in the PD row) to be discarded, producing a classification of "unknown" for x1; this allows the system to avoid performing a classification with insufficient confidence. This strategy maintains the traditional transparency of fuzzy
systems and extends the ability of a user to inspect the rationale of the decision through to the input
domain. The user can rely upon the statistical validity of the MME quantization (see Gokhale, 1999;
Chau, 2001) and further see that this quantization is not disturbed by unduly extending the support of the
fuzzy membership functions away from the training-defined core regions. This will allow the user to “drill
down” through the final rule firings to determine the input value matches and maintain a high degree of
confidence in the conservatism of the overall system. The mechanism supporting this will be discussed in
Chapter 10.
4.3 Use of Pattern Discovery Based Rules
Using the patterns provided by the PD algorithm as rules for a FIS also requires some thought for two
reasons: the pattern weightings used within the PD algorithm are not [0 . . . 1] bounded, and the PD logic
for pattern selection and use differs greatly from the firing of rules in a general FIS.
Aside from weighting and selection, the patterns created by the PD system are very similar to fuzzy
rules, as they enumerate the associations observed between input events and output labels. These can be
easily mapped into the association of membership in (collections of) input fuzzy sets with membership in
output universes describing the degree of association with each label.
The PD algorithm generates, however, both positive logic patterns which support a given class, and
negative logic patterns which indicate that the associated class is unlikely, given the input events observed.
Each positive logic pattern provides a vote supporting a single class. Negative logic patterns may also
exist, decreasing the support for the indicated class, and other positive logic patterns may exist increasing
the support for a conflicting conclusion. Each possible class may therefore have several assertions in
support or in refutation of its candidacy as a class label. The FIS must therefore consider assertions of
both support and refutation for each possible class; these assertions must be combined into one overall
assertion value, Ak , describing the support or refutation for each candidate label k ∈ K.
The WOE values associated with PD patterns provide a description of their discriminative power. In order to process these as rules weighted using WOE, a provision is introduced within the structure of the hybrid FIS for accumulating assertions from rules that have infinite weights.

In order to facilitate this accumulation, the consequents of all rules are examined as hypotheses supporting or rejecting each output class.
Three schemes to implement this mapping are described in the following sections:
• a Mamdani (1974) based inference using WOE weighted rules;
• a mapped WOE weighting scheme using fuzzy inference and
• a simpler occurrence based weighting scheme with fuzzy inference.
4.3.1 Mamdani Inference Using WOE Weighted Rules
This strategy is supplied to give a baseline comparison in performance, using a deliberately simple fuzzy set method based on Mamdani's (1974) initial implementation of a fuzzy logic system.

Each rule (pattern) will fire and generate an assertion a_{x_l^m K} in an output space for the associated class K, based on the WOE of the rule, using the ranges and assertions shown in Table 4.1.
Table 4.1: Mamdani Output Values for WOE

    WOE Value            Output Class Name       Centre
    [ω/2 . . . ∞]        SUPPORTED                1
    (0 . . . ω/2)        SOMEWHAT-SUPPORTED       0.5
    (−ω/2 . . . 0]       SOMEWHAT-REFUTED        −0.5
    [−∞ . . . −ω/2]      REFUTED                 −1
The value ω in the table is calculated as

\[ \omega = \max\left(|WOE_i|\right) \quad \forall i : |WOE_i| \neq \infty, \tag{4.1} \]

where i indexes the list of all rules, and | · | indicates absolute value.
Each output set described in Table 4.1 is defined as an impulse-based set, or fuzzy singleton: a set whose support is a single value,

\[ V_{x_l^m K} = \frac{\mu(a_{x_l^m K})}{a_{x_l^m K}}, \tag{4.2} \]

where

a_{x_l^m K} is the value created by the firing of a fuzzy rule through the use of Table 4.1, based on pattern x_l^m supporting class K; and

µ is the fuzzy membership match of the rule firing.

(The fraction here is the conventional singleton notation of membership over support point, not a division.)
Firing of rules using this Mamdani-style assertion logic creates a collection of singletons in one of K
defuzzification universes, each of which describes the degree of support for a particular classification.
The degree of the rule firing is used to scale the membership of the impulse set, combining the degree
of input membership match of the rule (pattern) with the assertion value provided by the WOE value of
the rule.
After all rules have been processed, a centroid-based calculation is applied to the universe of discourse
to achieve a scalar output in the usual manner. The value of this scalar Ak is then taken as an assertion
of support or refutation of this classification. This assertion has the range [−1 . . . 1] where −1 is total
refutation, 1 is total support and 0 indicates no opinion.
This is similar to the "certainty factor" introduced by the MYCIN system (Shortliffe, 1976; Buchanan and Shortliffe, 1984).

This procedure will be termed "M" in the discussion section.
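As a sketch only (Python; illustrative names, and only finite WOE values assumed), the M mapping of Table 4.1 and the closing centroid calculation can be written as:

    def mamdani_assertion(woe, omega):
        # Map a rule's WOE onto a singleton centre per Table 4.1;
        # omega is the largest finite |WOE| over all rules (eq. 4.1).
        if woe >= 0.5 * omega:
            return 1.0            # SUPPORTED
        if woe > 0.0:
            return 0.5            # SOMEWHAT-SUPPORTED
        if woe > -0.5 * omega:
            return -0.5           # SOMEWHAT-REFUTED
        return -1.0               # REFUTED

    def centroid(singletons):
        # Defuzzify (membership, centre) singletons into a scalar
        # assertion A_k in [-1 ... 1].
        den = sum(mu for mu, _ in singletons)
        return sum(mu * a for mu, a in singletons) / den if den else 0.0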
4.3.2 Using WOE Directly With Fuzzy Rules
The main problem with the use of WOE weighted rules is the fact that the rules have possibly infinite
weights. The range [−∞ · · · ∞] cannot be applied directly within the standard t-norm/t-conorm proposed
by Zadeh (Yager, Ovchinnikov et al., 1987) as any conflicting ±∞ values will leave the results of the
calculation undefined. A further consideration is that WOE values are measured in “relative likelihood”
and are therefore not [0, 1] bounded; instead arbitrarily large finite values may be observed, along with
any infinities.
The unbounded WOE-based assertion for a particular classification expresses the "degree of support" the rule provides for the given classification: larger positive values indicate stronger support, and larger negative values indicate stronger refutation of the classification.
To convert WOE values into a space bounded by [−1 . . . 1] in the defuzzification universe, the WOE-based assertions are normalized by the maximum finite WOE value recorded in the PD rule set, independent of class. This ensures that all the finite WOE assertions fall into the range [−1 . . . 1]. The net effect of this normalization is to cause all a_{x_l^m K} assertions to fall into three groups:

1. assertions which fall into the range [−1 . . . 1];

2. assertions whose value is ∞; and

3. assertions whose value is −∞.
Adjusted Centroid Calculation for Infinite WOE
Assertions in the [−1 . . . 1] range can be considered using a standard centroid calculation. In the infinite cases, the support/refutation value is simply calculated by counting the number of infinities. If more +∞ values are observed than −∞ values, an assertion of 1 is produced. If more −∞ values are present, an assertion of −1 is produced. If there are an equal number of positive and negative infinite assertions, the output is marked "unknown," regardless of the outcome of the assertions for the other class possibilities.
As each rule (each x_l^m) asserts a value (A_k) in the range [−∞, −1 . . . 1, ∞], we term this "WOE" weighting.
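A minimal sketch of this adjusted centroid (Python; illustrative names) is:

    import math

    def adjusted_centroid(assertions):
        # `assertions` is a list of (membership, value) pairs, with the
        # finite values already normalized into [-1 ... 1].
        pos = sum(1 for _, a in assertions if a == math.inf)
        neg = sum(1 for _, a in assertions if a == -math.inf)
        if pos or neg:
            if pos > neg:
                return 1.0
            if neg > pos:
                return -1.0
            return "unknown"      # balanced infinities: no decision
        den = sum(mu for mu, _ in assertions)
        return sum(mu * a for mu, a in assertions) / den if den else 0.0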
4.3.3 Occurrence Weighted Fuzzified PD Rules
A drawback of the Mamdani-style system is the fact that the granularity of the weight is lost in the assignment to one of a discrete number of output sets (assertion values). A similar potential problem with pure WOE weighting is the loss of resolution through the combination of infinite values. It is therefore of interest to construct a functional mapping from rule weights to output assertions.
Definition 4.1 (Occurrence Based Weighting):
A relative measure (W_{x_l^m}) of the discriminative power of the rules (x_l^m) is created by using a weighting based on the number of occurrences of the supporting rule in the training data:

\[ W_{x_l^m} = \begin{cases} \dfrac{o_{x_l^m}}{o_{x_l^\star}} & \text{if } r_{x_l^m} \geq 1.96, \\[2ex] \dfrac{o_{x_l^m} - e_{x_l^m}}{e_{x_l^m}} & \text{if } r_{x_l^m} \leq -1.96, \end{cases} \tag{4.3} \]

where

o_{x_l^m} indicates the number of occurrences of the event defining rule l (x_l^m), including the class label, as used in equation (3.2);

e_{x_l^m} is the expected number of occurrences of the event defining rule l (x_l^m); and

o_{x_l^\star} indicates the number of occurrences of only the input value portion of the event (i.e., event x_l^m with the class label column excluded), noting that this input event may also occur with other class labels.

Note that in (4.3), the first proposition produces values in the range [0 . . . 1], and the second values in the range [−1 . . . 0]. Values of r_{x_l^m} ∈ (−1.96 . . . 1.96) will not be observed for significant patterns (or rules).
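A direct transcription of Definition 4.1 (Python; argument names are illustrative):

    def occurrence_weight(o_rule, o_input, e_rule, r_adj):
        # o_rule:  occurrences of the full event, class label included
        # o_input: occurrences of the input portion alone (any label)
        # e_rule:  expected occurrences of the full event
        # r_adj:   adjusted residual of the pattern
        if r_adj >= 1.96:                      # supporting pattern
            return o_rule / o_input            # in [0 ... 1]
        if r_adj <= -1.96:                     # refuting pattern
            return (o_rule - e_rule) / e_rule  # in [-1 ... 0]
        raise ValueError("not a statistically significant pattern")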
This method can be considered an implementation of the techniques of Takagi and Sugeno (1985) and Sugeno and Kang (1988): it provides a functional representation which combines the membership value of the rule firing and a W_{x_l^m} value for the rule weighting into the output of a function producing a set of fuzzy singletons, which are collected together and defuzzified using a centroid defuzzification algorithm.

Each rule x_l^m is allowed to assert a value a_{x_l^m k} ∈ [−1 . . . 1], supporting or refuting some possible classification k of input value x. Overall support for each class, A_k, is thereby based on a possibilistic scheme and is independent of the support for any other classes.

As this algorithm creates assertions supporting or refuting an output classification weighted purely on the observed occurrence of sub-events, we term this "O" weighting.
4.3.4 Selection of a Classification Label
For all the above schemes, once an assertion is calculated for each class for which any rule fires, the label is selected as follows:

1. locate the class containing the centroid which asserts the maximum value:

\[ k^\star = \operatorname*{argmax}_{k=1}^{K} A_k \tag{4.4} \]

2. if A_{k^\star} > 0 then Y = y_{k^\star};

3. otherwise Y = y_unknown.

In the case where Y = y_unknown, the classifier has produced a "soft failure"; that is, the classifier has decided that the given input data does not provide sufficient discriminative information to produce a reliable decision, and therefore no decision will be made. In a decision support context this characteristic is a welcome one, as it enhances the reliability of the overall system.
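The selection logic reduces to a few lines (Python; illustrative names):

    def select_label(assertions):
        # `assertions` maps each candidate class label to its A_k.
        if not assertions:
            return "unknown"
        k_star = max(assertions, key=assertions.get)   # equation (4.4)
        return k_star if assertions[k_star] > 0 else "unknown"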
4.4 Selection and Firing of PD Generated Rules
In the PD algorithm, the occurrence of a pattern is assumed to consume the information associated with
the primary events describing the pattern. For this reason, each input feature can support at most one rule
firing, in order to maintain the assumption of statistical independence of the input features. The rule to
be fired is selected by evaluating the order and adjusted residual of all rules matching the input vector.
Conversely, the fuzzy methodology allows all rules to fire, re-using the information in the associated input
values, and letting the rule and membership weights govern to what extent each rule contributes to the
final solution.
An implementation of the PD based independent selection is presented and evaluated against the
standard fuzzy rule firing scheme, with the results included as Chapter 7. This comparison allows us to
evaluate the “fuzzification” of the PD system by stages, comparing the performance as each feature of the
PD algorithm is adapted to the fuzzy framework.
4.4.1 Fuzzy Rule Firing
This strategy, termed "A" in the results, refers to the standard fuzzy logic evaluation in which all rules are fired. For rules whose input value falls outside the support of their defining membership functions, a zero membership value is attached to the assertion provided by the rule.
4.4.2 Independent Rule Firing
This strategy, termed "Ind" in the results chapters, provides a rule evaluation procedure similar to that used in the PD system.
The algorithm proceeds through the following steps (a code sketch follows the list):
1. produce a list L of all fuzzy variables provided through fuzzification of the input values which have
a non-zero membership;
2. set search order o to be N, the number of inputs to the system;
3. place all rules of order o whose precedents exist in the list L into a list of matches, M. If this search
fails, repeat after setting o := o − 1;
4. if o = 1 has been reached and no matches have been found, then stop;
5. find the rule ρmax in M which has the highest adjusted residual;
6. fire rule ρmax , generating consequents as described in Section 4.4.1;
7. remove from L all the variables matching the precedents of rule ρmax ;
8. if L is empty, then stop;
9. repeat, starting from step 3, noting that the value of o continues to decrease with each iteration.
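A sketch of these steps (Python; the rule representation is an assumption: (precedents, order, adjusted_residual, fire) tuples, where `precedents` is the set of fuzzy variables the rule requires and `fire` produces its consequents):

    def independent_firing(rules, fuzzified):
        live = set(fuzzified)                          # step 1
        order = max((r[1] for r in rules), default=0)  # step 2 (max order used here)
        fired = []
        while order >= 1 and live:                     # steps 4 and 8: loop exit
            matches = [r for r in rules
                       if r[1] == order and r[0] <= live]   # step 3
            if not matches:
                order -= 1                             # lower the search order
                continue
            best = max(matches, key=lambda r: r[2])    # step 5: max residual
            fired.append(best[3]())                    # step 6: fire the rule
            live -= best[0]                            # step 7: consume inputs
        return fired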
4.5 Summary
This chapter has outlined the extension of PD rules into a FIS, along with some changes to the inference
required to evaluate the PD based rules. Three rule weightings and two rule firing techniques have been
presented.
Rule Weighting:

• M – rule weighting using a lookup table of fixed output assertions with rule-based membership;

• WOE – rule weighting with an adapted centroid calculation to manage possible infinite values;

• O – rule weighting with simplified, non-infinite rule weights.

Rule Firing:

• A – standard fuzzy firing of all rules;

• Ind – firing of PD-style rules, within a fuzzy context.

The next chapters evaluate the performance of these various strategies.
Chapter 5
Synthetic Class Distribution Data and Analysis Tools
What is research but a blind date with knowledge?
— Will Harvey
In order to evaluate the hybrid system, several synthetic data distributions have been created. Synthetic data is used for the analysis in this part of the work because the properties of the data are known in advance, allowing deeper insight into both the successes and the failures of the hybrid system with respect to attributes arising from the construction of the data sets.
Four main types of synthetic data have been produced: unimodal covaried data, log-Normal covaried
data, bimodal covaried data and spiral data. The construction of each of these types will now be explained.
5.1 Covaried Class Distributions
Covariance matrices for 4-feature N(0, 1) data were generated by using the variance values

\[ V = \{160, 48, 16, 90\} \tag{5.1} \]

arranged along the diagonal of a generating covariance matrix (Cov_ii = V_i). Each off-diagonal element Cov_ij was calculated through

\[ Cov_{ij} = \kappa \sqrt{(Cov_{ii})(Cov_{jj})} \tag{5.2} \]
    CovA = [  160.00  −52.58  −30.36  −72.00
              −52.58   48.00   16.63   39.44
              −30.36   16.63   16.00   22.77
              −72.00   39.44   22.77   90.00 ]

    CovB = [   48.00   16.63   39.44  −52.58
               16.63   16.00   22.77  −30.36
               39.44   22.77   90.00  −72.00
              −52.58  −30.36  −72.00  160.00 ]

    CovC = [   16.00   22.77  −30.36   16.63
               22.77   90.00  −72.00   39.44
              −30.36  −72.00  160.00  −52.58
               16.63   39.44  −52.58   48.00 ]

    CovD = [   90.00  −72.00   39.44  −22.77
              −72.00  160.00  −52.58   30.36
               39.44  −52.58   48.00  −16.63
              −22.77   30.36  −16.63   16.00 ]

Table 5.1: Covariance Matrices for 4 Classes
using a κ value of 0.6.
This produced the covariance matrix for class A. The covariance matrix for class B was produced by setting

\[ Cov_{ii}^{B} = V_{(i+1) \bmod 4}. \tag{5.3} \]

Classes C and D were produced by substituting i+2 and i+3, respectively, into (5.3). The resulting covariance matrices used to generate the 4-class data are shown in Table 5.1 for reference.
To use a covariance matrix to create data for some class k, a N(0, 1) data set with zero covariance was randomly generated, and a transform T was calculated to transform the data to have the desired covariance:

\[ T_k = \Phi_k \sqrt{\lambda_k}, \qquad P_k = \left(T_k P_{N(0,1)}^{T}\right)^{T} + \mu_k, \tag{5.4} \]

where

M^T is the transpose of some matrix M;

Φ_k is the matrix of eigenvectors derived from the covariance matrix for this class (Cov_k);

λ_k is the vector of eigenvalues for the covariance matrix;

P_{N(0,1)} are the N(0, 1) uncoloured points;

T_k is the "colouring" transform to be applied;

µ_k is the mean vector; and

P_k are the final coloured points.
These matrices generate clouds of data which intersect non-orthogonally and which have differing
variances and covariances in each dimension, and in each class. The covariance values themselves have
been chosen to provide both strong and weak covariances, across the various dimensions.
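As an illustration (Python; hypothetical names), the colouring transform of equation (5.4), read here as T_k = Φ_k √λ_k, can be applied as:

    import numpy as np

    def colour_points(cov, mean, n, rng=None):
        # Draw N(0,1) points, then impose the target covariance and mean
        # using the eigendecomposition of `cov`.
        rng = rng or np.random.default_rng(0)
        lam, phi = np.linalg.eigh(np.asarray(cov, dtype=float))
        t = phi @ np.diag(np.sqrt(np.maximum(lam, 0.0)))   # T_k
        white = rng.standard_normal((n, len(mean)))        # P_N(0,1)
        return white @ t.T + np.asarray(mean)              # P_k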
Class separations were produced using a combination of the variances of classes A and B within the set of covariance matrices, where the separation vector c was calculated using

\[ c_i = \sqrt[4]{(Cov_{ii}^{A})(Cov_{ii}^{B})}. \tag{5.5} \]

The four classes were separated into different quadrants in Euclidean space by projecting the mean vector of each class away from the origin by separation factors of

\[ S = \left\{ \tfrac{1}{8}, \tfrac{1}{4}, \tfrac{1}{2}, 1, 2, 3, 4 \right\}, \qquad s_i \in S, \tag{5.6} \]

and combining s_i with c_i from (5.5) to produce centres for each factor of s located at (+½ s_i c_i, +½ s_i c_i), (+½ s_i c_i, −½ s_i c_i), (−½ s_i c_i, +½ s_i c_i) and (−½ s_i c_i, −½ s_i c_i), respectively.
5.2 Covaried Log-Normal Class Distributions
In order to evaluate the performance of the classifiers on data which is not N(0, 1) distributed, data was generated with a log-Normal distribution. This data was produced by taking each point generated in a N(0, 1) distribution using the method just described and calculating

\[ p' = e^{p\xi}, \qquad p \in N(0, 1), \tag{5.7} \]

to transform each point p to be log-Normally distributed, controlled by the shape parameter ξ. Each point in the resulting log-Normal distribution is then transformed using (5.4), using the covariance matrix.
Figure 5.1: Bimodal Data Points
Separate experiments were done with ξ ∈ {1.0, 1.5, 2.0}. The skewness of the resulting class distributions was on average 4.58 when ξ=1.0, 13.80 when ξ=1.5, and 27.33 when ξ=2.0.
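A sketch of the log-Normal generation (Python; hypothetical names), combining equation (5.7) with the colouring of (5.4):

    import numpy as np

    def lognormal_points(cov, mean, n, xi=1.0, rng=None):
        rng = rng or np.random.default_rng(0)
        p = rng.standard_normal((n, len(mean)))
        skewed = np.exp(p * xi)                            # eq. (5.7)
        lam, phi = np.linalg.eigh(np.asarray(cov, dtype=float))
        t = phi @ np.diag(np.sqrt(np.maximum(lam, 0.0)))
        return skewed @ t.T + np.asarray(mean)             # eq. (5.4)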
5.3 Covaried Bimodal Class Distributions
The bimodal data was created by starting with the s_i cluster locations of the linearly separable class distributions and then adding a second set of points for each class. The location of the second cluster is set by projecting the mean away from the origin in a diametrically opposite direction, with an extra translation of 4√v_max, where v_max is the maximum variance value specified in the set of variances, V (i.e., 160.0). Thus, along with a cluster of points centred at (s, s), a second cluster would be placed at (−4s√v_max, −4s√v_max). Each cluster contained half the total number of points.
This algorithm was repeated for all sets, generating a layout of pairs of clusters around the origin,
shown in the sample data illustrated in Figure 5.1 for s=8, which displays the first two dimensions of the
four-feature data and shows clearly that no line can be drawn across the field to separate any one class
from any other.
In Figure 5.1, the four modes shown in the centre (within the dotted square) indicate the distribution
of points for Unimodal data with s=8.
5.4 Spiral Class Distributions
2-class data was produced by using the spiral equation

\[ r = \rho (2\pi\theta) + r_0, \tag{5.8} \]

which relates the input variable θ to a radius, where r_0 is the base radius at which the spiral begins and ρ is the scale, or acceleration, of the spiral.
The spiral defined in (5.8) accelerates out from (0, 0). For each (r, θ) point chosen on this spiral, a
value from a N(0, 1) distribution was chosen to apply a scatter in radians to the θ value of each generated
point, while maintaining the same radius. To generate 4 feature points, a second scatter is chosen, to
perturb the data in the third and fourth dimensions also. The second class was generated by choosing a
similar set of points. Separation was introduced by rotating the entire set around the origin by a specified
amount in units of π radians. Maximum separation therefore is 1.0. Data was generated using σ=1.0,
ρ=0.5 and r_0=0.125. Two dimensions of a sample pair of class distributions with a separation of 1.0 (a half-turn, or 1.0π) are shown for 2-dimensional data in Figure 5.2.
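A two-dimensional sketch of this generator (Python; hypothetical names, and the range of θ is an assumption, as it is not specified above):

    import numpy as np

    def spiral_points(n, rotate=0.0, rho=0.5, r0=0.125, sigma=1.0, rng=None):
        # `rotate` separates the classes, in units of pi radians.
        rng = rng or np.random.default_rng(0)
        theta = rng.uniform(0.0, 1.5, n)          # position along the spiral
        r = rho * (2.0 * np.pi * theta) + r0      # equation (5.8)
        # Scatter in radians applied to the angle; radius maintained.
        ang = 2.0 * np.pi * theta + rng.normal(0.0, sigma, n) + rotate * np.pi
        return np.column_stack((r * np.cos(ang), r * np.sin(ang)))

    class_a = spiral_points(1000)
    class_b = spiral_points(1000, rotate=1.0)     # maximum separation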
5.5 Training Error
For each class distribution studied, the effect of training error T_err on the performance of the classifiers was examined. In each summary table a column indicating T_err=0.1 is included, indicating that 10% of the records for each class have had their true label value replaced with a value chosen randomly from the other class labels.
5.6 Jackknife Data Set Construction
For each class distribution 11 sets of randomly-generated data from each class were produced to create
training and testing data sets. The data were then combined using a jackknife procedure. This popular
performance measurement methodology is discussed in Duda et al. (2001) and is widely used in the
Figure 5.2: Spiral Data Points
literature. Essentially, the total set of Ω available data records is divided into a set of G groups, with N records per group. Each group G_g, g ∈ {1, . . . , G}, will be used once for testing; when group g is being tested, the training data is formed by concatenating all other records together:

\[ T_g = \bigcup_{\substack{i=1 \\ i \neq g}}^{G} G_i. \tag{5.9} \]

In this way, each record is used for testing once, and used for training G − 1 times.
In each experiment described here, G=11 separate jackknife runs have been created. The number of training records differs among experiments; however, in each case the term N will be used to indicate the number of training vectors used for each class, and (1/10)N will indicate the number of testing records. Specifically, when N=1000 training vectors per class are used, there are 100 testing points per class. This amount of training data reflects the quantity of data available for several real-world problems.
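The jackknife construction of equation (5.9) is a short generator (Python; illustrative names):

    def jackknife_splits(groups):
        # Each of the G groups is used once for testing; the training
        # set is the concatenation of all remaining groups.
        for g, test in enumerate(groups):
            train = [rec for i, grp in enumerate(groups) if i != g
                     for rec in grp]
            yield train, test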
5.7 Classifiers Used in Comparison
Classification accuracies of the original PD and the hybrid PD/FIS were compared with those of back-propagation (BP) and a minimum inter-class distance (MICD) classifier in a series of experiments involving both linearly and non-linearly separable class distributions. The MICD classifier is simply a Naïve Bayesian classifier, and will be defined shortly.

In this work q_j was constant for all M input features and was set by the quantization resolution value Q.
5.7.1 Back-Propagation Classifier Construction
A simple BP network was constructed using the algorithms provided in Simpson (1991, pp. 112–120) and in Hertz et al. (1991, pp. 115–120). The learning rate and momentum for the classifiers were fixed across all BP configurations using the typical values of 0.0125 and 0.5, respectively. Training for each experiment was run for a maximum of 10^5 epochs, until the overall error dropped below 2.5 × 10^−3, or until the derivative of the error dropped below 1.25 × 10^−3. Several choices for the number of hidden nodes were studied (H ∈ {5, 10, 20}).
5.7.2 Minimum Inter-Class Distance Classifier Construction
The MICD classifier takes an input record and applies a whitening transform calculated from the observed covariance. The smallest distance to a class mean in the whitened space is then the optimal maximum likelihood match, provided the true class distribution is Normally distributed and linearly separable, and sufficient points are available for the estimate of the covariance.
The transform for class k is given by Duda et al. (2001, p. 41) as

\[ \delta_k = x^T W_k x + w_k^T x + w_{k0}, \tag{5.10} \]

using transform components calculated by

\[ W_k = -\tfrac{1}{2} \left(Cov^{\star}_k\right)^{-1}, \qquad w_k = \left(Cov^{\star}_k\right)^{-1} \mu_k, \qquad w_{k0} = -\tfrac{1}{2}\, \mu_k^T \left(Cov^{\star}_k\right)^{-1} \mu_k - \tfrac{1}{2} \ln\left|Cov^{\star}_k\right| + \ln \Pr(k). \tag{5.11} \]

In equation (5.11),

M^{-1} represents the inverse of matrix M;

x represents the vector of feature values, as in the discussion of PD;

Cov*_k is the covariance matrix calculated for class k;

µ_k contains the mean vector for class k;

Pr(k) is the overall probability of occurrence of class k; and

| · | is the matrix determinant operation.
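As a sketch (Python; hypothetical names), the MICD decision of equations (5.10)–(5.11) can be evaluated in the algebraically equivalent Mahalanobis form:

    import numpy as np

    def micd_fit(points_by_class):
        # Estimate mean, covariance and prior for each class.
        total = sum(len(p) for p in points_by_class.values())
        return {k: (np.mean(p, axis=0), np.cov(np.asarray(p), rowvar=False),
                    len(p) / total)
                for k, p in points_by_class.items()}

    def micd_classify(x, model):
        x = np.asarray(x, dtype=float)
        best, best_d = None, -np.inf
        for k, (mu, cov, prior) in model.items():
            inv = np.linalg.inv(cov)
            # Equal to (5.10)-(5.11) after expanding the quadratic term.
            d = (-0.5 * (x - mu) @ inv @ (x - mu)
                 - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))
            if d > best_d:
                best, best_d = k, d
        return best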
5.8 Weighted Performance Measure
To compare performance across classifiers, a statistic was constructed to summarize results across all separations into a single scalar value.

Definition 5.1 (Weighted Performance Summary Measure):
Summarizing over all separations, with greater emphasis placed on data from lower separations, is performed using

\[ P = \frac{\displaystyle\sum_{i=0}^{|S|} \frac{p_i^{\mathrm{classified}}}{s_i}}{\displaystyle\sum_{i=0}^{|S|} \frac{1}{s_i}}, \tag{5.12} \]

in which s_i is a separation value as defined in equation (5.6), the value |S| indicates the number of separations tested, and

\[ p^{\mathrm{classified}} = \left(\frac{n_{\mathrm{classified}}}{n_{\mathrm{total}}}\right)\left(1 - \frac{n_{\mathrm{error}}}{n_{\mathrm{classified}}}\right), \tag{5.13} \]

where

n_classified represents the number of records classified;

n_error is the number of records incorrectly classified; and

n_total is the total number of records processed (the sum of those classified and those left unclassified through assignment to the unknown class label).

This generates a single (scalar) value representing the overall performance of the classifier, weighted across all separations, such that performance at lower separations (where the problem is harder) is given more weight.
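A direct transcription of Definition 5.1 (Python; illustrative names):

    def weighted_performance(results, separations):
        # `results` holds one (n_classified, n_error, n_total) tuple per
        # separation s_i; low separations receive the most weight.
        num = den = 0.0
        for (n_cls, n_err, n_tot), s in zip(results, separations):
            p = (n_cls / n_tot) * (1.0 - n_err / n_cls) if n_cls else 0.0
            num += p / s                       # numerator of (5.12)
            den += 1.0 / s
        return num / den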
5.9 Summary
This chapter has described a series of synthetic class distributions to be used for analysis. The motivation for evaluation on synthetic data is the available knowledge of the problem domain and the relationship this knowledge has to the resulting performance.

Once the PD/FIS system is evaluated on synthetic data, real-world examples will be used to further test the best of the proposed FIS classifiers in later chapters.

Natively continuous classifiers (BP, MICD) have been chosen as comparative test cases: on the continuous class distributions described here their expected performance is very high, providing a real and fair test for the PD/FIS algorithm.
Chapter 6
Synthetic Data Analysis of Pattern Discovery
You cannot have a science without measurement.
— Richard W. Hamming
Results evaluating PD are presented in the following sections, organized by the type of class distribution. The purpose of this chapter is to establish a performance baseline for the MME quantized PD system
when functioning as a classifier in the continuous domain.
6.1 Covaried Class Distribution Results
Table 6.1 displays the weighted performance results calculated using equation (5.12) for the covaried Normal class distributions. Figure 6.1 displays the results over all separations tested for N=1000, and Figure 6.2 for N=10,000 training examples. The graphs in these figures are clipped at 0.5 to allow the lines to be more clearly visible, and the x-axis is plotted in log_2 for clear separation of the data points.

In all of these displays, the PD classifier implementation suggested by Wang (1997), using independent pattern selection and WOE weightings, is termed PD(WOE/Ind); the modified O weighting using all existing patterns is termed PD(O/A). BP is shown only with H=10 hidden nodes, as this was the highest-performance configuration tested for both values of N.
Table 6.1: Weighted Performance: Covaried Class Distribution

    Classifier           N=100       N=500       N=1000      N=1000,      N=10,000
                                                             T_err=0.1
    PD(WOE/Ind);Q=5      0.51±0.07   0.61±0.03   0.66±0.03   0.65±0.02    0.69±0.01
    PD(WOE/Ind);Q=10     0.52±0.07   0.64±0.03   0.63±0.02   0.63±0.02    0.69±0.01
    PD(WOE/Ind);Q=20     0.48±0.05   0.60±0.03   0.61±0.02   0.60±0.02    0.66±0.01
    PD(O/A);Q=5          0.50±0.07   0.65±0.03   0.66±0.03   0.65±0.02    0.68±0.01
    PD(O/A);Q=10         0.51±0.07   0.64±0.03   0.66±0.02   0.65±0.02    0.69±0.01
    PD(O/A);Q=20         0.44±0.04   0.59±0.03   0.60±0.02   0.59±0.02    0.68±0.01
    BP;H=5               0.56±0.04   0.63±0.02   0.61±0.03   0.62±0.02    0.59±0.01
    BP;H=10              0.60±0.04   0.69±0.02   0.71±0.02   0.70±0.02    0.70±0.01
    BP;H=20              0.56±0.07   0.69±0.01   0.72±0.02   0.69±0.01    0.72±0.01
    MICD                 0.74±0.07   0.74±0.03   0.74±0.02   0.73±0.02    0.74±0.01
Figure 6.1: Covaried Data N=1000 Results
Figure 6.2: Covaried Data N=10,000 Results
The covaried class distributions are quite rich in high-order information, and were easily separated by the MICD classifier, which can be seen providing an upper bound in each of Figures 6.1 and 6.2.

The performance of the PD classifier was somewhat lower than that of BP and MICD, though a small difference between the PD(O/A) and PD(WOE/Ind) results is visible in Figure 6.1, with the PD(O/A) performance being noticeably higher at lower separations. Once a separation of 1σ is reached, there is no longer a difference between the different PD classifier implementations.
Figure 6.2 shows no real difference in performance between either of the PD classifiers or the BP classifier, though the MICD results show that a marked difference still exists in all these cases between the recorded performance and optimality.
These analyses are supported by an examination of Table 6.1, which reports a generally stronger performance for PD(O/A) versus PD(WOE/Ind) (both being somewhat weaker than the BP classifier) and a significant improvement in this linearly separable case when using the MICD classifier. Considering the T_err=0.1 column in Table 6.1, we can see that the PD classifiers tolerate moderate training data error.
Table 6.2: Weighted Performance: Log-Normal Class Distribution (N=10,000; 4 classes/4 features)

    Classifier           LogNormal   LogNormal   LogNormal
                         ξ=1.0       ξ=1.5       ξ=2.0
    PD(WOE/Ind);Q=5      0.93±0.00   0.86±0.01   0.82±0.01
    PD(WOE/Ind);Q=10     0.94±0.00   0.87±0.01   0.82±0.01
    PD(WOE/Ind);Q=20     0.93±0.00   0.84±0.01   0.77±0.01
    PD(O/A);Q=5          0.92±0.01   0.83±0.01   0.79±0.01
    PD(O/A);Q=10         0.93±0.00   0.86±0.01   0.82±0.01
    PD(O/A);Q=20         0.91±0.00   0.83±0.01   0.80±0.01
    BP;H=5               0.95±0.00   0.86±0.01   0.78±0.02
    BP;H=10              0.96±0.00   0.92±0.01   0.86±0.01
    BP;H=20              0.97±0.00   0.94±0.01   0.90±0.00
    MICD                 0.91±0.01   0.61±0.03   0.36±0.02
Overall, for these class distributions, it is apparent that Q=10 performs the strongest for the PD-based classifiers; there is a notable performance decrease when comparing tests using Q=10 against those using Q=5 or Q=20.
6.2 Covaried Log-Normal Class Distribution Results
As shown in Figure 6.3, the performance of the PD classifier remained high as the data distribution deviated from Normal, while the MICD classifier performance dropped remarkably as the skewness of the
data increased. This is summarized across all classifiers in Table 6.2, in which it can be seen that the BP
classifier responds to changes in skewness with a stability comparable to that of the PD classifiers.
6.3 Covaried Bimodal Class Distributions Results
As there is no single hyper-plane which can linearly divide any two classes for the bimodal and spiral
data, MICD is no longer really useful as a classifier, and the comparison between PD and BP classifiers
becomes much more important.
The results for the bimodal class distributions are shown across all distributions for N=1000 in Figure 6.4, and are summarized for all bimodal class distributions studied in Table 6.3.
Figure 6.3: Log-Normal Data N=1000 Results
The PD classifiers with N=1000 and with Q=20 or Q=30 again showed poor performance; however, with lower Q or greater N, the PD classifier performance matched that of the other classifiers. Note that the BP classifier performance suddenly decreases at high separation for these class distributions.
As seen in Table 6.3, training data error has little effect on the performance of the PD classifiers, as
shown in the T_err=0.1 column. While this is largely true for the BP algorithm as well, the MICD classifier
shows a strong sensitivity to this form of error, due to the inaccuracies this error will introduce into the
estimates of the mean and covariance values.
6.4 Spiral Class Distributions Results
The results of experiments using spiral class distributions are summarized in Table 6.4.
We see in these results that the performance of the PD classifiers remains close to that of the BP classifiers, and that Q=10 provides good performance in all of the cases examined. The performance of Q=20 is high when N=10,000; however, when N=1000 the performance is much lower.
Table 6.3: Weighted Performance: Bimodal Class Distribution

    Classifier           N=100       N=500       N=1000      N=1000,      N=10,000
                                                             T_err=0.1
    PD(WOE/Ind);Q=5      0.69±0.05   0.73±0.02   0.78±0.02   0.78±0.02    0.82±0.01
    PD(WOE/Ind);Q=10     0.73±0.05   0.79±0.02   0.79±0.01   0.79±0.02    0.83±0.00
    PD(WOE/Ind);Q=20     0.73±0.05   0.78±0.02   0.79±0.02   0.78±0.02    0.82±0.00
    PD(O/A);Q=5          0.70±0.05   0.77±0.03   0.80±0.01   0.80±0.01    0.81±0.01
    PD(O/A);Q=10         0.72±0.05   0.81±0.02   0.82±0.01   0.81±0.01    0.84±0.01
    PD(O/A);Q=20         0.71±0.06   0.77±0.02   0.78±0.02   0.78±0.01    0.83±0.00
    BP;H=5               0.70±0.03   0.73±0.04   0.74±0.01   0.75±0.03    0.74±0.02
    BP;H=10              0.73±0.03   0.82±0.02   0.82±0.01   0.81±0.03    0.82±0.01
    BP;H=20              0.70±0.05   0.83±0.03   0.85±0.01   0.82±0.02    0.85±0.01
    MICD                 0.73±0.04   0.73±0.02   0.73±0.01   0.67±0.01    0.73±0.01
Table 6.4: Weighted Performance: Spiral Class Distribution

    Classifier           N=100       N=500       N=1000      N=1000,      N=10,000
                                                             T_err=0.1
    PD(WOE/Ind);Q=5      0.17±0.07   0.61±0.04   0.65±0.03   0.64±0.03    0.67±0.01
    PD(WOE/Ind);Q=10     0.14±0.04   0.58±0.05   0.70±0.03   0.68±0.02    0.74±0.01
    PD(WOE/Ind);Q=20     0.29±0.02   0.59±0.02   0.63±0.02   0.61±0.02    0.74±0.01
    PD(O/A);Q=5          0.17±0.02   0.62±0.04   0.65±0.03   0.64±0.03    0.64±0.01
    PD(O/A);Q=10         0.14±0.04   0.57±0.04   0.70±0.03   0.69±0.02    0.73±0.01
    PD(O/A);Q=20         0.07±0.02   0.32±0.04   0.43±0.03   0.37±0.04    0.73±0.01
    BP;H=5               0.60±0.06   0.71±0.03   0.73±0.02   0.67±0.05    0.71±0.01
    BP;H=10              0.62±0.08   0.71±0.05   0.75±0.03   0.74±0.03    0.79±0.02
    BP;H=20              0.63±0.09   0.76±0.01   0.77±0.03   0.73±0.02    0.81±0.01
    MICD                 0.50±0.07   0.49±0.03   0.49±0.03   0.50±0.02    0.50±0.01
Figure 6.4: Covaried Bimodal Data N=1000 Results
The relative performance of the PD classifiers shows that the PD(O/A) classifier performed more strongly than the PD(WOE/Ind) classifier; note, however, the extremely poor performance of the PD(O/A) classifier in the Q=20, N=100 tests in this case.

Here too, the T_err=0.1 column shows a stability against training data error, with the PD classifiers being less affected than the BP classifiers.
6.5 Discussion
The results of these evaluations clearly show that the PD algorithm is an effective data mining tool. Furthermore, these results demonstrate that PD classifiers can be effective components within a higher-level decision support system. The results also show that the PD classifiers are sensitive to having sufficient training data, though their requirements do not exceed those of other popular classifiers, such as BP. Both PD
and BP classifiers (Rumelhart et al., 1986; Minsky and Papert, 1988; Simpson, 1991; Hertz et al., 1991)
are trained in a similar way; a set of training data is examined and the essential relationships describing the
data are extracted. These relationships or patterns are then used in classification to provide labels for new
test data. PD and BP classifiers can be applied to linearly and non-linearly separable class-distributions.
The primary difference between the two classifiers lies in the structure of the decision space. The
PD classifiers construct a contingency table from discretized training values forming a hyper-cell division of the input space and make classification decisions using a nonlinear-weighted information-theory
or occurrence based estimation of the most likely class calculated using the patterns occurring in the input data vector. BP classifiers make a decision by selecting several optimal hyper-planes and making a
classification by performing a regression on multiple weighted hyper-planes within the subdivided space.
Both PD and BP classifiers benefit from large amounts of representative, labelled training data and
have configuration parameters that affect their classification performance. When PD is applied to continuous-valued
of features and the amount of training data available is important. The number of hidden nodes and the
learning rate and momentum are important factors for BP. Therefore, the performance of various configurations of these two classification schemes was compared so that some insight into the impact of these
factors on the practical use of PD and BP classifiers could be obtained.
6.5.1 Covaried Class Distributions
The results for the covaried data demonstrate that as the value of Q rises, the performance may fall, as shown in Table 6.1. This behaviour is a result of the need for sufficient data to reach the expectation limit of (3.9) within the PD classifier to reliably discover high-order patterns. Without high-order patterns (maximum 5th order for this data set), performance suffers.

We also see that the performance of the PD classifiers is reasonably close to the optimal performance of an MICD classifier for these simple, linearly separable class distributions. This implies that a PD classifier can reach optimal classification performance when the quantization adequately represents the underlying data set and sufficient N allows optimal pattern discovery (i.e., all existing high-order patterns are found).
With a high value for Q and a small training data set size, an insufficient number of observed events
will occur for the highest order events to be confidently observed and discovered as patterns. As a result,
events which may form patterns in the underlying data set are not discovered during training, in turn
providing a poorer pattern space in which to make decisions during classification. This is the process
responsible for the low performance for Q=20 and N=100 or 1000. Increasing the size of the training
data set will overcome this as the number of occurrences of each high order event will rise; this is not
a satisfactory solution in all cases however, as often sufficient training data is simply not available. In
such cases, where reliable training data is difficult to produce, lowering the value of Q to produce a
coarser division of the feature space may provide a more viable alternative. Considering the accuracy of
the classification when Q=10 in Table 6.1 we see that choosing a lower value of Q does not necessarily
penalize the performance of the PD classifier; instead, for these class distributions, the strong ability to
generalize arising from the patterns discovered allows correct classification decisions to be made while
discretizing the features at a lower resolution.
It is notable that PD classifiers provide correct decisions even when the separation is very low; decisions which approach the optimal MICD bound in this linearly-separable decision space.
The difference between the performance of the PD(WOE/Ind) and PD(O/A) classifiers shows that the infinite weighting of the PD(WOE/Ind) scheme may be over-weighting some of the pattern assertions, and that a more generalized "centre of mass" approach, such as that produced by a more linear weighting and the firing of all patterns, may avoid this over-weighting problem.
6.5.2 Log-Normal Class Distributions
When examining the log-Normal data in Figure 6.3 and Table 6.2, it is apparent that the performance of the PD classifier is independent of the distribution of the underlying random elements of the data, while the assumption made by the MICD classifier that the distribution is Normal penalizes its performance.

As was mentioned when describing the log-Normal distributions in Section 5.2, as the shape parameter ξ is increased, the skewness of the distributions increases. Figure 6.3 indicates that as ξ increases and the distributions become more skewed, the performance of the MICD classifier drops, until by ξ=2 the MICD classifier is essentially guessing.
In contrast, the PD classifier performance is only slightly affected as skewness increases, even though
in the tails of these distributions there is now insufficient data available for PD to be able to create acceptable patterns to characterize this space. The performance of the PD classifier is very stable compared
with that of the MICD classifier in this case. This demonstrates that the PD classifier performance is not
strongly tied to the inherent shape of the class distribution, nor to the distribution of the noise present during measurement. In particular, assumptions that either of these distributions be Normal are not required.
6.5.3 Covaried Bimodal Class Distributions
The bimodal class distributions in Table 6.3 and Figure 6.4 are not linearly separable, but still contain a high degree of internal structure, as the covariance of each mode matches the covariance in the unimodal case.
All the non-linear classifiers found this problem relatively easy, out-performing the MICD classifier from the outset. Notable again was the deviation in the performance of the PD classifiers as Q changes; again, Q=10 was the optimal value, because of the balance between discretization resolution and statistically sufficient expectation. In particular, the performance of Q=20 was noticeably lower: there were not enough training samples to support this level of quantization, so high-order patterns were not discovered. These class distributions are clearly divided, although in a non-linear way, and the class divisions follow the orthogonal orientation of the feature quantization space, so PD performance was quite similar for Q=5 and Q=10.
Here again, the performance of the PD(O/A) implementation is better than that of the PD(WOE/Ind) weighted scheme, indicating that the WOE-based weightings may again be confusing the classification system into choosing an incorrect classification. As N (the size of the training set) increases, this effect becomes much less noticeable.
The apparent performance anomaly in BP as separation increases in this bimodal distribution test is
due to local minima problems. Once the separation increases to the point where the inner modes can be
well distinguished, the bimodal data becomes a continuous version of the XOR problem — a problem
well known in the machine learning community for its need for a classifier which can function in a nonlinear space. While BP can solve this problem, it is known to be “hard” for BP to find a solution, as
several minima exist in the error space. When examining the individual performance numbers within the
BP jackknife tests, it was found that several of the classifiers had a performance of exactly 0.5, indicating
that they have failed to separate the modes of two of the classes. If these failed tests are removed, the BP
performance also saturates at 1.0 along with the other classifiers tested.
6.5.4 Spiral Class Distributions
For the spiral data, the effect of N can be clearly seen. Noting the performance of the PD classifiers with
Q=20 when N=1000, it can be seen that the PD classifiers do not have a large enough training set to
characterize this complex data. Once a training set of sufficient size is available, or if the quantization
value is kept reasonably low, the PD classifier performance rises to rival that of the BP classifier, which
performs admirably in this case.
Table 6.4, when N=1000, demonstrates the strengths and weaknesses of the PD classifier:
• as the shape of the underlying distributions is curved, the resolution of the discretization bins needs to be high; thus Q=5 has poor performance;
• the number of training examples is not sufficient to support 20 quantization intervals so performance
is very low for Q=20;
• using a Q value high enough to capture as much resolution in the data as possible without defeating
the threshold placed on expectation (i.e., Q=10), a performance comparable to BP can be achieved.
At high Q and low N, the PD(O/A) classifier performs abysmally, as there are not enough training examples to discover any patterns when separation is low. This leads to a large number of unassigned values, and biases the performance statistic to low values.
When examining the results for each of the jackknife sets, it was found that at separation 0.125, no
rules were produced at all for the N=100 case, only 2 rules were produced at separation 0.5, and the largest
number of rules produced was 8, all of order 2. In contrast, at N=1000 up to 92 rules were produced to
characterize this data.
The low number of rules produced indicates that only a small number of the hyper-cells are covered; in
the remaining hyper-cells, no classifications will be performed. In cases where classification does occur,
it is performed based on few and conflicting rules.
The PD(O/A) classifier is particularly affected in this case as the few rules established are
of low order and exhibit a high degree of conflict. This leads to nullification of the accumulated weights
and causes incorrect assignments based on small amounts of poor information.
6.5.5 Overall Performance Analysis
For a given value of Q, as N is increased, the performance of the PD classifiers improves relative to that of the BP classifiers. This performance improvement is a function of the ability of the PD algorithm to discover higher-order patterns with the requisite statistical confidence (i.e., an expected occurrence $e_{x_l^m} \ge 5$). The discovery of these higher-order patterns allows more accurate classification decisions to be made.
The interplay between N and Q is such that if Q is too low, increasing N will have little effect. Conversely, if Q is to be set to a high value, a large N will be required before any high order patterns are
available. For the continuous-valued class distributions studied, it seems that Q=10 provides a reasonable
compromise between sufficient quantization resolution and the ability to discern high order patterns without the need for training sets which are unlikely to be available in practice. Correspondingly, data sets
of size no larger than N=1000 are sufficient to support characterization through PD patterns and achieve
high performance during classification.
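To make the interplay between N and Q concrete, the following minimal Python sketch estimates the largest pattern order a training set can support, under the simplifying assumption that the null-hypothesis expectation for an order-o hyper-cell is roughly N/Q^o; the function name and the bound of 5 are illustrative stand-ins for the exact expectation constraint of (3.9), not a reproduction of it:

    def max_supported_order(n_train, q, min_expected=5):
        """Largest pattern order o whose null-hypothesis expectation,
        approximated here as n_train / q**o, still meets the bound.
        A rough sketch only: assumes uniform, independent marginals."""
        order = 0
        while n_train / q ** (order + 1) >= min_expected:
            order += 1
        return order

    for q in (5, 10, 20):
        print(q, max_supported_order(1000, q))
    # under this approximation, N=1000 supports order 3 at Q=5,
    # order 2 at Q=10 and only order 1 at Q=20

This back-of-envelope calculation mirrors the observed behaviour: at N=1000, Q=20 starves the discovery of high-order patterns, while Q=10 retains useful resolution.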
When examining the effect of differing data distributions, it was shown that the PD algorithm is largely
insensitive to variations in data topology, and is not reliant on assumptions such as Normality. This is
expected, as like BP, the underlying concepts of the PD algorithm avoid any specific assumptions about
data distribution topology. The only assumption made by the PD algorithm is the null-hypothesis that
unrelated events are independent and uniformly distributed.
The minimal effect of training error on the performance of the PD classifiers is due to the statistical rigour used in defining a pattern. This allows erroneously labelled training examples to be ignored, provided they do not occur a statistically valid (and therefore unlikely) number of times. Training errors therefore will not affect the patterns discovered, nor subsequent classifications made.
It is clear that the performance of the PD classifier itself is, in many cases, improved with the use of O (occurrence) weighting and the firing of all the patterns. This implies that the WOE weighted, "independent" pattern selection strategy has some weakness which is compensated for by O weighting
and firing of all patterns. When examining the underlying data records which differ between the two
algorithms, it was found that the WOE statistic may be over-weighting the patterns found.
Examining the differences in performance between PD(WOE/I) and PD(O/A) in Tables 6.1 and 6.3, it is clear that the performance differences between the algorithms decrease as N increases. This suggests that the under-performance of the WOE algorithm is related to low N. In turn, this implies that WOE weighting over-weights patterns based on low numbers of training events, which
in turn contributes to erroneous classifications. This may be caused by the variability in the estimate of
WOE. The patterns themselves are correct, as shown by the higher performance obtained by using the
same patterns with different weights; it is the weighting values themselves which are flawed.
The PD classifiers performed well for all of the class distributions studied, even though in general,
they are disadvantaged by the fact that the orthogonality and interval distribution of their discretization
space is created without regard to class boundaries. The decision surfaces of the MICD classifiers can
therefore out-perform those of PD classifiers by creating an optimal hyper-plane when the data is linearly
separable. The BP classifier can similarly create non-orthogonal planes to represent the class distributions,
in both linearly separable data and in the general cases. In this regard, improved PD classifier performance
could be expected if class-dependent discretization schemes were used.
6.6 Conclusion
The PD and BP classifiers performed comparably on the selection of class distributions studied; using
simple linearly separable class distributions there is little difference, and when examining non-linearly
separable class distributions the PD classifier performance is at worst comparable.
Both the PD and BP classifier require the setting of control parameters in order to function optimally:
• for the PD classifier, the main constraints are to have sufficient quantization resolution to adequately
represent the important aspects of the class distributions while maintaining the expectation bound
of (3.9); the general desire is to increase the number of quantization intervals up to the point where the expected number of occurrences in a hyper-cell would fall below the bound of statistical validity. Statistical uncertainty can be easily avoided with only a cursory examination of the dimensionality of the class distributions and knowledge of the number of training examples available.
• for the BP classifier, the number of hidden nodes, learning rate and momentum must be chosen.
Suitable values for each of these can be dependent on class distribution, making selection difficult
without extensive knowledge of the data or significant experimentation.
The original authors of the PD algorithm (Wang, 1997; Wong and Wang, 1997, 2003; Wang and
Wong, 2003) have evaluated its performance on a variety of ordinal and discrete-valued data problems.
The results presented here suggest that the PD classifier, using MME quantization, can be successfully
used for analysis of continuous- or mixed-value data as well.
The current results demonstrate that the performance of the PD algorithm used as a classifier and
applied to continuous-valued data is comparable to the well-accepted MICD and BP classifiers across a
number of class distributions, and show that the PD classifier performance is robust.
The amount of training data available places constraints on the extent to which input data can be quantized, and can limit the performance of PD classifiers. When compared with BP classifiers, it is evident that the cost of this quantization is not overly large. The ability to confidently configure PD classifiers,
their strong absolute and relative performance when applied to continuous- or mixed-valued data and the
transparency and strong statistical basis for the patterns discovered should allow PD classifiers to be successfully applied to classification problems where the rationale and confidence of the underlying decision
is important. The combination of the data mining abilities of the PD algorithm and the transparency of PD
based classifications provide the framework in which the context required for effective decision support
and analysis is provided.
6.7 Summary
While the PD algorithm performance may be exceeded by natively continuous classifiers such as BP and
MICD, the difference is not very great.
The PD algorithm is stable across distributions; in particular the log-Normal distribution performance
of PD is quite good while the Bayesian MICD algorithm has significant problems as skewness increases.
There is also stability of the PD algorithm across training error. As these tests have proven successful on the base PD algorithm, they will not need to be repeated for the derived FIS.
The performance of PD is also stable with respect to N, within the bounds set by the occurrence
estimation calculation of (3.4).
Of the results in this chapter, one of the most significant is that within the PD system, the implementation of the new O based weighting has higher performance (in terms of the number of correct
decisions made) than the WOE based standard PD algorithm. A further important result is that while the performance of the PD system is lower than that of BP, it remains consistently high across a variety of data distributions.
Chapter 7
Synthetic Data Analysis of Fuzzy Inference System
There’s no sense in being precise when you don’t even know what you’re talking about.
— John von Neumann
In this chapter the results of experimental evaluations of the performance of the fuzzified PD classification algorithm are provided. These results are organized by the type of class distribution used for the experiments. After the results are presented, a full analysis is given in Section 7.4, Discussion. These evaluations provide a measure of PD/FIS performance when functioning as a classifier. Measuring this performance allows a discussion of the extent to which the label values suggested by the system are correct; once the system has been measured in this way, the system's knowledge of its own confidence can be discussed. This discussion of confidence is placed in Chapter 9.
7.1 Covaried Data Results
Figure 7.1 shows results based on the covaried data tests trained with N=1000 records per jackknife
training set. All experiments are summarized in Table 7.1 using the weighted performance statistic of
(5.12). The data in Table 7.1 are mean values, with corresponding standard deviations. As in the last
chapter, the graph in Figure 7.1 is clipped at 0.5 to allow the lines to be more clearly visible, and the
x-axis is plotted in log2 for clear separation of the data points.
Table 7.1: Covaried Class Distribution Summary

Classifier                       N=100       N=500       N=1000      N=10000
FIS(Mamdani/All,crisp;Q=10)      0.49±0.07   0.60±0.03   0.64±0.02   0.68±0.01
FIS(Mamdani/All;Q=10)            0.54±0.08   0.64±0.03   0.66±0.02   0.70±0.01
FIS(Occurrence/All,crisp;Q=10)   0.47±0.07   0.63±0.03   0.66±0.02   0.69±0.01
FIS(Occurrence/All;Q=10)         0.53±0.08   0.65±0.03   0.67±0.02   0.70±0.01
FIS(Occurrence/Ind;Q=10)         0.53±0.08   0.61±0.03   0.62±0.03   0.70±0.00
FIS(WOE/All;Q=10)                0.54±0.07   0.64±0.03   0.67±0.02   0.70±0.01
FIS(WOE/Ind;Q=10)                0.54±0.08   0.63±0.03   0.65±0.02   0.70±0.01
PD(WOE/I);Q=10                   0.52±0.07   0.64±0.03   0.63±0.02   0.69±0.01
PD(O/A);Q=10                     0.51±0.07   0.64±0.03   0.66±0.02   0.69±0.01
BP;H=5                           0.56±0.04   0.63±0.02   0.61±0.03   0.59±0.01
BP;H=10                          0.60±0.04   0.69±0.02   0.71±0.02   0.70±0.01
BP;H=20                          0.56±0.07   0.69±0.01   0.72±0.02   0.72±0.01
MICD                             0.74±0.07   0.74±0.03   0.74±0.02   0.74±0.01
The various fuzzy implementations are displayed as “FIS(weighting-scheme/rule-firing-method),” where
“I” rule firing indicates the PD based independent plan, and “A” indicates the standard fuzzy rule firing.
In both Figure 7.1 and Table 7.1, it is clear that the FIS using occurrence based weights and using all
rules has the highest performance of any PD based system. This performance is exceeded by both the BP
system and the (optimal) MICD classifier.
Figure 7.1 shows that the performance of all the classifiers is well-behaved with separation: the expected monotonic performance increase with separation is visible, and it is clear that the choice of weighting algorithm has only a subtle effect, as the lines for all candidate PD based classifiers form a tight
spectrum across the graph. Indeed, excepting MICD and BP it is difficult to determine the identity of any
other line on the graph.
Referring therefore to the numeric data in Table 7.1, we see that the BP classifiers show higher performance than the PD based systems across all separations; however, the BP results in all cases are lower than those of the MICD classifier. When N=10,000, the BP classifier with H=10 hidden nodes has a correct classification rate of 0.70±0.01; this is equalled by the PD based FIS. When N is low, the PD based FIS performance is almost equal to that of the BP classifier.
[Figure 7.1: Covaried Class Distribution Results. Fraction of test samples correct on covaried Normal data versus log2 separation (s), for 1000 training points and Q=10; curves shown for FIS(Mamdani/All), FIS(Mamdani/All,crisp), FIS(WOE/Ind), FIS(Occurrence/All), FIS(Occurrence/All,crisp), MICD, PD(WOE/Ind), PD(Occurrence/All) and BP(H=10).]
7.2 Bimodal Data Results
Bimodal class distribution results for N=1000 are plotted in Figure 7.2, and a complete summary is provided in Table 7.2. The MICD classifier is no longer optimal because, as mentioned in the last chapter, the data is not linearly separable. The performance for this classifier is presented for completeness and comparison purposes.
The BP classifiers again have superior performance relative to the PD-based classifiers, but as seen
in the results, the performance of the FIS(O/A) classifier approaches the BP classifier results
while still under-performing at all separations.
7.3 Spiral Data Results
Spiral results for N=1000 are shown in Figure 7.3, and complete results are presented in Table 7.3.
As a single hyper-plane drawn across the data will simply cleave each set in half, the MICD performance here is 0.5. Again, the BP classifier performance is the highest of all tested classifiers. The
Table 7.2: Bimodal Class Distribution Summary

Classifier                       N=100       N=500       N=1000      N=10000
FIS(Mamdani/All,crisp;Q=10)      0.71±0.04   0.79±0.03   0.81±0.02   0.83±0.01
FIS(Mamdani/All;Q=10)            0.74±0.04   0.82±0.02   0.83±0.02   0.54±0.19
FIS(Occurrence/All,crisp;Q=10)   0.71±0.05   0.80±0.03   0.82±0.02   0.84±0.00
FIS(Occurrence/All;Q=10)         0.74±0.05   0.82±0.03   0.84±0.02   0.85±0.01
FIS(Occurrence/Ind;Q=10)         0.74±0.05   0.79±0.02   0.79±0.02   0.52±0.18
FIS(WOE/All;Q=10)                0.72±0.05   0.80±0.02   0.82±0.02   0.84±0.01
FIS(WOE/Ind;Q=10)                0.72±0.05   0.79±0.02   0.80±0.02   0.84±0.01
PD(WOE/I);Q=10                   0.73±0.05   0.79±0.02   0.79±0.01   0.83±0.00
PD(O/A);Q=10                     0.72±0.05   0.81±0.02   0.82±0.01   0.84±0.01
BP;H=5                           0.70±0.03   0.73±0.04   0.74±0.01   0.74±0.02
BP;H=10                          0.73±0.03   0.82±0.02   0.82±0.01   0.82±0.01
BP;H=20                          0.70±0.05   0.83±0.03   0.85±0.01   0.85±0.01
MICD                             0.73±0.04   0.73±0.02   0.73±0.01   0.73±0.01
Table 7.3: 4-feature Spiral Class Distribution Summary

Classifier                       N=100       N=500       N=1000      N=10000
FIS(Mamdani/All,crisp;Q=10)      0.29±0.07   0.58±0.04   0.69±0.03   0.72±0.01
FIS(Mamdani/All;Q=10)            0.36±0.07   0.64±0.04   0.71±0.03   0.74±0.01
FIS(Occurrence/All,crisp;Q=10)   0.29±0.07   0.58±0.04   0.70±0.03   0.72±0.01
FIS(Occurrence/All;Q=10)         0.35±0.08   0.64±0.04   0.71±0.03   0.76±0.01
FIS(Occurrence/Ind;Q=10)         0.35±0.08   0.64±0.04   0.69±0.03   0.70±0.01
FIS(WOE/All;Q=10)                0.35±0.08   0.64±0.04   0.71±0.03   0.73±0.01
FIS(WOE/Ind;Q=10)                0.35±0.08   0.64±0.05   0.70±0.03   0.72±0.01
PD(WOE/I);Q=10                   0.14±0.04   0.58±0.05   0.70±0.03   0.74±0.01
PD(O/A);Q=10                     0.14±0.04   0.57±0.04   0.70±0.03   0.73±0.01
BP;H=5                           0.60±0.06   0.71±0.03   0.73±0.02   0.71±0.01
BP;H=10                          0.62±0.08   0.71±0.05   0.75±0.03   0.79±0.02
BP;H=20                          0.63±0.09   0.76±0.01   0.77±0.03   0.81±0.01
MICD                             0.50±0.07   0.49±0.03   0.49±0.03   0.50±0.01
[Figure 7.2: Bimodal Class Distribution Results. Fraction of test samples correct on bimodal β=0.5 data versus log2 separation (s), for 1000 training points and Q=10; curves shown for FIS(Mamdani/All), FIS(Mamdani/All,crisp), FIS(WOE/Ind), FIS(Occurrence/All), FIS(Occurrence/All,crisp), MICD, PD(WOE/Ind), PD(Occurrence/All) and BP(H=20).]
unmodified PD classifier outperforms some of the FIS classifiers when N=10,000; however, the FIS(O/A) classifier remains the top performing PD based classifier overall.
Table 7.4 indicates the number of records left unclassified by the three PD based classifiers studied.
Similar data is not presented for the other class distributions because for these distributions no records
were left unclassified. It is clear from Table 7.4 that the FIS with fuzzy support (i.e., the trapezoidal
ramps) was able to classify records which were left unclassified by the crisp methods.
7.4 Discussion
The results shown here suggest that the rules produced by PD induce correct, highly confident classifications when implemented by the FIS algorithm. In addition, the use of these rules under a number of input
and output weighting schemes provides performance which meets or exceeds the performance of a crisp
PD classifier. We see here the amelioration of the effects of quantization through the use of fuzzy input
membership functions.
[Figure 7.3: 4-feature Spiral Class Distribution Results. Fraction of test samples correct on spiral data versus log2 separation (π), for 1000 training points and Q=10; curves shown for FIS(Mamdani/All), FIS(Mamdani/All,crisp), FIS(WOE/Ind), FIS(Occurrence/All), FIS(Occurrence/All,crisp), MICD, PD(WOE/Ind), PD(Occurrence/All) and BP(H=20).]
PD based rule generation techniques are similar to techniques that use statistical clustering for input space segmentation and contingency tables for generating rule weights (Kukolj, 2002; Chen et al., 2001; Chen, 2002). However, the PD techniques use adjusted residual analysis and, as implemented here, MME quantization. These techniques both differentiate this work and provide a transparency of a type not present in the other techniques. MME quantization is class independent; while class dependent quantization can be applied within the PD framework (Ching, Wong and Chan, 1995) to possibly improve performance, MME quantization may allow a more direct interpretation of the bins created, as knowledge of specific class characteristics is not required to interpret the quantization intervals used to define rules, and hence may contribute to a more transparent classifier.
The adjusted residual analysis provides statistically valid tests to select patterns that are then known
to express valid relationships between specific feature values and specific class memberships and are thus
selected as rules for classification. Using contingency tables to solely provide rule weightings while using
all events as rules for classification may cause confusion due to the occurrence of conflicting statistically
insignificant events. Furthermore, adjusted residual analysis allows both positive and negative classification assertions to be made that can take advantage of information both supporting and refuting specific classifications.

Table 7.4: Spiral Class Distribution Unclassified Records

Classifier    s=0.125        s=0.250        s=0.500        s=0.750   s=1.000
FZ (ramps)    none           none           none           none      none
FZ (crisp)    0.013±0.0049   0.003±0.0031   0.001±0.0029   none      none
PD            0.013±0.0049   0.004±0.0032   0.001±0.0029   none      none
7.4.1 Performance Across Class Distributions
The performance of the PD-based fuzzy classifiers studied was robust across three very dissimilar synthetic data distributions, indicating that the rules produced by the PD algorithm are robust, and that the
adaptation of these rules to provide a fuzzy logic framework maintains the power and discriminative properties of the PD rules, while at the same time overcoming some of the cost incurred by quantization of the
input data.
7.4.2 Crisp versus Fuzzy
Comparing the results of the FIS(O/A) experiments with those using FIS(O/A,crisp), the improvement in the FIS produced by establishing the fuzziness of the input membership functions is obvious. Specifically, modelling quantization vagueness contributed to the performance increases represented in Tables 7.1, 7.2 and 7.3. The fuzzification of the bin boundaries allows the FIS with fuzzy bins
to make correct decisions through the use of additional information.
The reduction in the number of errors occurs through the firing and use of more (and possibly better)
rules near inter-class feature value boundaries. These rules were created for use in the adjacent quantization bins. The validity of their assertions extends into neighbouring bins with a high degree of observed
accuracy, improving the overall performance.
7.4.3 Performance Across Weighting Schemes
It is also apparent from Tables 7.1, 7.2 and 7.3 that the use of occurrence based weighting is superior
to the WOE mechanism when used within an FIS; this is unsurprising as evaluation of the “occurrence”
based rule weighting provides superior performance within PD itself, as seen in the last chapter. In all the
result tables it is clear that the difference in performance among the various PD algorithms is small. This
bears witness to the admirable strength of the PD-produced rules, supporting the use of several weighting
schemes with generally strong performance.
Comparing the results of the FIS(O/A) and FIS(WOE/I) weighting schemes, it appears
that performance when using FIS(WOE/I) weighting is subject to a penalty due to an over-weighting of
the fuzzy rules. This is consistent with the evaluation of the PD system itself where PD(O/A)
has performance measures which out-perform those of PD(WOE/I).
As seen in Tables 7.1, 7.2 and 7.3, the performance improvement of FIS(O/A) over
FIS(WOE/I) is related to the improvement seen in the last chapter of PD(O/A) over PD(WOE/I) in Tables 6.1, 6.3, and 6.4.
7.4.4 Performance Based on Training Set Size
As N is increased, the performance of all of the FIS and PD classifiers increases. At low N, the PD based
systems show a stable performance and are still able to extract salient rules from complex data. Examining
Table 7.3, it is apparent from the performance of the PD versus FIS classifiers that the implementation
of fuzzy membership has allowed the few rules created with extremely low N to be used more often,
increasing performance at N=100 from 0.14 to 0.35. This performance improvement is possible because
of the strength of the discovered rules: even at such low N, the rules discovered are trustworthy.
7.4.5 Cost of Quantization
The use of a discrete algorithm in a continuous domain often incurs a cost of reduced performance due
to the quantization performed. One of the features of fuzzification of quantized data is to reduce the cost
of quantization and improve the performance of discrete algorithms towards that which is exhibited by
well considered natively continuous algorithms. The data in Table 7.4 suggests that fuzzification of the
input space models some of the vagueness associated with quantization of continuous-valued input data,
allowing records which are left unclassified by the crisp classifiers to be classified by the FIS using rules
which are obviously still valid.
In the spiral class distribution problem, adjacent bins frequently support differing labels simply due
to the topology of the class distributions. Table 7.4 is therefore particularly interesting as it shows that
the ramps extend the “region of effect” of a given rule into parts of the data space which are difficult to
characterize based on MME quantization, but which still can be covered successfully using the ramped
decreasing membership of rules in adjacent bins. The use of rules from adjacent bins provides good
performance, even though the spiral class distribution would suggest that this is difficult.
As seen in all the above results, the FIS(O/A) based classifier exceeds the performance of
all other PD based systems. Table 7.1 shows a marked improvement by the FIS classifiers over the straightforward PD classifier. Much of this improvement is due to the presence of the trapezoidal membership
function ramps (i.e., extended, or fuzzy support) as indicated by the lower performance of the crisp input
membership function version of the FIS.
7.5 Conclusions
The PD algorithm has various weaknesses, some of which can be mitigated by applying the fuzzy set techniques described in this work. The improvements resulting from the fuzzification of PD are well demonstrated using the spiral class distribution shown in Figure 7.3 and Table 7.3:
• the curving shape of the underlying distribution poses a problem with respect to the rectangularly
divided quantization space;
• this in turn causes a great deal of conflict in some of the quantized hyper-cells, generating no rules
for these regions of discord;
• the extended fuzzy support of the input membership bins allows the extension of rules from adjacent
hyper-cells into these conflicting regions, allowing classification assertions to be made based on
neighbouring cells, albeit with decreasing performance as the distance away from the cell bound
increases.
The improved performance obtained through the use of the fuzzified input membership functions occurs through the firing and use of more (and possibly better) rules near inter-class feature value boundaries.
These rules were created for use in the adjacent quantization bins, however the validity of their assertions
extends into neighbouring bins with a high degree of observed accuracy, improving the overall performance. Together, these factors allow the use of fuzzified PD based classifiers even in strongly conflicting
regions of the problem space. Their performance is acceptable in spite of the regions of discord, and they
can be used with relatively low N.
While the BP classifier outperforms the FIS, the BP classifier itself suffers from several drawbacks,
notably the uncertain tuning of configuration parameters such as the number of hidden nodes, learning
rate, etc. Perhaps the most significant problem is the “black box” functionality of the neural net as a
whole, as described in the background discussion provided in Chapter 2. In contrast the FIS can be easily
configured based on the PD expectation of (3.4) as seen in the last chapter, and provides the transparency
we will require for a DSS.
7.6 Summary
The contributions explored in this chapter are:
• performance of the FIS in relation to PD;
• performance benefits of fuzzy membership functions; and
• performance benefits of the new occurrence weighting scheme.
The performance of the new FIS is better than PD performance. The use of fuzzy membership functions recovers some of the cost of quantization and improves the overall system performance. In particular,
the FIS(O/A) classifier has the highest performance of any PD based classifier.
The FIS and continuous-adapted PD systems have been evaluated on a variety of synthetic data distributions. The FIS performance still does not match the performance of BP, however the classifiers are
comparable.
Even though the FIS performance is not the best of the classifiers measured, the performance is reasonably high. A decision support system needs all of: a classifier to produce a suggested label; a good estimate of the confidence of the suggestion; and exceptional transparency. For the construction of a DSS
therefore, a classifier which performs well and exhibits a high degree of transparency is preferred to one
which may have a higher classification performance but lower transparency. This implies that the PD
based FIS system meets our needs admirably for DSS design purposes, as the performance is comparable
to the BP system and the transparency of the inference is much higher.
What remains is to test the system upon real-world examples, and to describe the system in a decision
support context.
Part III
Real-World Data and Decision Support
Chapter 8
Analysis of UCI Data
Many data sets are available in the Machine-Learning Repository (Newman et al., 1998) at the University of California, Irvine (UCI), which provides a collection of standard databases for use in evaluating the performance of various classification systems, and is now recognized as the source of "standard" data for
classification performance analysis. Two of the UCI data sets have been chosen for use here as they have
continuous-valued attributes: the “thyroid disease” and “heart disease” databases.
8.1 Analysis of Thyroid Disease Data
The human thyroid disease database has been chosen as an example both because this data falls within the biomedical domain and because the data is multi-class and includes a fair number of training records. Most of the data sets in the UCI repository are provided with only a very small number of records; this is frequently the case when each record is costly to obtain. The 7,200 records in the thyroid set therefore represent a reasonably large amount of data, considering it has been collected from a real source. This data set is referred to as thyroid-disease within the UCI repository, and is termed "the ANN∗ data set," as supplied by Randolf Werner of Daimler-Benz. The access URL for this database is ftp://ftp.ics.uci.edu/pub/machine-learning-databases/thyroid-disease.
8.1.1 UCI Thyroid Data Preparation
The relevant attributes of this database are:
∗ i.e., "artificial neural network".

• the data set is fully continuous;

• there are a reasonable number of records in the data set, having 3,428 designated for "training" and 3,772 for "testing" as presented in the repository, for a total of 7,200; and

• the data set represents real disease states recorded from human patients.

Table 8.1: UCI Thyroid Data Record Counts

Data Set    Class 0   Class 1   Class 2
1                33        74     1,333
2                33        74     1,333
3                33        74     1,333
4                33        74     1,333
5                34        72     1,334
Total           166       368     6,666
The one drawback of this data set is its rather poor documentation, as the features are not named. It is
impossible, therefore, to get a good sense of how this data set relates to actual disease data.
For this work, a number of jackknife sets were produced using the following procedure:
• the “testing” data was concatenated onto the end of the “training” data;
• records for each separate class {0, 1, 2} were isolated into separate files while preserving the order
to allow later comparison by other readers;
• each class was divided into 5 sets, and finally
• the sets of each class were combined to form 5 jackknife data sets, grouping the first of each of the
5 sets together to form the first jackknife set, then using the second of each, etc.
This divides the 7,200 available records among five files with the record counts by class as shown in Table 8.1.
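The following Python sketch illustrates this class-stratified split; the function and the record layout (class label in the final field of each record) are illustrative assumptions, not the actual routines used in this work:

    def make_jackknife_sets(records, n_sets=5):
        """Isolate each class (preserving record order), cut each class
        into n_sets nearly equal contiguous pieces, and combine piece i
        of every class to form jackknife set i.  Remainders are placed
        in the later sets; the exact remainder handling used to produce
        Table 8.1 may differ slightly."""
        by_class = {}
        for rec in records:
            by_class.setdefault(rec[-1], []).append(rec)
        sets = [[] for _ in range(n_sets)]
        for recs in by_class.values():
            k, m = divmod(len(recs), n_sets)
            start = 0
            for i in range(n_sets):
                end = start + k + (1 if i >= n_sets - m else 0)
                sets[i].extend(recs[start:end])
                start = end
        return sets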
8.1.2 Results
Using only the continuous-valued attributes from the data described in Section 8.1.1 (and thus using records with
6 input values plus a label for training), the following familiar classifiers were evaluated:
• a BP network;
• the MICD classifier;
• the base PD algorithm;
• the FIS(O/A) and FIS(WOE/I) classifiers of Chapter 4.

Table 8.2: 6 Feature Thyroid Data Correct Classification

Classifier     Fraction Correct
FIS(O/A)       0.962±0.005
FIS(WOE/I)     0.929±0.016
PD;Q=5         0.864±0.035
MICD           0.586±0.067
BP;H=4         0.960±0.012
For the FIS classifiers, Q=5 was used due to the very low numbers of records, especially in class “0.”
The results for these experiments are shown in Table 8.2, which displays the mean fraction of records classified correctly, together with the corresponding standard deviation. None of the algorithms left any of the records unclassified.
Interestingly, the FIS(O/A) hybrid system outperforms the BP algorithm, though only by a very slight margin. The BP classifier in turn outperforms both FIS(WOE/I) and the unmodified PD classifier. The MICD performance is very interesting, as it indicates that the data is certainly not drawn from a unimodal Normal distribution.
8.1.3 Discussion
The relatively high performance of the FIS(O/A) algorithm in the face of small numbers of
training records is a strong feature, especially when compared with the performance of the BP system.
The difference in performance between the various FIS algorithms is similar to that when tested on
synthetic data. The separation in classification performance of BP and FIS(O/A) has vanished,
and the relative ranking has reversed, though the difference in performance is negligible. This seemingly
surprising under-performance of the BP system is most likely due to the generally small number of records
available, and specifically the very small number of incidences of class “0.” The asymmetric distribution
of the labels will cause the training of BP to favour the most-observed label to some degree.
The analysis of thyroid data has therefore confirmed the relative ranking of the PD/FIS and PD classifiers, and has demonstrated the well-known weakness of BP when used with very small data sets. This
indicates that the PD/FIS, specifically the FIS(O/A) form, may be somewhat less susceptible
to problems of low N than BP, an algorithm known to need a great deal of training data. This encouraging result indicates that the analysis of synthetic data may have played into the strengths of the BP algorithm, due to the regular structure of the data. The PD/FIS system evidently extracts an amount of information from the thyroid data comparable to that extracted by the BP classifier; further, the performance across the jackknife
sets is quite stable.
This result supports the previous conclusions that the FIS(O/A) PD/FIS system performs
admirably well when functioning as a classifier. The labels suggested by the system are of as high quality
on this real-world problem as are those suggested by the BP system.
To further explore the performance of the PD/FIS system when dealing with the amount and type of
data found in real-world problems, another data set was examined.
8.2 Analysis of Heart Disease Data
As a further real world example we will consider the data set of the Hungarian Heart Disease Database†
from the online repository at UCI. These databases are available through the URL ftp://ftp.ics.uci.
edu/pub/machine-learning-databases/heart-disease.
8.2.1 Choice of Heart Disease Database
There are four “heart disease” data sets at the UCI repository: “Cleveland,” “V.A.,” “Hungarian,” and
“Switzerland.” The collection protocol between the databases varies slightly, preventing the use of the
collected records as one data set.
The database with the largest number of records, “Cleveland”, contains some fields which are in error,
as described in the documentation accompanying the database. For this reason, it was not used. The
“Hungarian” database has almost the same number of records (294 instead of 303), and has no history of
errors or corruption; the “Hungarian” data set is therefore used for the analysis presented here.
8.2.2 Heart Disease Database Features
The heart disease data set has 13 commonly used numeric input features, which are outlined in Table 8.3. These 13 features provide a sufficiently high order to allow the PD system to discover informative rules. This data will be further used in the decision support discussion in Chapter 10, where inspection of the relationships found will allow the reader to observe interesting consequences of the rule evaluation in the context of this real-life data.

† This data was collected through the efforts of Andras Janosi, M.D. of the Hungarian Institute of Cardiology in Budapest. Dr. Janosi donated this data to the UCI Machine Learning Online Repository (Newman et al., 1998) through David W. Aha in July 1988.

Table 8.3: UCI "Hungarian Heart Disease" Database Feature Names

Feature   Type(a)   Description(b)
A         C         age in years
S         N         sex [1 = male, 0 = female]
CP        N         chest pain type [1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic]
TRBPS     C         resting blood pressure (in mm Hg on admission to the hospital)
C         C         serum cholesterol in mg/dl
FBS       N         fasting blood sugar > 120 mg/dl? [1 = true; 0 = false]
RECG      N         resting electrocardiographic results [0 = normal; 1 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria]
TA        C         maximum heart rate achieved (bps)
EA        N         exercise induced angina [1 = yes; 0 = no]
OP        C         ST depression induced by exercise relative to rest
S         N         slope of the peak exercise ST segment [1 = upsloping; 2 = flat; 3 = downsloping]
CA        I         number of major vessels (0-3) coloured by fluoroscopy
T(c)      N         [3 = normal; 6 = fixed defect; 7 = reversible defect]
L         N         label for disease state; values are "healthy" and "diseased". The original data characterized 3 levels of disease state, however the accompanying documentation states that all known classification tasks have combined all disease states into a single class.

(a) C=continuous, N=nominal, I=integer
(b) These features and descriptive text are reproduced from the documentation accompanying the data set.
(c) This rather cryptic field is not explained in the documentation for the database in the UCI repository.
In Table 8.3, each column is described in terms of its “type.” In this mixed mode data set, some features
can be treated as continuous values using MME, and some must be treated as distinct. All features labelled
“continuous” are pre-processed using MME. All “nominal” or “integer” features are left unprocessed.
Table 8.4: Rule Order for Hungarian Heart Disease Data

Order    # of Rules
2nd      31.888±0.485
3rd      45.306±1.024
4th      38.412±1.891
5th      17.092±0.510

8.2.3 Overall Statistics from PD Analysis
In order to generate rules for this database, a “leave-one-out” protocol was established to jackknife over
all 294 records using a single record as a test case and the remaining 293 records for training. The full
jackknife protocol was used in order to make the most of the limited number of data records. All results
in this chapter are therefore averaged over all 294 tests.
The PD algorithm was run using Q=7 on these 294 different jackknife tests. This Q value was chosen
in order to ensure a reasonably high order of pattern could be discovered, as controlled by equation (3.4).
Note that the Q factor only affects columns with continuous data values, as integer and nominal data values
are not processed by MME. The number of elements in a non-continuous column is therefore determined
solely by the number of unique values occurring for that feature and is entirely independent of Q.
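A minimal Python sketch of this leave-one-out protocol follows; train_fn and classify_fn are hypothetical stand-ins for the PD rule generation (at Q=7) and PD/FIS classification routines, not the actual implementations of this work:

    def leave_one_out_accuracy(records, train_fn, classify_fn):
        """Hold each record out once as the test case, training on the
        remaining records; returns the fraction classified correctly
        (294 tests for the Hungarian data).  Assumes each record ends
        with its class label."""
        correct = 0
        for i, held_out in enumerate(records):
            model = train_fn(records[:i] + records[i + 1:])
            correct += classify_fn(model, held_out[:-1]) == held_out[-1]
        return correct / len(records)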
8.2.4 Pattern Discovery Generated Rules
For the heart disease data, there were on average 132.7 rules generated to describe a jackknife trial, with a
standard deviation of 2.554. The high number of rules indicates the complexity of the data set as observed
by the PD algorithm with MME binning.
The average number of rules discovered at each order, with standard deviations, is shown in Table 8.4. No rules higher than fifth order were discovered in this data set, due to the small number of training records available.
These low numbers indicate that the contingency table is quite sparsely characterized.
Most of the rules were weighted with non-infinite WOE, as an average of only 2.74 positive and 9.76
negative infinite rules were found in each test. On average, there were 65.27 rules characterizing disease
per jackknife set, with a standard deviation of 1.28. The average number of rules characterizing normal
patients was 67.42 with a standard deviation of 2.02.
From an information standpoint, this shows us that there is approximately equal complexity in the
relations regarding diseased and normal patients. Further, the low number of infinite-weight rules indicates
that both classes are mutually conflicted. The decrease in rule counts at the higher orders shown in Table 8.4 indicates that higher-order rules might have been discovered if more input records had been available.

Table 8.5: Heart Disease Data Correct Classification

Classifier    Fraction Correct
FIS(O/A)      0.830±0.376
FIS(WOE/I)    0.813±0.390
PD;Q=7        0.813±0.390
MICD          0.765±0.424
BP;H=2        0.639±0.480
BP;H=4        0.636±0.481
BP;H=7        0.653±0.476
8.2.5 Analysis of Performance
Performance statistics were gathered for the same classifiers used in thyroid data testing:
• the MICD classifier;
• BP;
• unmodified PD;
• FIS(O/A) and
• FIS(WOE/I).
Results for correct classification performance are shown in Table 8.5.
The FIS(O/A) system is the best performing classifier in Table 8.5, outperforming the MICD classifier by a significant margin, and improving on the performance of PD using MME quantization both with an increase in performance and a decrease in variability.

The BP classifier has quite low performance with high variation, indicating that this problem contains local minima in the BP error surface.
8.2.6 Discussion
Significant improvements are again shown by the FIS(O/A) algorithm over the PD algorithm alone, as in the tests on the thyroid and synthetic data sets.
The ranking of the other classifiers is stable across all the tests performed, once it is taken into account that BP frequently becomes trapped in local minima while evaluating this problem.
This example shows that the FIS(O/A) PD/FIS hybrid can attack a hard problem with limited amounts of training data and perform quite acceptably. The performance of the MICD classifier shows that this problem has a distribution which is not unimodal Normal, and the BP performance indicates that the amount of information in the data is quite small.
Looking again at Table 8.4, one wonders whether even higher order rules might increase the robustness and discriminative power of the rule set, as a correct classification rate of 0.830 is still much lower than one would desire, especially in a medical characterization problem. The performance of the FIS(O/A) system would therefore likely be higher if more data were available for training.
Overall, the performance of the FIS(O/A) system is quite encouraging, as again the BP
system performance has been exceeded. This PD/FIS system can therefore function as a classifier with
comparable performance to other popular classifiers. This implies, as in the thyroid data, that this complex
real-world data set shows strengths in FIS(O/A), or weaknesses in BP, that were not apparent
in the synthetic data.
The results from the synthetic analysis are therefore representative of problems for which FIS(O/A) is not strong. This in turn indicates that the performance shown in Chapter 7 does not
over-estimate the performance capabilities of the PD/FIS system; instead the analysis here shows that the
synthetic data class distributions provided a difficult test for the PD system.
8.3 Conclusions from UCI Data Analysis
The hybrid PD/FIS system performs comparably to the BP algorithm on two important real-world data
sets. Particularly high measured performance was observed when using the FIS(O/A) weightings and rule firings. This further indicates that the system is well-behaved as a classifier, supporting the
conclusions of the synthetic data analysis of the previous chapter.
The contributions of this work are shown to consistently improve the performance of PD based classification systems and provide a consistent benefit to the use of PD as a classifier on these real-world data examples. The specific contributions of interest are:
1. improvements to performance through FIS(O/A) weighting; and
2. improvements to performance through the use of fuzzy membership.
This measured classification performance shows that a DSS based on this method will produce a
correct suggestion with comparable likelihood to that of other classification systems. The use of the PD/FIS system as a classifier in a DSS is therefore suggested, as any penalty in classification performance is outweighed by the benefits provided by the transparency of the system.
Chapter 9
Confidence and Reliability
Science is built up of facts, as a house is with stones. But a collection of facts is no more a
science than a heap of stones is a house.
— Jules Henri Poincaré
As mentioned in the introduction, a decision support system (DSS) is useless if it does not have a
mechanism of reporting an estimation of the confidence that can be placed in the suggestion provided.
This requirement of a DSS is the means by which the system may fail gracefully. A confidence measure
provides the means by which the user may understand when there is conflict or poor statistical support in
a suggestion, and differentiate these occasions from those when the suggestion is made on confident and
clear measures. Using such a confidence measure, the user is informed under what conditions an error is
likely to be made.
9.1 Confidence and Reliability
A confidence measure is an estimate of the true reliability of the system. Equation 2.6 in Chapter 2
describes the measurement of reliability in terms of the measurable probability of failure:
C = E(R)
R = 1 − Pr(failure) ≡ 1 − Pr(Incorrect).
In the context of the hybrid system described in this work, the probability of failure is equivalent to the
probability of providing the user with an incorrect suggestion.
Reliability is measured over a series of results, however, and as such is usually quantified as the
average performance of a system. ROC analysis (as described in Section 2.3.1) provides a measure of the
reliability of a two-outcome test or classification system.
What is desired for the hybrid system described here is an estimate of the system reliability; however, a preferable estimate is one which maps the confidence of a correct association being made based specifically on the current input values. Of the class of decisions made by the system, not all can be made
with equal ease, and a good confidence estimate will be one which correlates well with the probability of
successfully classifying records at different degrees of difficulty.
We will therefore use the term reliability to indicate the system wide performance, and use the term
confidence to indicate the estimate of performance based on input data values.
This chapter will evaluate a set of possible confidence measures. Evaluation will consist of examining
how well each measure predicts the true probability of success or failure of the system, as evaluated over
the training/test data.
To choose a confidence measure, the candidate which best predicts the true (measured) probability of the system producing a successful classification, as calculated using the synthetic data distributions, is determined. This measure is then used to provide feedback to the user to indicate a varying expectation
of uncertainty in the suggested outcome. This provides the user a context of estimated certainty for the
suggestion put forward in a DSS.
9.2 Implementation of Confidence in Hybrid System
In the hybrid system described in this work, an assertion is assigned for each of the possible outcomes.
As described in Chapter 4, the assertion in support or refutation of class k produced by input value x is
termed Ak, and this value is bounded in the domain [−1 . . . 1]. Using these Ak values we can correlate
the distances in the assertion space with the observed errors found in training. From these data we can
evaluate various possible confidence measures in order to choose the candidate that suits our purposes the
best. Rather than evaluating confidence measures for all classifiers discussed, only the best-performing
classifier, FIS(O/A), will be evaluated.
9.2.1 Construction of τ and δ Measures
Of the possible outcomes for any classification, one outcome will have the highest assertion value. Remembering that
$$k^\star = \operatorname*{argmax}_{k \in K} A_k$$

from equation (4.4), we can define

$$\tau = A_{k^\star} \qquad (9.1)$$
to indicate the value of the highest assertion made. The label value which τ supports will be the label $Y = y_{k^\star}$, which is the label used as the final suggested characterization or classification.

In any system where K > 1 (i.e., any system containing more than one class), there will exist a class whose assertion value is the highest if the τ-label is ignored (i.e., the assertion with the second-highest value).
The assertion in support of this label is termed $\tau_2$, which is defined as

$$\tau_2 = \max_{\substack{i = 0 \ldots K \\ i \neq k^\star}} A_i. \qquad (9.2)$$
The conflict between the two classes associated with τ and τ2 can be measured by comparing the
conflicting support of the output assertions using
$$\delta = \tau - \tau_2. \qquad (9.3)$$
In cases of equal values for τ and τ2 , the value for δ is zero. The value for δ approaches the maximum
distance of 2 units in the case when there is total support for the class of the label associated with τ and
total refutation for all other classes; that is, when
$$A_{k^\star} = 1 \quad\text{and}\quad A_i = -1 \;\;\forall i \in K,\; i \neq k^\star. \qquad (9.4)$$
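As a concrete illustration, the following minimal Python sketch computes τ, τ2 and δ from a vector of assertion values, assuming only that the Ak are available as a numeric sequence:

    import numpy as np

    def tau_delta(assertions):
        """Compute (tau, tau2, delta) of equations (9.1)-(9.3) from the
        per-class assertions A_k, each bounded in [-1, 1]."""
        A = np.asarray(assertions, dtype=float)
        k_star = int(np.argmax(A))           # suggested label, eq. (4.4)
        tau = A[k_star]                      # highest assertion, eq. (9.1)
        tau2 = np.max(np.delete(A, k_star))  # second-highest, eq. (9.2)
        return tau, tau2, tau - tau2         # delta, eq. (9.3)

    # total support for one class with total refutation of the rest
    # drives delta to its maximum of 2, as in equation (9.4):
    print(tau_delta([1.0, -1.0, -1.0]))      # (1.0, -1.0, 2.0)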
9.2.2 Evaluating Confidence Through δ and τ
Figures 9.1 and 9.2 display histograms of the δ/τ values observed for classifications performed on the covaried data at separations of 0.125 and 4.0 respectively. All classifications in these plots were performed by the FIS(O/A) classifier. In each figure, the x and y directions indicate δ and τ. Associated with each (δ, τ) pair is a z value that indicates the fraction of the total number of records observed that fall into the adjacent cell of the (δ, τ) domain.

The top row of each figure shows a heat-map representation of the data, where lighter colours on the map indicate regions with higher values of z. White areas on the heat-maps indicate cells with the highest count for that map; black areas indicate values of zero. The same data is plotted below each heat-map in the form of a surface histogram, where the surface is extended in z to indicate the cell count, directly related to the colouring of the heat-map.

[Figure 9.1: δ/τ Histograms from Covaried Data, s=0.125. Four panels for the FIS(Occurrence/All) classifier: classification errors and correct classifications, each shown as a heat-map and as a surface histogram over the (δ, τ) domain.]
[Figure 9.2: δ/τ Histograms from Covaried Data, s=4.0. Four panels for the FIS(Occurrence/All) classifier at s=4.000: classification errors and correct classifications, each shown as a heat-map and as a surface histogram over the (δ, τ) domain.]

There is significant overlap seen in Figure 9.1 (separation s=0.125) between the correct and incorrect distributions, even though the midpoints of the two classes are well separated. This is expected as the
number of errors made is relatively high due to the poor discernibility of the problem. The overall expected
confidence based on these decisions is low.
Considering Figure 9.1 from the standpoint of reliability, only the best-separated portions of the surfaces indicate regions where a high-reliability decision is expected; in sections with a great deal of overlap,
the expected decision reliability is low.
Comparing Figure 9.1 to Figure 9.2 we see that, as expected, when separation increases the separation
in the δ/τ space between errors and correct classifications increases. As well, the overall number of errors
has dropped considerably. This can be seen by the much smaller maximum extent in z of the error graph
in Figure 9.2 versus that in Figure 9.1.
These facts indicate that in portions of the plots where no errors are made, high reliability classifications will be observed.
9.2.3 Candidate Confidence Measures
Based on the observation that the main mass of erroneous classifications in Figures 9.1 and 9.2 lies in a different portion of the (δ, τ) space than the corresponding mass of correct classifications, it would seem reasonable that a measure may be constructed that can discriminate decisions made in error from those made correctly, using τ and δ as inputs. If such a measure is related to the distance between means, or similarly, the relative probability of conflict in this space, that measure will serve as a robust indicator of the true probability of a correct classification label being suggested.
A great many different methodologies may be constructed under this hypothesis. Four candidates are
considered in this work:
• MICD Based Confidence;
• δ/τ Observed Probability Confidence;
• Normalized [0 . . . 1] Bounded τ Confidence; and
• PD (WOE Based) Probabilistic Confidence.
MICD Based Confidence
This is a measure based on the distance in δ/τ space as whitened using an MICD classifier (as per equation
(5.10) on page 60) and then compared against the sum of all distances. This is calculated by
$$C_{MICD} = \frac{e^{-\frac{1}{2} d_{correct}}}{e^{-\frac{1}{2} d_{correct}} + e^{-\frac{1}{2} d_{incorrect}}}. \qquad (9.5)$$
The values dcorrect and dincorrect are based on distances for correct and incorrect decisions, respectively.
The value dcorrect indicates the distance to the mean of the whitened distribution of δ/τ values calculated
by examining decisions made correctly using training data. Similarly, dincorrect is the distance related to
those decisions made incorrectly.
The rationale behind this confidence measure is the use of the Bayesian decision surface underlying
the inter-PDF measures in the whitened MICD space. The relative distance in this space corresponds
to the conditional probability of assignment under an assumption of Normally distributed “correct” and
“incorrect” points. This conditional probability directly mirrors the probability of failure.
While this simple probabilistic relationship provides a direct mapping to an a priori conditional probability, neither the “correct” nor “error” distributions are Normal, as can be easily seen in Figures 9.1
and 9.2. The data is generally bell-shaped, but the bounds of the (δ, τ) space sharply truncate the distribution. We have seen in previous chapters that the MICD classifier still performs reasonably well with
non-Normal data, degrading in a predictable fashion. For this reason, it is a useful candidate confidence
measure.
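A sketch of this calculation in Python is given below; the whitened distances are assumed to have been computed upstream via the MICD transform of (5.10), so only the final ratio of (9.5) is evaluated here:

    import math

    def micd_confidence(d_correct, d_incorrect):
        """Equation (9.5): relative likelihood that a decision belongs
        to the 'correct' cluster, given distances to the means of the
        correct and incorrect distributions in whitened (delta, tau)
        space.  The whitening itself is assumed done elsewhere."""
        l_correct = math.exp(-0.5 * d_correct)
        l_incorrect = math.exp(-0.5 * d_incorrect)
        return l_correct / (l_correct + l_incorrect)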
δ/τ Observed Probability Confidence
This measure is based on a histogram-generated probability of δ/τ values
$$C_{\delta/\tau\ \mathrm{Probability}} = \frac{\mathrm{Count}(correct, \delta, \tau)}{\mathrm{Count}(correct, \delta, \tau) + \mathrm{Count}(incorrect, \delta, \tau)}, \qquad (9.6)$$
where Count(correct, δ, τ) represents the number of correct classifications associated with the cell containing (δ, τ) on a histogram of the type shown in Figures 9.1 and 9.2. The term Count(incorrect, δ, τ)
The observation of a (δ, τ) pair during classification lets us report the likelihood of error observed for similar values at training time. This directly returns the probability of error based on a δ/τ pair, and as such should provide a good estimate of the probability of successful classification, and thus an indication of system reliability as considered on records associated with these values of δ and τ.
The main drawback to this scheme is the possibility of observing a (δ, τ) value which corresponds
to no known bin during the classification of new data values. It will be impossible to determine a true
confidence measure for such a value, as there is no prior data upon which to base such a decision. It
will, however, be possible to flag such values to the user and indicate that a confidence measure cannot be
computed.
A similar constraint involves values rarely observed during training; confidence based on such values
will have a much higher degree of inaccuracy than confidence based on histogram data cells which received many points during training. This implies the possibility of a “confidence of confidence” measure;
this extra complexity will be at best confusing, and is certainly not desirable.
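To make the mechanics concrete, the sketch below (Python with NumPy; the function names, bin count and use of histogram2d are illustrative choices, not drawn from the thesis software) builds the two histograms from training results and performs the lookup of equation (9.6), returning None for the unseen-bin case described above:

    import numpy as np

    def fit_dt_histograms(delta, tau, correct, bins=20):
        # Bin the training (delta, tau) pairs into separate "correct" and
        # "incorrect" histograms, as in Figures 9.1 and 9.2; `correct` is a
        # boolean NumPy array marking decisions made correctly.
        rng = [[0.0, 2.0], [-1.0, 1.0]]   # delta lies in [0, 2], tau in [-1, 1]
        h_ok, d_edges, t_edges = np.histogram2d(
            delta[correct], tau[correct], bins=bins, range=rng)
        h_err, _, _ = np.histogram2d(
            delta[~correct], tau[~correct], bins=bins, range=rng)
        return h_ok, h_err, d_edges, t_edges

    def dt_confidence(d, t, h_ok, h_err, d_edges, t_edges):
        # Equation (9.6); None flags a (delta, tau) cell never seen in training.
        i = int(np.clip(np.searchsorted(d_edges, d, side="right") - 1, 0, h_ok.shape[0] - 1))
        j = int(np.clip(np.searchsorted(t_edges, t, side="right") - 1, 0, h_ok.shape[1] - 1))
        n_ok, n_err = h_ok[i, j], h_err[i, j]
        if n_ok + n_err == 0:
            return None   # flag to the user: no prior data for this cell
        return n_ok / (n_ok + n_err)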
Normalized [0 . . . 1] Bounded τ Confidence
This calculation normalizes τ over the sum of all assertions made, after shifting τ so that it is [0 . . . 1]
bounded. This confidence is calculated using
C_{\tau[0\ldots1]} = \begin{cases} \dfrac{1+\tau}{\sum_{i=1}^{K}(1+A_i)} & \text{if } \sum_{i=1}^{K}(1+A_i) > 0 \\ 0 & \text{otherwise,} \end{cases}   (9.7)
remembering that τ is simply the highest-valued Ai as defined in equation (9.1), and that all Ai (including
τ) will be bounded by [−1 . . . 1] as output from the FIS.
The rationale in this measure is that a [0 . . . 1] bounded τ-space will behave in some ways similarly
to a probability space. Once this is done, the summation will contain a probability-type measure. A side
effect of this measure is that it changes the way in which information is represented. In PD and in the
discussion of the FIS so far, information has always been represented in terms of deviation from a central
0, where strong degrees of support or refutation provided a strong deviation. Information for each class i
is therefore ∝ |Ai |, and values around 0 will therefore result from rules which fire with low confidence.
In this scheme, the 0 value of the FIS becomes 1/2, similar in meaning and behaviour to a probability value; information is measured from this central limit. This means that if the reported confidence falls below 1/2 then the “best” guess has actually been made without a positive τ value.
One interesting property of this measure is that it is calculated only in terms of values internal to the
FIS system. The previous two measures require an external analysis of the δ/τ values from training data.
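A minimal sketch of equation (9.7) (Python; the function name is illustrative) underlines how little machinery this measure needs beyond the FIS outputs themselves:

    def tau_unit_confidence(assertions):
        # Equation (9.7): shift each class assertion A_i from [-1, 1] to
        # [0, 2] and normalize the shifted tau (the largest assertion).
        shifted_sum = sum(1.0 + a for a in assertions)
        if shifted_sum <= 0:
            return 0.0
        return (1.0 + max(assertions)) / shifted_sum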
PD (WOE Based) Probabilistic Confidence
This measure is not directly based on δ/τ, but instead is based on the WOE measures associated with the
fired rules.
It is possible to derive a conditional probability measure for association with a particular class Y = y_k based on a given input vector x_l^m, as constructed from the accumulated WOE*:

\Pr(Y = y_k \mid x_l^{\oplus}) = \frac{1}{\dfrac{1}{\beta^{\mathrm{WOE}^{*}}} \cdot \dfrac{1 - \Pr(Y = y_k)}{\Pr(Y = y_k)} + 1}   (9.8)
* Construction of this equation, derived from the definition of WOE in (3.7), is courtesy of Lou Pino (2005). A copy of the derivation is provided in Appendix B, beginning on page 163.
where x_l^{\oplus} is the composite pattern formed by considering all columns matched by the patterns fired, that is

x_l^{\oplus} = \bigcup_{x_l^{?} \in M} x_l^{?}   (9.9)

where x_l^{?} is the input portion of a pattern, as described in (3.5) when defining the use of WOE in PD in Chapter 3.
This value provides an estimate of the probability of association with the class chosen as the label. As
this maps the underlying probability of association within the data space itself, this is a reasonable choice
as a means of estimating system failure.
The main source of noise in this estimate is the presence of the discretization bins, as all values within
the same bin generate the same classification outcome and therefore, by necessity, have the same reported
confidence based on the same defined events.
This conditional confidence cannot be directly used in the hybrid system as we cannot make the
assumption of independence between the columns of input data.
In order to establish independence and approximate the confidence of the underlying PD classifier,
the “PD (WOE Based) Probabilistic Confidence” scheme uses the WOE based conditional probability of
equation (9.8) in conjunction with the WOE value computed by considering only the rules which would
fire under the “independent” rule selection scheme, though the “occurrence” scheme is still used to select
a label value. We must restrict our consideration to just this subset of the rules in order to maintain
independence in the WOE calculation.
The benefit of this scheme is its simpler relationship with the underlying conditional probability. Some interference with the correlation to the “true” confidence is expected, as only a subset of the rules actually fired are used in the confidence calculation.
As in the [0 . . . 1] bounded τ metric, this scheme also uses only values internal to the FIS calculations.
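The conversion of equation (9.8) is compact enough to show directly; in the sketch below (Python; taking the base β to be e is an assumption, and the accumulated independent-rule WOE and the class prior are taken as given) the accumulated weight of evidence is turned back into a conditional probability:

    import math

    def woe_confidence(woe: float, prior: float, beta: float = math.e) -> float:
        # Equation (9.8): posterior probability of the class given the
        # accumulated weight of evidence from the independent rule subset.
        prior_odds_against = (1.0 - prior) / prior
        return 1.0 / (prior_odds_against / beta ** woe + 1.0)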
9.2.4
Confidence Measure Evaluation Methodology
In order to compare these possible confidence values, a means of measuring their relative merit is required. The most direct method of evaluating confidence as a predictor of the true probability of correct
classification is simply to correlate the reported confidence values with a “true” confidence calculated
by examining the conditional probability of the winning class underlying the synthetic data distributions
described in Chapter 5.
To perform this calculation, all of the tests performed in the jackknife tests on the synthetic data were
examined and the confidence values recorded. These values were then compared with the conditional
probability of occurrence of the winning class. The computation of the conditional probability for the
synthetic data distributions is supplied in Appendix C. Each such comparison will result in a pair of
data values, consisting of the expected and reported confidence. These values can then be correlated, and
ideally will result in a straight line.
As the data to be correlated are quantized values, Spearman rank correlation will be used, rather than the common Pearson correlation, which correlates values from continuous random variables.†
Spearman Rank Correlation
The Spearman rank correlation coefficient (Press, Teukolsky et al., 1992; Lehmann and D’Abrera, 1998) calculates the correlation between the ordinal positions of elements in a list.
The Spearman ranking is defined in Press et al. (1992, pp. 640) as

r_{\mathrm{Spearman}} = \frac{\sum_{i=1}^{N} \left(R_{x_i} - \overline{R_x}\right)\left(R_{y_i} - \overline{R_y}\right)}{\sqrt{\sum_{i=1}^{N} \left(R_{x_i} - \overline{R_x}\right)^2} \; \sqrt{\sum_{i=1}^{N} \left(R_{y_i} - \overline{R_y}\right)^2}}   (9.10)
where R_{x_i} is the set of rankings in the x dimension and R_{y_i} is the related set in the y dimension, such that for an element a_i ∈ X, the ranking of this element is (R_{x_i}, R_{y_i}).
The values \overline{R_x} and \overline{R_y} are the mean rankings in x and y respectively, and the ranks for duplicate values are all equal to the mean of the rankings that would have been applied over the range; thus the list

X = { (0, 7.5), (0, 10), (0, 12.5), (0, 15), (5, 15) }

receives the rankings

X = { (2.5, 1), (2.5, 2), (2.5, 3), (2.5, 4.5), (5, 4.5) }.
The purpose of the Spearman ranking is to give a simple correlation calculation which is unbiased by the domain of the actual values, as long as they can be mapped onto an ordinal sequence.
In the Spearman ranking, a value of 0 indicates independence, and a positive or negative value indicates dependence along the major axis of y = x or y = −x, respectively.
† A discussion of the performance of several other possible metrics is provided in Appendix D. These measures include mutual information, symmetric uncertainty and the interdependence redundancy measures. Results for these measures are provided in Appendix D but are not included here, as the saturation of the measured values causes a significant distortion in the provided results. See the appendix for details.
The set

X = { (0, 0), (1, 1), (2, 2), (3, 3) }

will result in a Spearman ranking of 1. The set

X = { (0, 3), (1, 2), (2, 1), (3, 0) }

will result in a −1 ranking.
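These properties are easy to verify numerically; the short check below (SciPy; an illustrative script, not part of the thesis software) reproduces the cases above, including the mean-rank treatment of ties:

    from scipy.stats import spearmanr

    rho, _ = spearmanr([0, 1, 2, 3], [0, 1, 2, 3])       # concordant: rho = 1
    rho_neg, _ = spearmanr([0, 1, 2, 3], [3, 2, 1, 0])   # discordant: rho = -1
    # Tied values receive the mean of the ranks they span, as in the list above.
    rho_tied, _ = spearmanr([0, 0, 0, 0, 5], [7.5, 10, 12.5, 15, 15])
    print(rho, rho_neg, rho_tied)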
9.2.5
Confidence Measure Evaluation Results
Table 9.1 summarizes the Spearman rank correlation between the probability of a successful classification and the reported confidence for all the metrics examined. This table is provided to allow easy comparison between all reported data. Note that the table only contains values up to separation 2.0. Data at higher separations is not included, as there are too few errors observed at these separations to provide a reliable estimate of the curve.
Table 9.1: Conditional Probability -vs- Reported Confidence Spearman Ranked Comparison

Data       Separation   Cδ/τ Probability   CMICD   Cτ[0...1]   CPD-Probabilistic
Covaried   0.125        0.952              0.915   0.931       0.841
Covaried   0.250        0.957              0.914   0.925       0.931
Covaried   0.500        0.933              0.930   0.930       0.938
Covaried   1.000        0.957              0.932   0.932       0.727
Covaried   2.000        0.955              0.977   0.888       0.916
Covaried   µ            0.951              0.933   0.921       0.871
Bimodal    0.125        1.000              0.894   0.913       0.612
Bimodal    0.250        0.996              0.858   0.869       0.889
Bimodal    0.500        0.999              0.886   0.891       0.882
Bimodal    1.000        0.997              0.930   0.925       0.665
Bimodal    2.000        0.990              0.871   0.940       0.583
Bimodal    µ            0.996              0.888   0.907       0.726
Spiral     0.125        0.955              0.988   0.399       0.387
Spiral     0.250        0.979              0.988   0.842       0.645
Spiral     0.500        0.977              0.991   0.914       0.960
Spiral     0.750        0.963              0.993   0.683       0.931
Spiral     1.000        0.960              0.974   0.586       0.892
Spiral     µ            0.967              0.987   0.685       0.763
MICD Based Confidence: CMICD
Figures 9.3, 9.4 and 9.5 display the relationship found between CMICD and the probability of a successful classification for the covaried, bimodal and spiral data respectively. Examining these figures, we see a significant amount of noise; however, a general trend towards higher reliability at higher reported confidence is visible. The linear regression line placed across the plots clearly shows the trend to be in the correct direction.
In Figure 9.3 there is quite a bit of noise in the central region of the curve, especially at low separations, when many errors are being made. As the problem becomes easier at higher separations, the line straightens out and a higher correlation is recorded. This is also seen in Table 9.1, where the values for this confidence measure are shown in the column entitled CMICD.
Notable on the covaried plots at low separation in Figure 9.3 is the downward portion of the initial part
of the line. This portion of the plot is based on the MME bin with the largest range, and as such contains
points from a large range of confidences. This explains why the line deviates so sharply from the trend in
the rest of the graph. As this line indicates only that the reliability may be quite high when the confidence
is low, this artifact is not a critical problem.
Bimodal data, which exhibits fewer classification errors to begin with, has much less noise, as seen
in the plots in Figure 9.4. The spiral data in Figure 9.5 shows the most noise in the confidence/reliability
curve, as more errors are made in the assignments. As the bimodal data set is quite information rich in
comparison with the other two synthetic data sets, this response is expected.
[Figure 9.3: CMICD Plots on 4-Feature Covaried Data. Six panels plot reliability [1 − Pr(error)] against reported confidence for separations s = 0.125 through 4.000, each with a fitted regression line; the slopes fall from approximately 0.98 at the lowest separation to 0.45 at the highest.]
[Figure 9.4: CMICD Plots on 4-Feature Bimodal Data. Six panels plot reliability [1 − Pr(error)] against reported confidence for separations s = 0.125 through 4.000; regression slopes fall from approximately 0.75 at the lowest separation to 0.45 at the highest.]
[Figure 9.5: CMICD Plots on 4-Feature Spiral Data. Four panels plot reliability [1 − Pr(error)] against reported confidence for separations s = 0.125 through 1.000; the regression slopes vary widely (0.23 to 0.89), reflecting the noisier confidence/reliability relationship on this data.]
δ/τ Observed Probability Confidence: Cδ/τ Probability
The plots showing the correlation with reliability for the Cδ/τ Probability measure are shown in Figures 9.6, 9.7 and 9.8, again for covaried, bimodal and spiral data respectively. All three figures exhibit a strikingly noise-free estimation. Both the covaried data in Figure 9.6 and the bimodal data in Figure 9.7 show little deviation from the plotted linear regression, and the slope of the regression itself closely approximates the desired slope of y = x on these two plots. There is some decrease in the slope as the separation increases and the number of overall errors drops. Examining the spiral data in Figure 9.8, we see significantly more noise than is shown in either of Figures 9.6 or 9.7. As with the MICD based confidence measure, it seems that when the problem is harder, there is more noise in the confidence estimate.
Evaluating the Spearman correlation coefficients in Table 9.1, we see that this confidence measure, marked Cδ/τ Probability, has the highest correlation with true reliability.
While the downward-sloping trend is present at lower measured reliability values, it is less significant than in the CMICD plots.
[Figure 9.6: Cδ/τ Probability Plots on 4-Feature Covaried Data. Six panels plot reliability against reported confidence for s = 0.125 through 4.000; regression slopes stay near 0.90 up to s = 2.000, with the lowest (0.59) at the largest separation.]
[Figure 9.7: Cδ/τ Probability Plots on 4-Feature Bimodal Data. Six panels for s = 0.125 through 4.000; regression slopes fall gradually from 0.88 to 0.34 as separation increases.]
[Figure 9.8: Cδ/τ Probability Plots on 4-Feature Spiral Data. Four panels for s = 0.125 through 1.000; regression slopes range from 0.46 to 0.79, with visibly more scatter than in Figures 9.6 and 9.7.]
Normalized [0 . . . 1] Bounded τ Confidence: Cτ[0...1]
Turning to the normalized [0 . . . 1] bounded τ confidence measure, we see that the correlations shown in Figures 9.9, 9.10 and 9.11 (covaried, bimodal and spiral) again show a strong linearity, albeit with more noise in the estimate than is the case for Cδ/τ Probability.
The Spearman ranking in Table 9.1 shows that this measure is comparable to CMICD; however, it is not as strong as Cδ/τ Probability.
[Figure 9.9: Cτ[0...1] Plots on 4-Feature Covaried Data. Six panels for s = 0.125 through 4.000; regression slopes start near 2.0 at low separations (the measure under-predicts reliability) and fall to 0.50 at s = 4.000.]
[Figure 9.10: Cτ[0...1] Plots on 4-Feature Bimodal Data. Six panels for s = 0.125 through 4.000; regression slopes fall from roughly 1.2 to 0.38 as separation increases.]
[Figure 9.11: Cτ[0...1] Plots on 4-Feature Spiral Data. Four panels for s = 0.125 through 1.000; regression slopes vary from 0.40 to 1.25, again showing the noisier spiral domain.]
PD (WOE Based) Probabilistic Confidence: CPD-Probabilistic
Examining the CPD-Probabilistic column in Table 9.1, we see performance which is significantly poorer than that of the other confidence measures.
A look at Figures 9.12, 9.13 and 9.14 shows two significant things: there is quite a lot of noise in all three plots, and the slopes of the lines on each plot are considerably lower than those of the related plots for the other measures.
While neither of these poses a disastrous problem (as noted by the still relatively high correlations in Table 9.1), the comparison with the other confidence measures shows that this estimation technique is inferior.
[Figure 9.12: CPD-Probabilistic Plots on 4-Feature Covaried Data. Six panels for s = 0.125 through 4.000; regression slopes remain low, between 0.41 and 0.59.]
[Figure 9.13: CPD-Probabilistic Plots on 4-Feature Bimodal Data. Six panels for s = 0.125 through 4.000; regression slopes lie between 0.33 and 0.54.]
[Figure 9.14: CPD-Probabilistic Plots on 4-Feature Spiral Data. Four panels for s = 0.125 through 1.000; regression slopes range from 0.41 to 0.83.]
9.3
Discussion
All of the measures discussed provide an overall linear relationship with respect to reliability. Further, the trend of all measures shows increasing reliability correlated with increasing reported confidence.
The MICD based measure shows quite acceptable performance, showing saturation of the reported
confidence at the same time that total reliability is reached.
The Cτ[0...1] measure, while not showing large amounts of noise, saturates at complete reliability significantly before the confidence measure predicts this. Using a confidence measure which under-predicts the true reliability is certainly preferable to an over-prediction; however, this does decrease the measured correlation significantly.
The measure CPD-Probabilistic shows significant noise, and is the poorest of all of the measures discussed. The noise in this measure is due largely to the quantization of confidence, along with the quantization of decision outcome based on the MME quantization of the PD data space.
A further consideration in CPD-Probabilistic is that the rules fired for this measure do not exactly correspond to those used for WOE calculation, as only the rules which would be fired using the PD rule-independence scheme are used. This will introduce a further confounding factor into the relationship between assertion confidence and this measure of confidence reporting.
The plots for the Cδ/τ Probability measure show the least deviation from linearity. The regression line on these plots shows that there is a close relation with the desired reliability, as the slope is near 1.0. This outcome is not surprising as, of the four measures evaluated, this is the only one which is directly based on an evaluation of error; a high correlation is expected in a calculation directly reporting the error observed during training.
The smoothness of the lines in Figures 9.6, 9.7 and 9.8 is further due to the averaging effect of the
histogram calculation of expected confidence combined with the averaging effect of the quantized reliability calculation. This “double filtering” in the Cδ/τ Probability confidence reduces the noise exhibited by
this measure, and provides a better estimate of the overall system reliability.
The irregularities which appear in the spiral plots in Figure 9.8 are due to the poor separation of errors
in the δ/τ space for this problem. This in turn is driven by the difficulty of the problem overall, indicating
that there is a relationship between the information available to the system to solve a problem, and the
information available to estimate the reliability of the answer within the problem.
9.4
Conclusion
While all of the measures exhibit reasonable performance, the extremely good correlation of the Cδ/τ Probability
measure with respect to true system reliability indicates that this will be the best estimator to choose.
In order to use this estimator with a given decision support problem, a histogram of δ/τ values must be constructed. The construction of this histogram requires only an examination of the same training data used to construct the rule-base itself. Once created, this histogram functions as a lookup table from which the confidence values are chosen.
The explainability of this measure is very high, as it allows a decision to be characterized in terms of the reliability of similar decisions made using training data. An explanation of this form allows a decision-making user to understand that the confidence is not constant for the system, but is easily parameterized by a set of direct measures (δ and τ) based on the training data.
Cδ/τ Probability therefore is the measure which will be used in the decision support interface in order to
report decision confidence.
The presentation of such a confidence value allows a decision maker using the suggested label to
combine the DSS presented suggestion with data available from other sources. While this idea is normally
conceived in terms of a human decision maker, an equally plausible scenario would see the suggested
classification label and the confidence value as inputs to a computationally based multi-classifier voting
system.
9.5
Summary
Several possible candidate measures for decision confidence have been evaluated. The measure selected
for this hybrid system is that which provides the best indication of the system performance and which best
estimates the reliability of the suggestion. This measure will allow the user to incorporate the likelihood of
an erroneous label being suggested into a larger decision framework. The contribution of this confidence
measure within this work provides a new tool for measuring PD based performance and estimating system
confidence.
This type of confidence measure, coupled with a transparent means of determining a suggested label, constitutes the defining characteristics of a functional DSS.
Chapter 10
Decision Support and Exploration
Statistics are no substitute for judgement.
— Henry Clay
Decision support is a rigorous field in which the mechanism by which data is presented to the user will
define the utility of the system. The entire purpose of a decision support system is to provide explainable
insights into the decision process, based on numerical evaluation. If the results of this evaluation are
opaque or obscure, the resulting system will be confusing and, at worst, may make the resulting decisions
less reliable.
As stated by Edward Tufte (1997, pp. 27):
When we reason about quantitative evidence, certain methods for displaying and analyzing
data are better than others. Superior methods are more likely to produce truthful, credible, and precise findings. The difference between an excellent analysis and a faulty one can
sometimes have momentous consequences.
10.1
Design
It is therefore important to evaluate the work-flow by which a user of the hybrid system described here will obtain and evaluate a decision suggested by this system. The resulting inference exploration is top-down, starting with the presentation of a class label for a given input vector and working back towards the underlying statistical support structures forming the basis for the decision. Effectively, the process by which the classifier functions must be examined by running the data through in reverse.

[Figure 10.1: Hierarchical Design Encourages Drill-Down Exploration. The diagram stacks the layers of the cognitive model from most general to most specific: comparative decisions and overall characterization; FIS decision support, with exploration through “what if” scenarios; suggestion, via defuzzification of rule assertions and fusion of different test values; rule firings; membership functions; and input features, gathered by measuring values and calculating measures.]
The decision support system (DSS) supports the cognitive model pictured in Figure 10.1 by modelling the problem using a layered logical abstraction similar to the conceptual abstractions in the cognitive model.
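One way to realize such a layered abstraction is sketched below (Python; these type and field names are hypothetical and do not come from the thesis software): each suggestion carries its drill-down layers with it, so the interface can reveal them on request.

    from dataclasses import dataclass, field

    @dataclass
    class RuleFiring:
        clause: str        # e.g. "F1 is Very High and F2 is Very Low"
        assertion: float   # rule output in [-1, 1]

    @dataclass
    class DecisionSummary:
        # Top level of the drill-down hierarchy sketched in Figure 10.1.
        inputs: dict               # feature name -> measured value
        suggestion: str            # class label offered to the user
        confidence: float          # C_{delta/tau Probability} from Chapter 9
        conflict: float            # equation (10.1)
        rules: list = field(default_factory=list)        # level 1: RuleFiring list
        memberships: dict = field(default_factory=dict)  # level 2: feature -> {set: mu}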
10.2
Evaluation
In order to evaluate the DSS proposed here, the user work-flow will be discussed in detail for a few illustrative examples.
The following inference results will be examined:
• a straightforward positive classification (one for which PD has infinite WOE). This will provide an
overall flavour of the system and its confidence support.
• a classification at the decision boundary of a simple domain. This will allow rule and data explanations containing more complexity.
• a positive classification based on real-world data, containing a small degree of conflict. This will introduce the measurement of uncertainty as it is presented to the decision-maker, relating the confidence measure described in Chapter 9 to the exploration of the FIS rules.
• a positive real-world data classification with significant conflict. This will provide a further exploration of the uncertainty measurement, with further stress on the ability of the decision maker to
assure themselves of the degree of support for any possible decision.
10.2.1
Unimodal, 2-Feature Covaried Data
Two-feature covaried data will be used to provide a simple example to begin the discussion. Simple 2-class, 2-feature data was created by generating 1000 records for each of two classes (A and B). This data was then coloured using covariance matrices of

Cov_A = \begin{bmatrix} 160 & -52.58 \\ -52.58 & 48 \end{bmatrix} \quad \text{and} \quad Cov_B = \begin{bmatrix} 48 & 16.63 \\ 16.63 & 16 \end{bmatrix}.

The means of the two distributions were separated along the x axis by 9.36. The PD/FIS algorithm was then run on this data with Q=5 quantization intervals, generating descriptive rules as well as the quantization grid shown in Figure 10.2. This grid shows the training data points as well as the quantization bin boundaries into which they have been divided. Note that the most extreme points at top, bottom, left and right fall exactly onto (and actually define) the outer grid boundaries.

[Figure 10.2: MME Division of 2-class, 2-feature Covaried Data. Scatter of the classA and classB training points overlaid with the MME quantization grid; Feature A spans −31.58 to 45.74 and Feature B spans −22.58 to 27.26, with the labelled bin boundaries falling between.]
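Data of this form is straightforward to regenerate; in the sketch below (Python with NumPy; the zero-centred class A mean and the fixed random seed are assumptions, since the text specifies only the covariances and the 9.36 separation of the means) equal-frequency bin edges of the kind produced by MME quantization are also computed:

    import numpy as np

    rng = np.random.default_rng(0)
    cov_a = [[160.0, -52.58], [-52.58, 48.0]]
    cov_b = [[48.0, 16.63], [16.63, 16.0]]
    class_a = rng.multivariate_normal([0.0, 0.0], cov_a, size=1000)
    class_b = rng.multivariate_normal([9.36, 0.0], cov_b, size=1000)  # separated along x

    # MME-style quantization for Q = 5: equal-frequency (maximum marginal
    # entropy) bin edges, computed independently for each feature.
    both = np.vstack([class_a, class_b])
    edges_f1 = np.quantile(both[:, 0], np.linspace(0.0, 1.0, 6))
    edges_f2 = np.quantile(both[:, 1], np.linspace(0.0, 1.0, 6))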
[Figure 10.3: δ/τ Histograms from Covaried 2-feature Data. Normalized count surfaces over the (δ, τ) space, shown separately for correct classifications and for errors.]

This two-dimensional data set can be easily
visualized, allowing the relationship between the measured data values and the decision space to be well
understood.
Figure 10.3 shows the δ/τ histogram surfaces for this data set, upon which we will calculate the Cδ/τ Probability confidence, the measure chosen in Chapter 9. In this figure we see that the extent of the errors is similar to that of the correct classifications; however, there are significantly more correct observations in the intersecting area.
[Figure 10.4: Covaried Classification With High Confidence Summary. The summary display reads: F1 = 40.0, F2 = −20.0; Suggestion: A; Confidence: 1.000; Conflict: 0.0; # of Rules Fired: 6 of 54.]
10.2.2
Unimodal, 2-Feature Covaried Classification With High Confidence
For the first analysis, consider the record (40, −20), which in the simple data topology of Figure 10.2, will
appear in the lower-right corner. Records at this location are unambiguously associated with A.
If we assign this point a label using the FIS in a DSS context, we are presented with the summary
output display shown in Figure 10.4, which shows the input data from features F1 and F2, along with: the
suggested labelling; the decision confidence calculated using Cδ/τ Probability ; and the decision conflict. The
conflict is calculated simply using

\mathrm{Conflict} = \begin{cases} \tau - \tau_2 & \text{if } \tau_2 > 0 \\ 0 & \text{otherwise.} \end{cases}   (10.1)
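A small sketch of this calculation (Python; the function name is illustrative, at least two class assertions are assumed, and the ∞ convention for the case with no positive assertions is taken from the summary displays discussed below):

    def conflict(assertions):
        # Equation (10.1): conflict is reported only when the runner-up
        # assertion tau_2 is itself positive.
        ranked = sorted(assertions, reverse=True)
        tau, tau2 = ranked[0], ranked[1]
        if tau <= 0:
            return float("inf")   # "No Positive Assertions" display case
        return tau - tau2 if tau2 > 0 else 0.0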
As is obvious in Figure 10.4, there is no conflict for this simple example, and the projected reliability is unity. While unity is rather an extreme projection of confidence, it is due to the position of the point in a portion of the space with no conflicting classifications, having a τ value of 0.858 and a δ of 1.573. Locating these values in Figure 10.3 places the (δ, τ) location in the black bar above all of the observed errors, but within the observed correct classifications. This location therefore has the support of previous correct classifications, and has never been associated with an error, so the reported confidence is in fact reasonable based on Cδ/τ Probability.
The output shown in Figure 10.4 is kept terse for two major reasons:
• the purpose of a decision support system is to reduce the cognitive load by summarizing decision
information. The decision exploration task is driven by user request in order to allow greater clarity
in the summarization;
• this technique will allow summaries from multiple decisions to be collected — this will allow “what
if?” scenarios to be explored, where the decision maker can examine the effect of changing an input
value without removing the result of the true values. Such a scenario will be discussed at the end of the Chapter.

[Figure 10.5: Covaried Classification With High Confidence Rule Set. For A, the compound rule (F1 is Very High and F2 is Very Low) heads the column, with assertions: 1.000 for F1 is Very High and F2 is Very Low; 0.928 for F1 is Very High; and 0.645 for F2 is Very Low; giving 0.858 support for A (the chosen label value), 1.000 confidence, and 0.000 conflict (“No Assertion Conflict”). For B, the same rules assert −0.290 for F2 is Very Low; −0.855 for F1 is Very High; and −1.000 for F1 is Very High and F2 is Very Low; giving a refutation for B of −0.715.]
Decision Exploration: Drill-Down Into Rules
Assuming the user would like further information regarding the means by which the system arrived at this
suggestion, rules associated with this suggestion may be displayed. The appropriate rules are presented,
ranked in decreasing order of assertion value, separated by the class of support, as shown in Figure 10.5,
which shows the same data as was indicated in Figure 10.4.
As is seen in Figure 10.5, only six rules have been used to characterize this data value, so a complete
inspection of all the rules associated with this assignment will be both clear and informative.
The display shows the results in support (or refutation) of the suggested labelling on the top, with
the label value highlighted. Choices for other classes are then displayed in decreasing order of overall
assertion value. In this example there are only two possible labels, “A” and “B.”
Considering the rules associated with A, one has fired with the highest possible assertion value
(1.0), and the other two have fired with strongly positive assertions. Note that the higher order rules
displayed here have higher assertion values. This is consistent with a rule set that has only support for one
of the possible labellings, and this is reflected to the user by the phrase “No Assertion Conflict.”
When examining the competing class, the display of Figure 10.5 indicates that there is indeed no
support at all for B, as all the assertion values are negative (indicating a complete refutation of the label B). This is indicated in the summary line with the word “refutation” instead of “assertion.”
To review the success of this display, it is illustrative to list the goals which have been met:
Relay Complexity of Decision: In this simple decision, all the rules have been presented.
Describe Degree of Conflict: Conflict is explicitly labelled in this simple plot. The conflict value is intentionally offset in the column of numerical data in order to avoid visual clutter with the assertion value.
Decision Exploration: Drill-Down Into Membership Functions
If the user wishes to “drill down” to discover the degree of membership in the various input classes, the display of Figure 10.6 is shown. This display is part of the same example featured in both Figure 10.4 and Figure 10.5, and so shows the same input data values.

[Figure 10.6: Membership Values For 2-Feature Simple Classification Example. Trapezoidal membership functions (Very Low, Low, Medium, High, Very High) are drawn over the universes of discourse of F1 and F2; the input values F1 = 40 and F2 = −20 each fall entirely (µ = 1.0) within a single set, Very High for F1 and Very Low for F2, and those sets are highlighted.]
This display shows the number line between the minimum and maximum observed training values
which forms the universe of discourse (UOD) for this fuzzy relation. The extent of the universe is covered
by the fuzzy membership functions describing the various fuzzy input sets.
The names of the fuzzy sets are shown above the membership functions, and the points which form
the bin boundaries in the underlying PD training configuration are labelled. These are the same points that
correspond to the ends of each trapezoidal plateau, from which the ramp extends. The end points of the
ramps are not labelled, to avoid excessive “chart-junk” as described by Tufte (1983, pp. 100-121).
Also following the principles outlined by Tufte (1997, pp. 73-78), the “smallest effective difference”
is employed to highlight the information in the display while maintaining the background of supporting
information in a non-distracting form. This is achieved by highlighting the membership function to which
the input point is assigned using the same line style but with a darker weighting. The set name is also
coloured with a darker pen. The membership within the set is added above the set name, associating
the set with its information content and dissociating this set from those which are irrelevant to further
discussion. Finally, the input point itself is added, along with a vertical line to describe the point at which
the input enters the number line and a tag to describe its value.
Given this description, we can see that the display in Figure 10.6 clearly shows us that the two input values for Features “F1” and “F2” were 40 and −20 respectively, and places these visually within the universe of observed events. These input values have been assigned to the fuzzy input sets of “Very High” in Feature F1 and “Very Low” in Feature F2. As the points fall on the number line far away from any regions of fuzziness, there is membership in only a single fuzzy set for each UOD.
The goal in this display is to associate the rule set with the underlying fuzzy set membership functions. For this reason, the greatest visual draw in the display is the set of membership function boundaries, which indicate to the user the portion of the UOD associated with the fired rules. Secondary to these are the actual input values since, when using MME and fuzzy membership aggregation schemes, the distinct values from the input are preserved only in terms of the fuzzy set membership relations. This relationship is managed by the relative amount of ink devoted to the membership function versus that for the representation of the input values.
The remainder of the display is shown in muted tones in order to reduce clutter while maintaining a
framework for the important information.
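The membership functions drawn in these displays are the trapezoids described above: plateaus between the labelled MME bin boundaries, with overlapping ramps at their ends. A generic evaluation routine (a sketch; the corner parameters a, b, c and d are illustrative) is brief:

    def trapezoid(x, a, b, c, d):
        # Membership rises on [a, b], is 1.0 on the plateau [b, c],
        # and falls on [c, d]; zero outside [a, d].
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)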
10.2.3
Unimodal, 2-Feature Covaried Classification With Low Confidence
As a contrasting test, let us classify the record (−2, −20). This location will fall on the decision surface
between the classes and will therefore demonstrate what happens when the certainty of the system is
minimal in this very simple data domain.
For this point, the display shown in Figure 10.7 is presented to the user as the initial summary screen.
This point falls very near the quantization boundary between the two classes. The PD algorithm would have assigned this point to B, but here it is instead marked “(B)”, as no class had a positive τ value.
[Figure 10.7: Covaried Classification With Low Confidence Summary. The summary display reads: F1 = −2.0, F2 = −20.0; Suggestion: (B); Confidence: 0.500; Conflict: ∞; # of Rules Fired: 7 of 54.]
As a further indication that there is no positive support for any class, the “Conflict” column in Figure 10.7
is marked as ∞. This shows the conservatism of the PD/FIS DSS relative to simple PD.
The overall confidence of this classification is marked as 0.5, as for this δ/τ location we have observed
an equal number of correct and incorrect decisions.
All of the above factors combine to make it clear to the user that this suggestion is based on only
the thinnest degree of discernment. A very important corollary of such an assertion is that a confident
decision cannot be made based on this data, and that therefore in the context of a larger decision, this lack
of confidence needs to be taken into account.
This will suggest to the decision maker that testing using another source of data may be prudent, as
the DSS suggestion indicates that the particular data values recorded are not able to produce a confident
decision. If an additional test or report is available to provide more insight, the decision may thereby be
improved.
Decision Exploration: Drill-Down Into Rules
Figure 10.8 shows the rules upon which this decision is based. Note that both classes in Figure 10.8 have rule output values which are both positive and negative. This indicates that the rules
triggered within the FIS capture the knowledge that this location is a boundary location, and further, that
logical arguments exist which would assign it to either class. It is by means of the simple aggregation
through the defuzzification operation that these sets of contradictory votes are turned into single scalar
assertions (Ak ) summarized at the bottom of each column.
The overall Ak for both classes are negative, indicating that τ will be negative. This indicates that no
classification will be performed, thus the report of the summary display in Figure 10.7.
To further amplify the point that no assertion is made due to complete lack of support, the conflict
display is marked as ∞ and the accompanying summary text is set to “No Positive Assertions”.
[Figure 10.8: Covaried Classification With Low Confidence Rule Set. For B, under the compound rule (F1 is Medium and F2 is Very Low), the fired rules assert 0.640 for F1 is Medium; 0.723 for F1 is Low; −0.290 for F2 is Very Low; and −0.675 for F1 is Medium and F2 is Very Low; giving −0.009, a refutation for B (the chosen label value), with 0.500 confidence and conflict marked ∞ (“No Positive Assertions”). For A, the rules assert 0.645 for F2 is Very Low; −0.445 for F1 is Low; −0.280 for F1 is Medium; −0.575 for F1 is Medium and F2 is Very Low; and −0.750 for F1 is Low and F2 is Very Low; giving a refutation for A of −0.183.]
[Figure 10.9: Membership Values For 2-Feature Simple Classification Example. The same membership function display as Figure 10.6, now for the input (−2, −20): F1 = −2 falls into the overlapping Medium and Low sets with memberships of 1.0 and 0.407 respectively, while F2 = −20 again falls entirely (µ = 1.0) within Very Low.]
Decision Exploration: Drill-Down Into Membership Functions
Examining the membership functions underlying the rule set of Figure 10.8 will bring up the display in
Figure 10.9. Note that feature F1 is no longer assigned uniquely to a single fuzzy membership function.
The point −2 now falls into the overlapping membership functions for the fuzzy sets Medium and Low, with memberships of 1.0 and 0.407 in these sets respectively. This in turn causes more rules to be activated and shown in Figure 10.8, as the rules associated with both the Medium and Low sets may be used.
Comparing Figure 10.6 from the last membership function example to Figure 10.9 we see that the
figure at once looks strikingly different, even though the positions of the lines defining the memberships
have not changed. Confining the highlighting to changes in intensity allows various different parts of the
figure to be brought to focus easily, while maintaining a common paradigm for all decisions made by a
given rule base. Remembering that the membership functions are defined as a product of training on a
given data set, this means that when applying this knowledge system within a specific application area
an operator will always see the same membership functions for the same feature, regardless of the input
values. It is only the highlighting of the memberships which will change from decision to decision.
This interactive drill down mechanism thereby allows the user to explore as much, or as little, of the
decision space as is desired, while keeping the summary information terse and thus not confusing.
10.2.4
Heart Disease Data
The Hungarian heart disease data introduced in Chapter 8 will be used as a “real-world” example, in order
to demonstrate how the system may be used by a medical professional. This data set has been chosen
to support the discussion as it provides a complex, real-world data example in which the relationships
between the features are not easily discovered by casual inspection by the reader, yet is from a domain
in which the conclusions reached by the system can be examined in terms of the reader’s understanding
about the factors underlying heart disease data. As this is a topic with significant discussion in both the
popular press and the medical community, it is expected that the reader will have a passing familiarity
with the symptoms and import, if not directly the measurement, of heart disease.
For these reasons, though it is assumed that only a trained user will be completely familiar with the
fields which have been described in Table 8.3 (from page 92 in Chapter 8), persons with a casual interest
in heart disease will still find some of these features to be familiar from other literature, and this will frame
the following rule evaluations into a useful context.
For characterization purposes, the suggestions based on the heart disease database have one of two values: “Diseased” or “Normal.”
The δ/τ distribution surface histograms are shown in Figure 10.10, which shows that the separation between correct and incorrect is present, but not substantial. The fraction of characterizations made correctly by this system is 0.83 of the total. This, in conjunction with the wide distribution of the error values, indicates that there will seldom, if ever, be a value which will be asserted at unity (1.0) confidence as was seen in the simplistic 2-feature data just examined. Similarly, the confidence values will attain reasonable values, as inspection of the error and correct confidence surfaces of Figure 10.10 shows relatively few errors occurring over much of the region where δ/τ values have been observed.
10.2.5
Heart Disease Classification With High Confidence
In order to provide an example from a real database with as much clarity as possible, the record was
located which had the highest degree of confidence in its suggestion of any record in the Hungarian heart
disease data set. The values for this record are shown in Table 10.1.
When this record is considered by the FIS decision support system, we are presented with the results
shown in Figure 10.11. This display is quite similar to the 2-feature display shown in Figure 10.4, with the
addition of several more input feature values. The confidence value is again very high as the histogram cell
associated with (δ, τ) pair (0.679, 0.626) received 34 correct classifications and one error during training.
This datum will be available to the user by highlighting the “Confidence” value of 0.971 in Figure 10.11.
[Figure 10.10: δ/τ Histograms from Heart Disease Data. Normalized count surfaces over the (δ, τ) space, shown separately for correct classifications and for errors.]

[Figure 10.11: Heart Disease Classification High Confidence Summary Display. The summary display shows the input record of Table 10.1 (48.0, 1, 4, 160.0, 193.0, 0, 0, 102.0, 1, 3.0, 2, 0, 3), with Suggestion: Diseased; Confidence: 0.971; Conflict: 0.0; # of Rules Fired: 39 of 132.]
Table 10.1: Heart Disease Classification High Confidence Data Record

Feature    Value
Age        48.0
Sex        1       (male)
CP         4       (asymptomatic pain)
TRestBPS   160.0   mm Hg
Chol       193.0   mg/dl
FBS        0       (fasting blood sugar ≤ 120 mg/dl)
RestECG    0       (normal)
ThalAch    102.0   bpm
ExAng      1       (exercise induced)
OldPeak    3.0
Slope      2       (downsloping)
CA         0       major vessels coloured by fluoroscopy
Thal       3       (normal)
Decision Exploration: Drill-Down Into Rules
Assuming again that the user wishes to drill down to the underlying rules, the display shown in Figure 10.12 is presented. As there are 39 rules associated with this classification, it is neither feasible nor desirable to display them all. Instead, the top 10 rules are shown for each class, ranked in order of decreasing assertion value. The user is provided access to the complete list via another drill-down control.
The bracketed rules across the top of each section of Figure 10.12 show the compound rule formed
from the highest WOE independent rule firings (this is the same compound rule described in relation to
equation (9.8) in the CPD-Probabilistic discussion on page 104). While this form of rule construction is not
used to generate a weighting, it is expected that this information will aid the user in understanding the
characterization. The brackets in this statement indicate the FIS rules from which the compound has been
composed. Each bracketed sub-clause therefore refers to one of the rules in the list below. The entire
statement is shown in a muted tone in order to indicate that while it contains information useful to the
user, it should not distract from the rule list below.
Note that in this example there is no conflict recorded, as the defuzzified assertion values are A_Diseased = 0.642 and A_Normal = −0.518 respectively, even though the maximum rule assertion is positive in both classes. The positive values for the Normal class shown in Figure 10.12 are combined with the other Normal assertions into the defuzzified centroid value shown at the bottom of each column.
The suggestion of the label Diseased for this record is therefore a conflict-free, high-confidence classification: the highest assertion (τ = 0.642) is both positive and distant from the second-highest assertion (τ2 = −0.518, giving δ = 1.160), leading to a very high confidence of Cδ/τ Probability = 0.971. As this is the highest-confidence record in the database, and given the common knowledge that males are at higher risk for heart disease, along with an understanding that the correlation with disease increases with age, it is unsurprising that this record would be a male of at least middle age.
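As a small illustration of the quantities just discussed, the following sketch derives τ, τ2, δ and a conflict value from the per-class defuzzified assertions (assumptions mine; the conflict reading below is one plausible interpretation of the displayed values, not the thesis code):

    # Derive tau, tau2, delta and conflict from defuzzified class assertions.
    def summarize(assertions: dict) -> tuple:
        ranked = sorted(assertions.values(), reverse=True)
        tau, tau2 = ranked[0], ranked[1]
        delta = tau - tau2          # distance between the two best votes
        conflict = max(0.0, tau2)   # conflict only if the runner-up is also positive
        return tau, delta, conflict

    print(summarize({"Diseased": 0.642, "Normal": -0.518}))
    # (0.642, 1.16..., 0.0): positive, well separated, conflict-free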
D (Diseased)
Compound rule: (A is M H and S is M and S is M H and CP is A and FBS is M) and (RECG is M L)

Assertion    Implied When . . .
 0.950       A is M H and S is M and S is M H
 0.905       A is M H and S is M and EA is H
 0.852       A is M H and S is M and CP is A and FBS is M
 0.846       A is M H and S is M H
 0.833       A is M H and S is M and CP is A
 0.800       A is M H and S is M and CP is A and RECG is M L
 0.791       S is M H
 0.787       EA is H
 0.786       A is M H and EA is H
 0.730       A is M H and CP is H
             (10 more rules fired with confidence [0.675 . . . 0.292])

 0.642       Support for D (Chosen Label Value)
 0.971       Confidence
 0.000       No Assertion Conflict

N (Normal)
Compound rule: (A is M H and C is L) and (S is M and CP is A) and (OP is H) and (EA is H)

Assertion    Implied When . . .
 0.813       A is M H and C is L
−0.128       S is M
−0.277       OP is H
−0.345       A is M and EA is H
−0.349       A is M and S is M H
−0.368       A is M H and S is M and FBS is M
−0.398       A is M H and S is M
−0.434       TA is V L
−0.482       A is M H and CP is A
−0.490       CP is A
             (9 more rules fired with confidence [−0.507 . . . −0.903])

−0.518       Refutation for N

Figure 10.12: Heart Disease Classification High Confidence Rule Set
Decision Exploration: Drill-Down Into Membership Functions
If the user wishes to proceed to examine the inputs triggering the rule firings, the input membership functions will be displayed by user selection, as discussed earlier. The display shown in Figure 10.13 indicates the mapping from the input value into the mixture of membership functions named in the rules for five input features of the heart disease database. Only the features Age, Sex, CP, TRestBps and Chol are shown, in order to allow the figure to be placed on a single page. The remaining features would appear below the ones shown here on a computer based system. If necessary, a scroll bar will allow the user to see all the data elements.
Note that in contrast to Figure 10.6, there are input values which trigger membership in more than one fuzzy set in Figure 10.13, specifically Age and Chol.
The age of the patient indicated in this record (i.e., 48 years of age) falls within the central region of one of the membership functions while only catching the ramp at the end of the neighbouring function. The reading for serum cholesterol (Chol) exhibits a similar relationship to its membership functions. As indicated, this means that the patient has “Low” cholesterol in the terminology of the FIS system, and may be considered “Very Low.” In contrast, the measurement for TRestBps (resting blood pressure in mm Hg upon admission to the Emergency department) falls uniquely into the “Very High” category.
The measures for Sex and for CP are shown as crisp sets; this is because these fields are stored as nominal integer labels in the input data set. Representation of Sex with non-fuzzy membership makes sense as (except for the tiny fraction of the population exhibiting hermaphroditism) this is a physiologically crisp division. The CP value represents a nominal label applied by an examining physician; while there may be some vagueness associated with the label choice, information representing the degree of vagueness associated with this choice is not available to us, so a crisp nominal label is the clearest and most informative choice here. Remembering that the purpose of a DSS is to aid in decision making, it is clear that representation of vagueness in the data must be driven by that vagueness which we can accurately quantify and thus represent truly. Adding further vagueness measures would simply add more complexity to the system, making it more confusing and therefore less useful.
The use of a crisp set within the FIS simply prevents the construction of “ramps” and ensures that all possible associated memberships are equal to 1.
Using the architecture of the FIS, the linguistic labels {Male, Female} and {Typical Angina, Atypical Angina, Non-Anginal Pain, Asymptomatic Pain} have been loaded as the names of the sets in the FIS. This allows the input database to remain in its standard integer form, but supply the user with the appropriate label.
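A minimal sketch of this label loading follows (Python; the dictionaries use the standard UCI heart disease coding, and the function name is illustrative):

    # Linguistic names for crisp nominal features; integers stay in the database.
    SEX_LABELS = {0: "Female", 1: "Male"}
    CP_LABELS = {1: "Typical Angina", 2: "Atypical Angina",
                 3: "Non-Anginal Pain", 4: "Asymptomatic Pain"}

    def crisp_membership(value: int, label_value: int) -> float:
        """A crisp set has no ramps: membership is 1.0 in the set, 0.0 outside."""
        return 1.0 if value == label_value else 0.0

    print(CP_LABELS[4], crisp_membership(4, 4))  # Asymptomatic Pain 1.0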
[Figure 10.13: Heart Disease Classification High Confidence Membership Functions. Input membership functions for the Age, Sex, CP, TRestBps and Chol features, with the record values (48 years, Male, asymptomatic pain, 160 mm Hg and 193 mg/dl) marked against fuzzy sets ranging from Very Low to Very High; Sex and CP appear as crisp sets.]
[Figure 10.14: Heart Disease Classification Conflict Summary Display. Input record: 42.0, 1, 3, 160.0, 147.0, 0, 0, 146.0, 0, 2.25, 2, 1, 3. Suggestion: D; Confidence: 0.667; Conflict: 0.053; # of Rules Fired: 29 of 132.]
Decision Exploration: Drill-Down Into Statistical Support
A further drill-down is available to the user based on individual rules, allowing inspection of the underlying relationships. For example, if the user wishes to inspect the rule “Sex is Male,” the statistics from which this rule was generated are presented. In this case, this second order event (a Male patient in the training set who is Diseased) was observed 94 times out of a total of 212 Male patients, indicating a probability of 0.443 for an incidence of heart disease in Male patients. For Female patients in the training data set, the probability is 12/81 or 0.148.
This form of inspection is available for every rule which has been stored to define the FIS; that is,
the complete set of rules (patterns) deemed to be of statistical significance by the PD pattern extraction
algorithm.
The mechanism for initiating this inspection will be selection of (i.e., clicking on) the label “Assertion” in a display such as Figure 10.12. This action will cause the assertion values to be replaced with the underlying occurrence numbers. Note also that, as the assertion values are based on FIS(O/A) evaluation, these values are simply probabilities calculated through occurrence observations.
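The occurrence arithmetic behind this drill-down is direct; a small sketch follows (Python; illustrative names, not the thesis code):

    # Rule probability recomputed from stored training occurrence counts.
    def rule_probability(joint: int, conditioning: int) -> float:
        """P(label | antecedent), e.g. P(Diseased | Male)."""
        return joint / conditioning

    print(round(rule_probability(94, 212), 3))  # 0.443 for Male patients
    print(round(rule_probability(12, 81), 3))   # 0.148 for Female patients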
10.2.6
Heart Disease Classification With Conflict
The true utility of a decision support system lies in its ability to aid a user in understanding data which
contains conflict or uncertainty. Let us therefore assume that the data record shown in Table 10.2 is
processed. This record was selected by examining the database for records with conflict that still contained
a high τ value.
The summary display of Figure 10.14 resulting from the processing of this record immediately flags the fact that there is some conflict in the decision involving this patient record, and that the confidence is reduced from the previous case, as more errors were found relative to correct classifications in this δ/τ histogram cell. The increased visual contrast of the conflict measure allows a user to immediately distinguish between records of this type and those with no conflict.
Table 10.2: Heart Disease Classification Conflict Data Record

Feature   Value
A         52.0
S         0       (female)
CP        4       (asymptomatic)
TRBPS     130.0   mm Hg
C         180.0   mg/dl
FBS       0       (fasting blood sugar ≤ 120 mg/dl)
RECG      0       (normal)
TA        140.0   bps
EA        1       (exercise induced)
OP        1.5
S         2       (downsloping)
CA        0       major vessels coloured by fluoroscopy
T         3       (normal)
Decision Exploration: Drill-Down Into Rules
In response to a drill-down selection, Figure 10.15 displays all the rules which have been fired. For this interesting record, it is instructive to examine all of the rule values, rather than just the top 10 of each set; the display in Figure 10.15 has therefore been fully expanded.
Note the range of assertions in the defuzzification space for Normal, which results in a low assertion of 0.053 even though some rules support this classification with values as high as 0.875. Defuzzified support for the Diseased class is much higher, as there are very few votes refuting this label, in contrast to the Normal class, which has both strong support and refutation. The conflict apparent in characterizing this as a Diseased patient's record thus has obvious roots in the rule base.
Whether a user of the decision support system would wish to use a record with such conflict is strongly
tied to the application area. A user may proceed to make a decision incorporating both the characterization
shown here and the new knowledge that there are strong conflicts relating to this record, based on an
understanding of the relationship between conflict and reliability; conversely, as in the simple conflict
example discussed previously, the user may decide that further analysis is required.
D (Diseased)
Compound rule: (A is M H and S is M H) and (OP is M H) and (S is M H) and (EA is H) and (S is F)

Assertion    Implied When . . .
 0.846       A is M H and S is M H
 0.813       OP is M H
 0.791       S is M H
 0.787       EA is H
 0.786       A is M H and EA is H
 0.730       A is M H and CP is A
 0.675       CP is A
 0.391       A is H and EA is H
 0.375       A is H and S is M H
 0.340       A is H and CP is A
−0.590       S is F

 0.626       Support for D (Chosen Label Value)
 0.667       Confidence
 0.053       Conflict

N (Normal)
Compound rule: (A is M H and S is F and T is L) and (CP is A) and (EA is H) and (S is M H) and (OP is M H)

Assertion    Implied When . . .
 0.875       A is M H and S is F and T is L
 0.852       S is F
 0.840       A is M H and S is F
 0.833       A is M H and S is F and FBS is M
 0.833       A is M H and S is F and CA is L
 0.810       A is M H and S is F and RECG is M L
 0.580       A is M H and C is L
 0.375       A is H and TA is M
−0.274       A is H and S is M H
−0.277       A is H and CP is A
−0.308       A is H and EA is H
−0.482       A is M H and CP is A
−0.490       CP is A
−0.570       A is M H and EA is H
−0.666       EA is H
−0.673       S is M H
−0.706       OP is M H
−0.720       A is M H and S is M H

 0.053       Support for N

Figure 10.15: Heart Disease Classification Conflict Rule Set
Explored Rules and Heart Disease
It is instructive to examine the rules themselves, as they display relationships such as an assertion of Normal with 0.852 support simply because the patient is female. Another unsurprising relationship triggered by this record is that between low cholesterol and a Normal classification even while age is Medium High; this agrees with innumerable articles in the popular press.
Clear phrasing of these relationships will aid in understanding and acceptance of a system of this type
by professionals in the area of application.
10.2.7
Interactive Exploration — Brushing and Selection
A further means of exploring the interaction between rules and input values may be made available through
brushing.
When holding the mouse pointer over any rule in the list (for example, the list shown in Figure 10.12),
the associated input membership functions for that rule will highlight (in this case, the appropriate membership functions shown in Figure 10.13).
If this rule is selected (clicked), the highlight will remain after the mouse pointer is moved away, until
the rule is selected a second time, or another rule is selected.
In this manner the interplay between multiple rules and their input values can be shown.
Similarly, brushing an input membership function will highlight all of the rules in the associated list,
allowing an exploration of the relationship from input values through rules.
10.2.8
“What If?” Decision Exploration
A stream of output values is produced as a decision is explored, as shown in Figure 10.16, in which the results of several “what if?” suggestions exploring the effects of changing various input feature values are shown.
This activity is expected to be useful in the context of exploring the consequences of input value
changes in the data space. Consider the case where a patient has gone to a clinic and is exploring their
medical status with a physician with respect to heart disease. The PD/FIS system may have classified this
patient as being at risk for disease.
In order to explore this with the patient, the physician may ask the system “what if — the patient
reduced their cholesterol count?” by adjusting that input parameter. The PD/FIS system would then
respond with the suggested characterization involving the new data.
F1      F2       Suggestion   Confidence   Conflict   # of Rules Fired
40.0    −20.0    A            0.867        0.0        6 of 54
30.0    20.0     A            0.867        0.0        6
10.0    −20.0    A            0.841        0.0        6
10.0    −10.0    A            0.658        0.028      7

Figure 10.16: Covaried Classification With High Confidence Summary - Exploration
Upon examining the rule base related to the new suggestion, the physician may note that the highest factor still supporting heart disease risk may be a low number of hours per week of exercise. The
physician may then ask “what if — exercise hours per week were adjusted?”, and again a new suggested
characterization would be presented.
In this way, the decision maker can use the system to evaluate various possible courses of action, based
on an interactive query–and–response work-flow involving the rule base and presented suggestions.
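A schematic sketch of this query-and-response loop follows (Python; classify() and the feature name are hypothetical stand-ins for the PD/FIS interface, not the thesis code):

    # Re-run the classifier on a copy of the record with selected inputs altered.
    def what_if(record: dict, classify, **changes):
        modified = {**record, **changes}
        return classify(modified)  # e.g. (suggestion, confidence, conflict)

    # Example: explore the effect of a reduced cholesterol reading.
    # suggestion, confidence, conflict = what_if(patient, classify, Chol=150.0)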
10.3
Discussion
The DSS presented here exhibits the main features required of a decision support system: exceptional
transparency and reliable confidence.
10.3.1
Transparency
The transparency of the system is maintained through interactive inspection of the relationships between
input data, rules fired and output classification values.
The user is directed through the chain of inference from the highest level (the suggestion and summary
values) back through the rules and input data values to the underlying statistics framing the input data in
terms of the training data set. At each level, the system relays the complexity of the decision made by
summarizing the confidence, conflict and number of rules fired.
This functionality is critical in a DSS, as only through the presentation of complete transparency will a
user be able to understand the reasoning by which a decision has been presented. In the context of a larger
decision being made from several sources (of which the DSS is but one), total transparency is required in
order that aspects of the decision involving a deep appreciation of the interplay between measured data
values can be brought into the context of the final decision.
Without the ability to inspect, for instance, the effect that gender has on heart disease, it is impossible
for a user of the system to be able to suggest a course of action involving any other parameter, such as
cholesterol level, as this secondary attribute needs to be taken into account in a gender-specific way.
Similarly, if a patient who has a family history of heart disease is being examined, it is important to be able to correlate both gender and other factors into a larger history. This form of exploration must be accessible and supported in order that a larger understanding of the import of a particular patient's data record may be fully brought to light.
10.3.2
Confidence
The confidence in the system is calculated using a metric shown to correlate well with true system failure. This metric provides the user a real sense of how likely the system is to be suggesting an incorrect characterization, as measured using the internal assertion weightings of the algorithm.
As seen in Chapter 9, this provides a very good estimate of the probability of failure. The evaluation
of the system in Chapters 7 and 8 shows that the system performance is strong for both the tested synthetic
and real data.
10.4
Conclusions
The DSS presented in this chapter captures the most important features required for confident decision
analysis and exploration.
This chapter has provided an overview of the exploratory mechanism present in the FIS. Several
mechanisms have been presented by which an investigative user may determine rationale supporting a
presented characterization. A graphical display for the presentation of input value membership in the
fuzzy sets driving the system has been described.
The overall statistical nature of the FIS system allows the presentation of the statistical metric underlying each rule used; simple weighted combinations of rule assertion values result in an assertion for the class overall. This simple relationship lends itself to inspection by means of iterative drill-down analysis.
The hierarchical structure of the data, as shown in Figure 10.1, naturally supports an interactive exploratory work-flow. The stages of inspection supported by the drill-down within the FIS correspond to the relative abstraction of the cognitive model the decision maker will employ. Branching shown in the hierarchy shows the amalgamation of information from different sources at each level. Considering the hierarchy in a bottom-up fashion, each branch can be considered a simplification and summarization of the data space, allowing high-level decisions to be made based on the underlying data, but without needing to directly evaluate each fact.
The weights and the rules may both be inspected, and the previously mentioned confidence value can
be used to weight the overall suggestion in terms of a larger decision making process.
The next obvious step in the design process is to evaluate this decision making architecture with real decision makers. This future work will be done under separate cover, adapting the system to a specific application area such as clinical electromyographic decision support.
Chapter 11
Conclusions
“The best thing for being sad,” replied Merlyn, beginning to puff and blow, “is to learn
something. That’s the only thing that never fails. You may grow old and trembling in your
anatomies, you may lie awake at night listening to the disorder of your veins, you may miss
your only love, you may see the world about you devastated by evil lunatics, or know your
honour trampled in the sewers of baser minds. There is only one thing for it then — to learn.
Learn why the world wags and what wags it. That is the only thing which the mind can
never exhaust, never alienate, never be tortured by, never fear or distrust, and never dream
of regretting. Learning is the only thing for you.”
— T.H. White, “The Once and Future King”
Evaluation of the PD/FIS system has shown several interesting strengths and weaknesses.
The system meets the goals required for a decision support system. Specifically, it provides suggestions to a decision making user with sufficiently high performance to be useful, coupling each decision
with a characterization of the degree of confidence the suggestion can bear while providing a transparent,
explorable, explainable framework describing how the decision was formed.
11.1
Transparency
The PD based FIS provides a type of transparency not common among decision support systems. It
provides not only the transparency of a rule based system, but also allows inspection of the means by
which the MME quanta were created as well as the assurance (backed up by occurrence based probabilities
if desired) that the rules found have statistical significance.
The rules themselves are formed based on the ability to distinguish input data values from each other
in order to provide a labelling. In contrast to systems such as BP, this allows a user to see several important
aspects of the underlying rationale: exactly what parts of the current input data vector carry the greatest
information; the path of inference supporting the suggested characterization; and the logic by which that
path was formed.
This transparency in PD, along with the confidence measure, allows a drill-down based analysis to
be performed supporting a user-driven decision exploration. Through such an interactive exploration,
the complexity of the decision making is made manageable. A sufficiently simple summary can be presented to the user at the highest level while still allowing the underlying data to be available as context.
This reduces the cognitive load decision makers will experience when using the system. The drill-down methodology allows the decision maker to inspect any part of the decision process about which they are curious. This allows the system to be regarded as trustworthy, as a “white box” presentation is maintained.
11.2
Performance
Through evaluation of the PD/FIS system on a variety of continuous and mixed type data, it has been shown that the FIS performance is generally measurably lower than that of “natively continuous” classifiers such as BP. For decision support purposes, however, the transparency of the PD/FIS makes it the superior candidate.
When the training data set is small, as with the thyroid and heart-disease data sets, the PD/FIS, and especially the FIS(O/A) implementation, may have a measured performance equal to that of BP, as the scarcity of the data provides insufficient resources for BP training. For these reasons, although the classification performance of the PD/FIS system may be lower than that of other classifiers, its outstanding qualities as a means of supporting explainable decisions make it preferable in this context.
Reviewing the PD/FIS systems tested, the performance of the FIS(O/A) system is particularly interesting as it is significantly higher than that of the PD system when run on continuous and
mixed-mode data. The addition of the fuzzy input membership functions along with the occurrence based
rule weighting improves the PD system and allows classification of more input values even in complex
data class distributions such as the spiral data set examined here.
This shows that the addition of fuzzy attributes and a re-thinking of the rule weightings provides a
useful improvement over the raw PD system from which the FIS(O/A) algorithm is derived.
The performance of the FIS(O/A) classifier is therefore sufficiently high that, coupled
with the confidence measure presented, a reliable DSS can be constructed. This is superior to any design
using a system with slightly higher classification performance but lower transparency.
11.3
Confidence
A confidence measure has been provided which is a good indicator of true reliability. This has been
evaluated over a series of data class distributions and found to be stable, but related to the amount of
information available in the problem.
This confidence measure allows the user to incorporate a suggestion from the DSS into a larger decision context, providing a means of differentiating between occasions when the suggestion is trustworthy,
and those when the probability of an erroneous suggestion has become high. This design allows the system performance to degrade gracefully as input values carry the inference into regions of the decision
space which are poorly understood or have a large degree of conflict. Such grace is required in a DSS,
as without a means of indicating the variability in confidence as a function of the input data and internal
system state, the user cannot gain trust in the decisions made.
11.4
Decision Exploration
Decision exploration describes the means by which a user will learn about a decision space and thereby
come to understand and trust a decision support system. The FIS(O/A) based DSS described
here provides a “drill-down” based exploration metaphor which accurately represents the hierarchy of the
cognitive model, allowing a decision maker to explore the problem space in the context of a suggestion
being made on a set of input values.
This form of exploration allows the user both to relate a proposed suggestion to other possible labellings and to determine how a suggestion is supported or refuted. By using both positive and negative logic rules, the conflicting support for multiple classes is clearly shown; through the use of fuzzy inference methodology, this conflict is summarized through simple suggestions with accompanying confidence values.
This provides a user with a trustworthy, transparent system which can provide characterization based
suggestions across a variety of data domains.
11.5
Quantization Costs
The FIS adaptation of the PD system reduces the quantization cost in the evaluated problems by recasting
the crisp quantization of the MME bins into fuzzy input membership functions (i.e., the addition of ramps).
A further investigation into the performance of a class-dependent quantization scheme may prove fruitful,
as the quantization bin bounds are not in any way driven by the optimal decision surface in the system
described here.
The BP based classifiers provide an interesting upper bound describing how high the cost of quantization is; a further evaluation relative to these classifiers will allow more discussion on the recovery of
this cost.
11.6
Future Work
Several major directions are possible as further goals of this research.
A major remaining aim is the adaptation to a particular application area, specifically characterization of the disease relationship of motor unit potentials within an electromyographic domain. Beyond this direction, the following research topics are immediately accessible as extensions of this work:
Class Dependent Quantization: Much of the quantization discussion during this work has shown that
the decision surfaces of classifiers such as MICD are not reflected in the bin boundaries of MME.
The use of a class-dependent quantization scheme would allow the exploration of the benefits of
driving the bin boundaries from the observed inter-class bounds, independently by feature.
Discrete and Mixed Mode Performance Analysis: The PD(O/A) and FIS(O/A)
algorithms have shown a marked improvement in classification performance over the PD algorithm
in continuous-valued data. A further topic of research will cover the performance analysis of these
algorithms on discrete data to determine the relative utility of WOE and occurrence weighting with
these data landscapes.
Multi-Part Classification: Of particular interest in electromyographic characterization is the production
of a composite suggestion from the analysis of a set of related input vectors. Such a decision could
be built upon the system described here by running the PD/FIS system for each input vector and combining the results, essentially producing an overall suggestion from a collection of individual ones, in a way similar to that by which rule firings are combined to achieve an overall assertion in the system described here.
Calculation of the confidence for such a scheme would be required; at this point, this remains an open problem, albeit one which could be addressed using techniques adapted from those shown here.
Analysis of Noise Due to Missing Data: While the PD system has been designed to function in cases of
missing data values, an authoritative analysis of the effects of missing data values within the PD/FIS
system remains to be done. In particular, the effects of an increase in missing data fields on the quality of the analysis are work which demands attention.
Calculation of a Fuzzy Confidence Value: While the performance of the confidence value suggested
here correlates highly with measured reliability, an interesting topic of analysis would be the evaluation of a fuzzy prediction of confidence, based on δ/τ or other internal values, and a comparison
of these results with those of the Cδ/τ Probability based confidence measure.
“Multi-Bin” Fuzzy Input Set Definitions: The discussion of the fuzzy input set adaptations in this work has concentrated on using each MME bin as the basis for a fuzzy input set. It would be equally
possible to join multiple MME bins together and use a rule calculated on the joined bin, allowing
membership values to be computed based on a relative measure of association which may differ for
each composite MME bin. This would produce fuzzy membership functions from a contiguous join
of MME bins, and produce a lower number of rules. These rules may describe the major features of
a database more succinctly than the many rules based on smaller divisions.
This approach would be very useful for forming characterizations of a data base in the form of a rule
summary. This in turn could form the basis of a data mining system used to explore and identify
major patterns within a data set.
Multi-Vote Classification: The combination of a suggested label and a confidence value raises the possibility of an automated means of collecting multiple votes into one composite decision, as mentioned
in Chapter 9. Such a system could incorporate the suggestions from a number of weighted classifiers into an overall vote, increasing the likelihood of a correct labelling. The confidence measure
just described could then be used as a weighting, allowing the construction of a collaborative voting
system using reliability estimation, rather than one simply based on majority voting.
Multi-voting classifiers have been described in Wanas and Kamel (2002); Levitin (2002, 2003) and
Montani et al. (2003), and have an ongoing popularity as a means of producing an overall system
with a higher classification performance than that of any of the composite systems. While many
such systems combine “votes” from each sub-classifier simply through majority voting, the ability
to weight each vote by its estimated reliability provides an interesting means of letting a classifier
“abstain” when confidence is low. Such a system is described in Cordella et al. (1999); a minimal sketch of this reliability-weighted voting idea follows.
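The sketch below (Python; the abstention threshold and the vote interface are illustrative assumptions, not a description of any cited system) shows confidence-weighted voting with abstention:

    from collections import defaultdict

    # Combine (label, confidence) votes, letting low-confidence classifiers abstain.
    def weighted_vote(votes, abstain_below=0.5):
        tally = defaultdict(float)
        for label, confidence in votes:
            if confidence >= abstain_below:   # abstain when confidence is low
                tally[label] += confidence    # weight the vote by its reliability
        return max(tally, key=tally.get) if tally else None

    print(weighted_vote([("D", 0.97), ("N", 0.55), ("D", 0.40)]))  # D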
Clinical Electromyographic Characterization: The author’s main interest as an immediate application
of this work is the construction of a clinical DSS based on the techniques presented here. A combination of the data analysis and suggestion just discussed as well as domain knowledge of the
electromyographic clinical examination will allow the work described here to be presented in the
form of a diagnostic suggestion system for muscular disease. Such a system would allow the analysis of diagnostic suggestion clarity and accuracy as a spectrum of disease involvement is presented.
The characteristic changes in input values due to the progression of a disease must be taken into
account by a clinical user, and therefore a vital part of the adaptation of this system to clinical work
is a discussion and analysis of how disease progression is apparent in the DSS suggestions based on
this particular data domain.
Part IV
Appendices
Appendix A
Mathematical Notation
This appendix lists the meanings of the variables used in the equations in this work, including their points of introduction. Greek letters are arranged at the beginning of the table, followed by Roman letters and acronyms beginning with Roman letters.
A.1
Greek Letter Variables
δ the difference between the two best output votes (τ and τ2) from the FIS. Introduced in equation (9.3) in Section 9.2.1.
δ_k is used in equation (5.10) as part of the discussion of the MICD classifier in Section 5.7.2.
θ index to the spiral of equation (5.8) in Section 5.4 on page 58.
κ the “strength” of the covaried data. Used in equation (5.2) to produce Cov_ij. In this work, κ = 0.6.
µ when used as a variable, a mean. When used as a function, µ indicates the membership function of some fuzzy set.
ν_j the number of discretely observed values (i.e., bin assignment IDs when considering continuous data) observed in column j.
ξ shape parameter used in equation (5.7) to control the degree of spread of the Log-Normal class distribution. Described in Section 5.2 on page 56.
π the circularity constant, π = 3.14159265 . . .
ρ scale or acceleration of the generating spiral for the spiral class distribution described in equation (5.8) in Section 5.4 on page 58.
σ standard deviation.
τ the best of the output votes A_k asserted by the FIS. Introduced in Section 9.2.1 and used throughout the discussion of reliability in Chapter 9.
A.2
Roman Letter Variables
a_{x^m_l k} the assertion produced by firing the rule x^m_l associated with class k.
A_k the defuzzified assertion produced in support or refutation of class k. Defined originally in Section 4.3 in Chapter 4, on page 48.
c_i the class mean calculated for a unimodal distribution using equation (5.5) and then used to generate a separation value, s_i. Described on page 56.
Cov_ij a covariance matrix constructed to generate unimodal covaried data. Described on page 54.
C_MICD MICD based confidence, as described by equation (9.5).
C_{τ[0...1]} bounded normalized confidence based on examination of τ, defined in equation (9.5).
C_{δ/τ Probability} probability based confidence, introduced in equation (9.5).
E(Θ) expectation operator; provides the “expected value of Θ”.
E_{x^m_l} the estimate of the number of input records expected at a given order x^m_l. This value is defined in equation (3.9) on page 41.
e the natural constant, e = 2.71828 . . .
e_{x^m_l} the expected number of occurrences of event x^m_l. The construction of this value is shown in equation (3.4). See also o_{x^m_l}.
i a looping variable used for several purposes with local meaning only.
j usually indicates the column index, i.e., j ∈ [1 . . . M].
K the number of true classes (labels) to which the data may be assigned. Note that there is always an extra “” class also, not included in K. See also y_k and Y.
k an index into the number of labels, K.
L the list of rules used within the workings of the independent or “I” fuzzy rule firing scheme outlined in Section 4.4.2.
M the list of highest order match rules used within the workings of the independent or “I” fuzzy rule firing scheme outlined in Section 4.4.2.
M the number of input columns in the training and testing data set, excluding the label column.
N the number of rows in the training data set.
N(0, 1) indicates a (Gaussian) Normal distribution with zero mean and unit standard deviation.
o_{x^m_l} the number of observed occurrences of event x^m_l. See also e_{x^m_l}.
Pr(x) the probability of event x.
P the weighted performance measure described in equation (5.12) in Section 5.8.
Q the number of MME quantization intervals into which continuous data has been discretized.
q_j the number of MME quantization intervals into which continuous data in column j has been discretized. For the tests in this work, q_j = Q ∀ j ∈ [1 . . . M].
r_0 initial distance from the origin for the spiral equation (5.8) of Section 5.4 on page 58.
r_{x^m_l} the adjusted residual of the current input event. Defined in equation (3.2) on page 35. If |r_{x^m_l}| exceeds 1.96, the multi-order event x^m_l defines a statistically significant pattern. See also z_{x^m_l}.
s_i a separation value from S, the list of separations. This is calculated in equation (5.6) on page 56.
T_k transform generated to colour unimodal N(0, 1) data with a supplied covariance. Defined on page 55 in equation (5.4).
T_err training error introduced in Section 5.5.
V_{x^m_l K} a single value produced by firing rule x^m_l in support or refutation of class K. This will be evaluated along with all other such values for a given class to produce an assertion A_k.
V a vector of variance values used to generate unimodal covaried data.
v_{x^m_l} the variance of z_{x^m_l}, defined in equation (3.3) on page 35, and used in the discussion of Bimodal data mode separation on page 57.
WOE the weight of evidence of an event, defined in equation (3.7) on page 37.
x^m_l an input event at some order m, m ∈ [1 . . . M + 1], consisting of one or more input columns. In order to be considered as a pattern, it must also contain a label value. The index l indicates the ordinality in the set of the order-m events.
x*_l the portion of x^m_l consisting only of the vector of input column data after the label column has been removed; that is, x*_l ∪ Y = x^m_l.
x⊕_l a composite event built from the x*_l portions of all rules firing which match a given input event x. This value is used in the construction of conditional probability from weight of evidence, as described in equation (9.8) in Chapter 9.
W_k, w_k and w_k are all used in equation (5.10) as part of the discussion of the MICD classifier in Section 5.7.2.
Y the output label column.
y_k the kth value of the label, k ∈ [1 . . . K].
Y=y_k a notation describing the considered assignment of label k (k ∈ [1 . . . K]) to the label column.
z_{x^m_l} the calculated residual of the current input event. Defined in equation (3.1) on page 35.
Appendix B
Derivation of PD Confidence
The following proof describing (9.8) was derived by Pino (2005), and is reproduced here for reference in
the discussion of confidence measures within PD and the FIS.
A “composite input vector” is formed by the connection of the input portions of two patterns referring
to distinct columns. These patterns have conditional independence as conditioned by the class label; that
is:
Pr(x_a, x_b | L) = Pr(x_a | L) × Pr(x_b | L)    (B.1)
The result is used to produce conditional probabilities of class membership based on the composite
WOE produced during the firing of PD rules, as described in equation (9.8) in Chapter 9.
Proof. Derivation of Confidence from WOE. The proof begins by considering (3.7), which defines “weight of evidence” (WOE), and proceeds to a form representing the conditional probability of a given label (Y=y_k), given a composite input vector x⊕_l constructed as the union of all rules matching a complete input pattern x^m_l:

x⊕_l = ⋃_{x*_l ∈ M} x*_l    (B.2)

where M is the list of all patterns selected to match the current input pattern using the algorithm described in Section 3.4.1.
The derivation begins

WOE = log_β { [ Pr(x⊕_l, Y=y_k) Pr(Y≠y_k) ] / [ Pr(Y=y_k) Pr(x⊕_l, Y≠y_k) ] }    (B.3)

    = log_β { [ Pr(x⊕_l, Y=y_k) (1 − Pr(Y=y_k)) ] / [ Pr(Y=y_k) (Pr(x⊕_l) − Pr(x⊕_l, Y=y_k)) ] }    (B.4)

Equation B.4 is the form used in Wang (1997, pp. 94).

Now letting Φ = (1 − Pr(Y=y_k)) / Pr(Y=y_k),

    = log_β { [ Pr(x⊕_l, Y=y_k) / (Pr(x⊕_l) − Pr(x⊕_l, Y=y_k)) ] · Φ }    (B.5)

    = log_β { [ 1 / (Pr(x⊕_l)/Pr(x⊕_l, Y=y_k) − 1) ] · Φ }    (B.6)

but Pr(Y=y_k | x⊕_l) = Pr(x⊕_l, Y=y_k) / Pr(x⊕_l), so

    = log_β { [ 1 / (1/Pr(Y=y_k | x⊕_l) − 1) ] · Φ }    (B.7)

    = log_β { Φ / (1/Pr(Y=y_k | x⊕_l) − 1) }    (B.8)

Rearranging for Pr(Y=y_k | x⊕_l), we get

WOE = log_β { Φ / (1/Pr(Y=y_k | x⊕_l) − 1) }    (B.9)

β^WOE = Φ / (1/Pr(Y=y_k | x⊕_l) − 1)    (B.10)
Letting α = Pr(Y=y_k | x⊕_l) and γ = β^WOE, we continue

γ = Φ / (1/α − 1)    (B.11)

1/α − 1 = Φ / γ    (B.12)

1/α = Φ/γ + 1 = (Φ + γ) / γ    (B.13)

α = γ / (Φ + γ) = 1 / (Φ/γ + 1)    (B.14)

so, as Φ = (1 − Pr(Y=y_k)) / Pr(Y=y_k), we therefore conclude

Pr(Y=y_k | x⊕_l) = 1 / (Φ/β^WOE + 1)    (B.15)

or

Pr(Y=y_k | x⊕_l) = 1 / [ (1/β^WOE) · (1 − Pr(Y=y_k))/Pr(Y=y_k) + 1 ]    (B.16)

Note that in this work all logarithms are base 2 (i.e., β=2).
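As a numeric check of equation (B.16), the following sketch (Python; β = 2 as noted above, and the function name is illustrative) recovers the conditional probability from a composite WOE value and a class prior:

    # Equation (B.16): conditional probability recovered from weight of evidence.
    def prob_from_woe(woe: float, prior: float, beta: float = 2.0) -> float:
        phi = (1.0 - prior) / prior       # prior odds against the class
        return 1.0 / (phi / beta ** woe + 1.0)

    print(prob_from_woe(0.0, 0.5))  # 0.5: zero evidence leaves the prior untouched
    print(prob_from_woe(3.0, 0.5))  # 0.888...: positive WOE raises the probability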
Appendix C
Conditional Probabilities Derived From Synthetic Data
In order to discuss possible confidence measures, it is logical to refer back to the underlying probability
distributions by which the data examined here were generated, as a conditional probability based on the
underlying data scheme should exhibit some relationship with the final confidence.
C.1
Confidence and Conditional Probability
The conditional probability of assignment of a given point to each distribution gives us the “true” confidence of assignment, based on the maximum amount of information available by measuring the point
location.
C.2
Conditional Probability As Calculated Using z-Scores
Considering the synthetic data discussed in Chapter 5, it is possible to translate data point locations in the
n-dimensional spaces examined for performance measurement back into z-scores relative to the various
generating means. If this is done, we can then generate conditional probability values from the known
PDFs (probability density functions) that generated the data originally.
[Figure C.1: Logically Constructing Conditional Probability from PDFs. Two overlapping class PDFs, A and B, are evaluated at a point x; with PDF values of 2.8 and 1.2 at x, P(A|x) = 2.8 / (1.2 + 2.8) = 0.7 and P(B|x) = 1.2 / (1.2 + 2.8) = 0.3.]

As shown in Figure C.1, the conditional probability of true association with each class at any given z-score along the line defining two 1-dimensional distributions is simply

P(y_k | x) = P(y_k) / Σ_{i=1}^{K} P(y_i)    (C.1)
for any class k of the set of K possible classes. This is easily generalizable to n-dimensional cases by
separately considering the Euclidean distance to the K means.
Performing this calculation allows us to produce conditional probability values for any function for which we know the a priori PDF definition. Surfaces showing the conditional probability values for two-dimensional versions of the covaried, bimodal and spiral data are shown in Figures C.2, C.3 and C.4, respectively.
By examining the covaried data in Figure C.2, it can be seen that as one moves away from the mode of class A, the lowest conditional probabilities are found as one approaches class B, as expected. Points which are distant from both modes, such as the point at location (−3, −3) at the left-hand side of the covaried data in Figure C.2, have the highest conditional probability. This also logically follows, as points distant from both A and B, while unlikely to occur at all (as shown by the low probabilities in their PDF values), have a strong association with the nearest class mode as a condition of their (rare) occurrence.
That is, given that a point (−3, −3) actually occurs, the probability of its association with class A is very much higher than the probability of association with class B.
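A small sketch of this construction for two 1-dimensional Gaussian classes follows (Python; the class means are illustrative, not the thesis values):

    import math

    def gauss_pdf(x, mean, sigma=1.0):
        return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    # Equation (C.1): conditional probability from the class PDF values at x.
    def conditional(x, means):
        densities = [gauss_pdf(x, m) for m in means]
        return [d / sum(densities) for d in densities]

    # With PDF values of 2.8 and 1.2 at a point, P(A|x) = 2.8/(2.8+1.2) = 0.7,
    # matching the construction shown in Figure C.1.
    print(conditional(0.8, [0.0, 2.0]))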
Similarly, conditional probability surfaces can be constructed for bimodal and for spiral data, as shown
[Figure C.2: Covaried Conditional Probability Surfaces. Class A and B PDFs and their conditional probability surfaces plotted over the two-dimensional feature space.]

[Figure C.3: Bimodal Conditional Probability Surfaces. Class A and B PDFs and their conditional probability surfaces for the bimodal data.]

[Figure C.4: Spiral Conditional Probability Surfaces. Class A and B conditional probability surfaces for the spiral data.]
in Figures C.3 and C.4 respectively, simply by summing the probability of association with all modes and adapting (C.1) to include this extra sum, or

P(y_k | x) = P(y_k) / ( Σ_{j=1}^{N_modes} Σ_{i=1}^{K} P(y_i) )    (C.2)

C.3
Calculating z-Scores from Data Points
For the synthetic data analyzed in this work, we can calculate the z-score because the model by which
the data was generated is available. For the covaried data, we simply need to reverse the effects of the
applied covariance added by equation (5.4) from Chapter 5 and whiten the data according to the known
mean vector and covariance matrix. Each conditional probability is then calculated separately relative to
each class, much as is done in the MICD classification algorithm described earlier.
In the case of bimodal data, the process is much the same. The distance relative to each mode is,
however, converted separately into a z-score before applying (C.2) and the per-class conditional probability
is produced by summing the probabilities among all the modes for each class.
For spiral data, some care must be taken in order to ensure that both points in a positive and in a
negative rotation around the origin are taken into account; other than this small concern the PDF can be
generated by simply remembering that the N(0, 1) distribution centred at each point of the arm of the
spiral extends around the origin at a fixed radius.
Using the described method, “true” confidence values in the form of conditional probability assignments to each class can be calculated for all the synthetic data considered in this work.
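A sketch of the whitening step for the covaried case follows (numpy; the interface is an assumption, not the thesis code):

    import numpy as np

    # Map covaried samples back to N(0, 1) space using the known generating
    # mean vector and covariance matrix (cov = L @ L.T).
    def whiten(points: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> np.ndarray:
        L = np.linalg.cholesky(cov)
        return np.linalg.solve(L, (points - mean).T).T

    # The rows of the result are z-scores, from which the generating PDF
    # yields per-class conditional probabilities as in equation (C.1).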
Appendix D
Further Tables Regarding Reliability Statistics
This appendix contains summary information calculated for the relationship between measured and expected confidence, in support of the reliability discussion in Chapter 9.
The measures included here are the mutual information, symmetric uncertainty and interdependency redundancy measures, as well as the summaries for Spearman ranking.
The discussion here indicates some of the issues with using a non-rank based measure. Due to the problems discussed in this appendix, only the Spearman rank correlation is used in the text.
For information regarding the effects of quantization on these measures, please see the accompanying discussion in Appendix E.
D.1
Symmetric Uncertainty
Symmetric uncertainty is a [0 . . . 1] bounded information based measure which describes the degree of
correlation between two values.
The name “symmetric uncertainty” is slightly confusing because of the trend of the result reported, as
the symmetric uncertainty value of independent data is 0, while the value reported for data with a perfect
correlation is 1.
While this would suggest that a better name would imply the degree of “certainty,” the standard in the
literature is to use the name “symmetric uncertainty.”
This usage seems to stem from the discussion of this measure in Press et al. (1992, pp. 634), where the main discussion is in terms of uncertainty coefficients, which already have this trend of 0 indicating independent data and a rising trend as data moves away from independence.
The calculation of symmetric uncertainty is performed using

SU[A, B] ≡ 2 ( (H[a] + H[b] − H[a, b]) / (H[a] + H[b]) )    (D.1)

which is defined in Press et al. (1992, pp. 634).
The symmetric uncertainty measure is an entropy-normalized version of the mutual information between the two values A and B.
D.2
Mutual Information
Mutual information (MI[A, B]) is defined to be

MI[A, B] ≡ H[a] + H[b] − H[a, b]    (D.2)

and may also be calculated using

MI[A, B] = Σ_{(a,b) ∈ A×B} p(a, b) log [ p(a, b) / (p(a) p(b)) ]    (D.3)

where p(a) and p(b) are probability mass functions, as described in Duda et al. (2001, pp. 632).
The entropy definition

H[A] = − Σ_a p(a) log p(a)    (D.4)

formalized in Shannon (1948a,b) is extended to joint entropy (e.g., Moon, 2000) by simply considering the joint p(x, y) instead of a single probability mass function. This is calculated using

H[A, B] = − Σ_{(a,b) ∈ A×B} p(a, b) log p(a, b)    (D.5)

where p(a, b) is the joint distribution of A and B.
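A compact sketch of equations (D.1) through (D.5) computed from two binned value sequences follows (numpy; the equal-range binning mirrors the scheme used here, and the function names are illustrative):

    import numpy as np

    def entropy(p: np.ndarray) -> float:
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    # Mutual information (D.2) and symmetric uncertainty (D.1) from binned data.
    def mi_and_su(a: np.ndarray, b: np.ndarray, bins: int = 8):
        joint, _, _ = np.histogram2d(a, b, bins=bins)  # equal-range bins
        p_ab = joint / joint.sum()
        h_a = entropy(p_ab.sum(axis=1))    # H[a] from the marginal
        h_b = entropy(p_ab.sum(axis=0))    # H[b]
        h_ab = entropy(p_ab.ravel())       # joint entropy H[a, b] (D.5)
        mi = h_a + h_b - h_ab              # (D.2)
        return mi, 2.0 * mi / (h_a + h_b)  # (D.1)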
D.3
Interdependency Redundancy
Interdependency redundancy of two random variables is calculated as the mutual information normalized by the joint entropy of the pair (from Wong and Ghahraman, 1975; Wong, Liu and Wang, 1976)

R[A, B] = MI[A, B] / (2N H[A, B])    (D.6)

where N is the minimum number of occurrences of A or B.
D.4
Discussion
Upon evaluation, none of the entropy based statistics, namely mutual information, symmetric uncertainty, and the interdependency redundancy measure, have significantly different trends from those of the simpler correlation calculations.
The mutual information, symmetric uncertainty and interdependency redundancy are all calculated over the observed confidence values after binning these values using an equal-range bin scheme with 8 bins per dimension.
An examination of the performance of equal-range and MME quantization schemes in conjunction with these statistics, as well as the effect of Q on these measures, is explored in Appendix E for readers interested in this background.
As can be seen by examining the performance of mutual information, interdependency redundancy and symmetric uncertainty, the saturation of the data at (1, 1) leads to a very poor correlation. For these reasons, Spearman rank correlation is used in the discussion in Chapter 9.
Table D.1: Mutual Information Comparison Summary: Equal Binning, Q=10

             Cδ/τ Probability   Cτ[0...1]   CPD-Probabilistic   CMICD
C   0.125    0.260              0.274       0.352               0.149
    0.250    0.274              0.280       0.352               0.170
    0.500    0.206              0.224       0.296               0.182
    1.000    0.236              0.254       0.268               0.257
    2.000    0.178              0.136       0.150               0.157
    4.000    0.067              0.073       0.086               0.188
    8.000    0.000              0.000       0.001               0.000
    µ        0.174              0.177       0.215               0.158
B   0.125    0.556              0.583       0.590               0.457
    0.250    0.866              0.874       0.961               0.624
    0.500    0.703              0.823       0.937               0.758
    1.000    0.610              0.708       0.836               0.738
    2.000    0.337              0.434       0.508               0.498
    4.000    0.071              0.064       0.106               0.083
    8.000    0.000              0.000       0.000               0.000
    µ        0.449              0.498       0.563               0.451
S   0.125    0.023              0.023       0.028               0.033
    0.250    0.037              0.018       0.026               0.026
    0.500    0.031              0.037       0.062               0.038
    0.750    0.033              0.023       0.076               0.040
    1.000    0.040              0.021       0.124               0.076
    µ        0.033              0.024       0.063               0.042
Table D.2: Symmetric Uncertainty Comparison Summary: Equal Binning, Q=10

             Cδ/τ Probability   Cτ[0...1]   CPD-Probabilistic   CMICD
C   0.125    0.134              0.143       0.189               0.077
    0.250    0.145              0.155       0.203               0.094
    0.500    0.116              0.133       0.158               0.091
    1.000    0.150              0.144       0.155               0.179
    2.000    0.122              0.122       0.103               0.181
    4.000    0.000              0.000       0.000               0.000
    µ        0.111              0.116       0.135               0.104
B   0.125    0.000              0.000       0.000               0.000
    0.250    0.000              0.000       0.000               0.000
    0.500    0.000              0.000       0.000               0.000
    1.000    0.000              0.000       0.000               0.000
    2.000    0.000              0.000       0.000               0.000
    4.000    0.000              0.000       0.000               0.000
    µ        0.000              0.000       0.000               0.000
S   0.125    0.004              0.002       0.004               0.000
    0.250    0.001              0.003       0.001               0.001
    0.500    0.014              0.015       0.013               0.007
    0.750    0.027              0.002       0.024               0.000
    1.000    0.057              0.000       0.049               0.000
    µ        0.021              0.004       0.018               0.001
Table D.3: Interdependency Redundancy Comparison Summary: Equal Binning, Q=10

             Cδ/τ Probability   Cτ[0...1]   CPD-Probabilistic   CMICD
C   0.125    0.004              0.004       0.006               0.002
    0.250    0.004              0.003       0.005               0.002
    0.500    0.003              0.003       0.004               0.002
    1.000    0.003              0.004       0.004               0.003
    2.000    0.004              0.003       0.003               0.004
    4.000    0.002              0.003       0.002               0.007
    8.000    0.000              0.000       0.000               0.000
    µ        0.003              0.003       0.003               0.003
B   0.125    0.012              0.012       0.012               0.009
    0.250    0.015              0.016       0.015               0.012
    0.500    0.015              0.016       0.015               0.012
    1.000    0.014              0.014       0.015               0.012
    2.000    0.009              0.011       0.009               0.010
    4.000    0.003              0.004       0.003               0.003
    8.000    0.000              0.000       0.000               0.000
    µ        0.010              0.011       0.010               0.008
S   0.125    0.000              0.000       0.000               0.001
    0.250    0.001              0.000       0.000               0.000
    0.500    0.000              0.001       0.001               0.001
    0.750    0.001              0.000       0.001               0.001
    1.000    0.001              0.001       0.002               0.002
    µ        0.001              0.000       0.001               0.001
Table D.4: Confidence Correlation Comparison Summary

             Cδ/τ Probability   Cτ[0...1]   CPD-Probabilistic   CMICD
C   0.125    0.496              0.525       0.593               0.378
    0.250    0.495              0.528       0.593               0.398
    0.500    0.423              0.453       0.549               0.399
    1.000    0.438              0.461       0.499               0.424
    2.000    0.333              0.357       0.373               0.404
    4.000    0.166              0.117       0.140               0.113
    8.000    0.001              −0.002      0.018               −0.001
    µ        0.336              0.348       0.395               0.302
B   0.125    0.743              0.486       0.715               0.656
    0.250    0.715              0.406       0.701               0.628
    0.500    0.665              0.375       0.672               0.585
    1.000    0.583              0.270       0.579               0.511
    2.000    0.426              0.383       0.389               0.354
    4.000    0.205              0.099       0.132               0.094
    8.000    ––                 ––          ––                  ––
    µ        0.556              0.336       0.531               0.471
S   0.125    0.056              0.065       0.063               0.013
    0.250    0.094              0.051       0.076               −0.041
    0.500    0.119              0.100       0.125               0.066
    0.750    0.181              0.127       0.199               0.108
    1.000    0.202              0.109       0.259               0.190
    µ        0.130              0.091       0.144               0.067
Appendix E
Statistical Measure Performance Under Quantization
This appendix summarizes the performance of the mutual information, symmetric uncertainty and Spearman ranking statistics as evaluated under various quantization schemes for Gaussian and for Uniform data distributions.
Spearman ranking does not, of course, require binning, as the rank correlation operates directly on continuous values. It is included in this comparison simply to demonstrate what the effects of binning are, in order to illuminate the behaviour of the other measures.
In each distribution, 1000 points are generated. The covariance in the distribution is adjusted from 0 (no covariance/independent data) to 1 (complete dependency of features).
E.1
Performance Tables
Tables E.1 through E.5 display the numerical results for calculations using 1000 points for the statistics mutual information, symmetric uncertainty and Spearman ranking, using both equal-width and MME binning.
Equal-width bin results are shown in Table E.1 for mutual information data, Table E.3 for symmetric uncertainty and Table E.2 for Spearman ranking.
Similar data binned using MME is displayed in Tables E.4, E.6 and E.5 for mutual information, symmetric uncertainty and Spearman ranking (respectively).
The mutual information statistic is described in equation (D.2) from Section D.2 in Appendix D; symmetric uncertainty is described in Section D.1 in equation (D.1). Both of these are entropy based measures.
Spearman ranking is defined in the main text in Chapter 9, in equation (9.10). Spearman ranking is simply a correlation based on relative ordinal position in the data set.
E.1.1
Entropy Statistics Versus Spearman Ranking
When comparing the results of the Spearman ranking data in Tables E.2 and E.5, it is apparent that the Spearman ranking statistic benefits from being run on the raw data, as would be expected. When discussing Spearman ranking, no binning will be performed.
E.1.2
Choice of Binning Mechanism for Entropy Summary Statistics
Examining the equal-width bin plots in Figures E.2 through E.7, we see that equal-width binning preserves the overall shape of the underlying distribution while gathering occurrence counts regarding similar events.
A comparison with the MME based binning in Figures E.8 through E.13 demonstrates that the MME bins distort the effective shape of the distribution, leaving the distribution looking somewhat rounded, as in Figure E.13.
This feature of MME, while useful for confidence generation in the PD and FIS algorithms, will skew our calculation of confidence equivalency, and should thus be avoided for any discussion of these statistics.
In the discussion of confidence, we will therefore use equal-width bins in order to generate the summary statistics.
E.1.3
Choice of Q for Entropy Summary Statistics
Tables E.1 through E.6 show the effects of the number of quantization bins (Q) on the results.
As can be seen, too low a quantization (such as Q=2) results in very poor performance, as there are too few distinct events in the space to adequately represent a trend. As seen by examining different correlations at Q=2 in Table E.3, there is no change in the statistic until perfect correlation is reached.
Conversely, a Q value which is too high incorrectly reports a great deal of information presence even in cases where there is independence, due to the irregularities which appear when N is relatively small compared to the number of bins.
We must therefore balance Q and N if we are to use any of the non-ranked statistics, and choose as small a Q value as we can in order to get reasonable performance for our choice of N. As we will calculate our statistics over all jackknife runs, N will be 1000, and therefore, from these tables, it would seem that a choice of Q=8 is reasonable, as choosing Q=8 allows us to see the greatest degree of change across the symmetric uncertainty statistic in Table E.3, as well as the greatest change in mutual information as seen in Table E.1.
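The Q-versus-N balance is easy to reproduce; the following sketch (numpy; illustrative, not the code used to build these tables) shows the spurious mutual information reported for independent uniform features as Q grows:

    import numpy as np

    def binned_mi(x, y, q):
        joint, _, _ = np.histogram2d(x, y, bins=q)   # equal-width bins
        p = joint / joint.sum()
        px, py = p.sum(axis=1), p.sum(axis=0)
        nz = p > 0
        return float((p[nz] * np.log2(p[nz] / np.outer(px, py)[nz])).sum())

    rng = np.random.default_rng(0)
    x, y = rng.uniform(-1, 1, 1000), rng.uniform(-1, 1, 1000)  # independent
    for q in (2, 5, 8, 10, 15, 20):
        print(q, round(binned_mi(x, y, q), 4))  # reported MI grows with Q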
Table E.1: Mutual Information of Equally Binned Uniform Data (1000 Points)

                              Correlation Coefficient
        0.00        0.25        0.50        0.75        0.99        1.00
Q=2     0.000001    0.000001    0.000001    0.000001    0.000001    0.011408
Q=5     0.015014    0.141960    0.298019    0.528048    1.402589    2.006350
Q=8     0.039223    0.297239    0.388209    0.706805    1.839019    2.811652
Q=10    0.049157    0.354696    0.436370    0.752605    2.062620    3.174170
Q=15    0.159509    0.432796    0.567665    0.844415    2.356481    3.807815
Q=20    0.264482    0.518409    0.650064    0.942965    2.558243    4.247510
Table E.2: Spearman Ranking of Equally Binned Uniform Data (1000 Points)

                          Correlation Coefficient
           0.00       0.25       0.50       0.75       0.99       1.00
raw   −0.093548   0.250188   0.510996   0.777258   0.989883   1.000000
Q=2   −0.001001  −0.001001  −0.001001  −0.001001  −0.001001   1.000000
Q=5   −0.094267   0.204093   0.477265   0.711144   0.949343   1.000000
Q=8   −0.072047   0.231281   0.482814   0.762818   0.969068   1.000000
Q=10  −0.085259   0.247671   0.498017   0.769129   0.977148   1.000000
Q=15  −0.088046   0.239236   0.508771   0.774676   0.983600   1.000000
Q=20  −0.095187   0.242767   0.510054   0.772848   0.987102   1.000000
Table E.3: Symmetric Uncertainty of Equally Binned Uniform Data (1000 Points)

                        Correlation Coefficient
          0.00      0.25      0.50      0.75      0.99      1.00
Q=2   0.000127  0.000127  0.000127  0.000127  0.000127  1.000000
Q=5   0.007478  0.077203  0.160231  0.275472  0.699178  1.000000
Q=8   0.013954  0.114255  0.147840  0.263865  0.655911  1.000000
Q=10  0.015484  0.119853  0.146094  0.248715  0.651879  1.000000
Q=15  0.041904  0.121071  0.157596  0.231302  0.620990  1.000000
Q=20  0.062282  0.129200  0.160865  0.230623  0.606215  1.000000
Table E.4: Mutual Information of MME Binned Uniform Data (1000 Points)

                        Correlation Coefficient
          0.00      0.25      0.50      0.75      0.99      1.00
Q=2   0.005096  0.017626  0.116276  0.298526  0.757705  0.999997
Q=5   0.015979  0.175338  0.296970  0.621629  1.575068  2.321921
Q=8   0.035495  0.279476  0.416989  0.703978  1.960117  2.999988
Q=10  0.069933  0.325090  0.458489  0.754228  2.142002  3.321914
Q=15  0.149346  0.446866  0.550683  0.889301  2.475754  3.906105
Q=20  0.285521  0.605000  0.748973  0.986812  2.542354  4.321899
Table E.5: Spearman Ranking of MME Binned Uniform Data (1000 Points)

                          Correlation Coefficient
           0.00       0.25       0.50       0.75       0.99       1.00
raw   −0.093548   0.250188   0.510996   0.777258   0.989883   1.000000
Q=2   −0.084004   0.155997   0.395998   0.619998   0.920000   1.000000
Q=5   −0.088480   0.255892   0.487579   0.767038   0.959497   1.000000
Q=8   −0.086795   0.241088   0.509401   0.769164   0.974470   1.000000
Q=10  −0.089513   0.245647   0.505015   0.772254   0.979996   1.000000
Q=15  −0.093930   0.253125   0.509376   0.779506   0.986811   1.000000
Q=20  −0.092472   0.251797   0.510864   0.778500   0.987335   1.000000
Table E.6: Symmetric Uncertainty of MME Binned Uniform Data (1000 Points)

                        Correlation Coefficient
          0.00      0.25      0.50      0.75      0.99      1.00
Q=2   0.005096  0.017626  0.116277  0.298527  0.757707  1.000000
Q=5   0.006882  0.075514  0.127898  0.267722  0.678347  1.000000
Q=8   0.011832  0.093159  0.138997  0.234660  0.653375  1.000000
Q=10  0.021052  0.097862  0.138020  0.227046  0.644810  1.000000
Q=15  0.038234  0.114402  0.140980  0.227669  0.633817  1.000000
Q=20  0.066064  0.139985  0.173297  0.228328  0.588249  1.000000
[Figure E.1: Raw Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X, one for each covariance value: 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
E.2 Quantization Figures
The remainder of this appendix contains figures showing the distribution of the points used for the calculations in Tables E.1 through E.6. Figure E.1 displays the raw points, while Figures E.2 through E.13 show the effects of transforming the raw points through the various binning strategies.
[Figure E.2: Q=2 Equally Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.3: Q=5 Equally Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.4: Q=8 Equally Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.5: Q=10 Equally Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.6: Q=15 Equally Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.7: Q=20 Equally Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.8: Q=2 MME Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.9: Q=5 MME Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.10: Q=8 MME Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.11: Q=10 MME Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.12: Q=15 MME Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
[Figure E.13: Q=20 MME Binned Uniform Data (1000 Points). Six scatter panels of Feature Y versus Feature X at covariances 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00.]
Bibliography
Abu-Hanna, A. and N. de Keizer. Integrating classification trees with local logistic regression in Intensive
Care prognosis. Artificial Intelligence In Medicine, 29:5–23, 2003.
Adelman, L. Evaluating Decision Support and Expert Systems. Wiley Series in Systems Engineering.
John Wiley & Sons, 1992.
Aha, D. W., D. Kibler and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37–66,
1991.
Anderson, E., Z. Bai et al. LAPACK Users’ Guide. Society for Industrial and Applied Mathematics
(SIAM), 3rd edition, 1999. ISBN 0-89871-447-8. Software Library Available Online.
URL http://www.netlib.org/lapack/
Anzai, Y. Pattern Recognition and Machine Learning. Academic Press Inc., San Diego, 1989.
Arbib, M. A., editor. Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA, June
1995. ISBN 0-262-01148-4.
Barlow, R. E., C. A. Clarotti and F. Spizzichino, editors. Reliability and Decision Making. Chapman &
Hall, London, 1993.
Bayes, R. T. Essay towards solving a problem in the doctrine of chances. Philosophical Transactions of
the Royal Society of London, 53:370–418, 1763.
Bean, C. L., C. Kambhampati and S. Rajasekharan. A rough set solution to a fuzzy set problem. In
FUZZ-IEEE ’02, pages 18–23.
Becker, P. W. Recognition of Patterns: Using the Frequencies of Occurrence of Binary Words. Springer-Verlag, 2nd edition, 1968.
Bellman, R. Adaptive Control Processes: A Guided Tour. Princeton University Press, New Jersey, 1961.
Bennett, N. L., L. L. Casebeer et al. Family physicians’ information seeking behaviours: A survey comparison with other specialties. BMC Medical Informatics and Decision Making, 5(9), 2005.
Berner, E. S., editor. Clinical Decision Support Systems: Theory and Practice. Springer-Verlag, 1988.
ISBN 0-387-98575-1.
Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms. Advanced Applications In
Pattern Recognition. Plenum Press, New York and London, 1981.
Booty, W. G., D. C. L. Lam et al. Great Lakes toxic chemical decision support system. In Denzer, Swayne
and Schimak (1997). IFIP TC5 WG5.11 International Symposium on Environmental Software Systems
(ISESS ’97).
Boyen, X. and L. Wehenkel. Automatic induction of fuzzy decision trees and its application to power
system security assessment. Fuzzy Sets and Systems, 102(1):3–19, 1999. ISSN 0165-0114. doi:http:
//dx.doi.org/10.1016/S0165-0114(98)00198-5.
Brath, R. 3D interactive information visualization: Guidelines from experience and analysis of applications. In 4th International Conference on Human–Computer Interaction. June 1997a.
——. Metrics for effective information visualization. In Information Visualization. IEEE, Phoenix, Oct
1997b. doi:10.1109/INFVIS.1997.636794.
——. Paper landscapes: A visualization design methodology. In R. F. Erbacher, P. C. Chen, J. C. Roberts,
M. T. Groehn and K. Boerner, editors, Visualization and Data Analysis, volume 5009. International
Society for Optical Engineering (SPIE), Jun 2003. ISBN 0-8194-4809-5.
Brath, R. and M. Peters. Dashboard design: Why design is important. Data Mining Review/Data Mining
Direct, October 2004.
Brillman, J. C., T. Burr et al. Modeling emergency department visit patterns for infectious disease complaints: Results and application to disease surveillance. BMC Medical Informatics and Decision Making, 5(4), 2005.
Buchanan, B. G. and E. H. Shortliffe, editors. Rule-Based Expert Systems: The MYCIN Experiments of
the Stanford Heuristic Programming Project. Addison-Wesley, Reading, Massachusetts, 1984.
Camps-Valls, G., M. Martínez-Ramón et al. Robust γ-filter using support vector machines. Neurocomputing, 62:493–499, 2004. doi:10.1016/j.neucom.2004.07.003.
Carpenter, G. A. and S. Grossberg. ART 2: Self-organization of stable category recognition codes for
analog input patterns. Applied Optics, 26(23):4919–4930, 1987a.
——. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37(1):54–115, 1987b.
——. ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition
architectures. Neural Networks, 3(2):129–152, 1990.
Carpenter, G. A., S. Grossberg and J. H. Reynolds. ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4(5):565–588,
1991a.
Carpenter, G. A., S. Grossberg and D. B. Rosen. Fuzzy ART: Fast stable learning and categorization of
analog patterns by an adaptive resonance system. Neural Networks, 4(6):759–771, 1991b.
Catlett, J. Megainduction: Machine Learning on Very Large Databases. Ph.D. thesis, Basser Department
of Computer Science, University of Sydney, Sydney, Australia, 1991.
Chan, K. C. C., A. K. C. Wong and D. K. Y. Chiu. Learning sequential patterns from probabilistic
inductive prediction. IEEE Transactions Systems, Man, Cybernetics, 24(10):1532–1547, October 1994.
Chau, T. Marginal maximum entropy partitioning yields asymptotically consistent probability density
functions. IEEE Transactions on Pattern Analysis & Machine Intelligence, 23(4):414–417, April 2001.
Chau, T. and A. K. C. Wong. Pattern discovery by residual analysis and recursive partitioning. IEEE
Transactions on Knowledge & Data Engineering, 11(6):833–852, Nov-Dec 1999.
Chen, L., N. Tokuda et al. A new scheme for an automatic generation of multi-variable fuzzy systems.
Fuzzy Sets and Systems, 120:323–329, 2001.
Chen, M.-Y. Establishing interpretable fuzzy models from numeric data. In Proceedings of the 4th World
Congress on Intelligent Control and Automation, volume 3, pages 1857–1861. IEEE, Jun 2002.
Chen, M.-Y. and D. A. Linkens. Rule-base self generation and simplification for data-driven fuzzy models.
In FUZZ-IEEE ’01, pages 424–427.
Chiang, I.-J. and J. Y.-j. Hsu. Fuzzy classification on trees for data analysis. Fuzzy Sets and Systems,
130(1):87–99, 2002. ISSN 0165-0114. doi:http://dx.doi.org/10.1016/S0165-0114(01)00212-3.
Ching, J. Y., A. K. C. Wong and K. C. C. Chan. Class-dependent discretization for inductive learning from
continuous and mixed-mode data. IEEE Transactions on Pattern Analysis & Machine Intelligence,
17(7):641–651, 1995.
Chong, A., T. D. Gedeon et al. A histogram-based rule extraction technique for fuzzy systems. In FUZZ-IEEE ’01, pages 638–641.
Coiera, E. Guide to Health Informatics. Arnold/Hodder & Stoughton, UK, 2nd edition, 2003. ISBN
0-340-76425-2.
Colombet, I., T. Dart et al. A computer decision aid for medical prevention: A pilot qualitative study of
the personalized estimate of risks (EsPeR) system. BMC Medical Informatics and Decision Making,
3(13), 2003.
Cordella, L. P., P. Foggia et al. Reliability parameters to improve combination strategies in multi-expert
systems. Pattern Analysis & Applications, 2:205–214, 1999.
Cordón, O., F. Herrera et al. Genetic Fuzzy Systems : Evolutionary Tuning and Learning of Fuzzy Knowledge Bases, chapter 11, pages 375–382. In Cordón, Herrera et al. (2001b), 2001a.
——. Genetic Fuzzy Systems : Evolutionary Tuning and Learning of Fuzzy Knowledge Bases. World
Scientific, Singapore, 2001b. ISBN 981-02-4017-1.
Cortés, U. and M. Sànchez-Marrè, editors. Environmental Decision Support Systems and Artificial Intelligence: Papers from the AAAI Workshop, Report WS-99-07. AAAI Press, 1999. ISBN 1-57735-091-x.
Cortés, U., M. Sànchez-Marrè and F. Wotawa, editors. Workshop on Environmental Decision Support
Systems (EDSS’2003). IJCAI, Acapulco, Mexico, Aug 2003.
Costa, P. J., J. P. Dunyak and M. Mohtashemi. Models, prediction, and estimation of outbreaks of infectious disease. In Proceedings of SoutheastCon, pages 174–178. IEEE : Institute of Electrical and
Electronics Engineers, Inc., April 2005.
Costa Branco, P. and J. A. Dente. Fuzzy systems modeling in practice. Fuzzy Sets and Systems, 121:73–93,
Jul 2001.
Cowan, D. F., editor. Informatics for the Clinical Laboratory: A Practical Guide. Health Informatics
Series. Springer-Verlag, 2003.
Cox, E. The Fuzzy Systems Handbook. Academic Press Professional, Cambridge, MA, 1994.
Cox, R. T. Probability frequency and reasonable expectation. American Journal of Physics, 14:1–13,
1946.
Cristianini, N. and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University
Press, 2000. ISBN 0-521-78019-5.
de Graaf, P. M. A., G. C. van den Eijkel et al. A decision-driven design of a decision support system in
anesthesia. Artificial Intelligence In Medicine, 11:141–153, 1997.
de Tré, G. and R. de Caluwe. Level-2 fuzzy sets and their usefulness in object-oriented database modelling.
Fuzzy Sets and Systems, 140(1):29–49, Nov 2003.
Dempster, A. P. A generalization of Bayesian inference. Journal of the Royal Statistical Society, Series
B, 30:205–247, 1968.
Denzer, R., D. A. Swayne and G. Schimak, editors. Environmental Software Systems, volume 2. Chapman
& Hall, London, New York, 1997. ISBN 0-412-81740-3. IFIP TC5 WG5.11 International Symposium
on Environmental Software Systems (ISESS ’97).
Devadoss, P. R., S. L. Pan and S. Singh. Managing knowledge integration in a national health-care crisis:
Lessons learned from combating SARS in Singapore. IEEE Transactions on Information Technology
in Biomedicine, 9(2):266–275, June 2005.
Dick, S., A. Schencker et al. Re-granulating a fuzzy rulebase. In FUZZ-IEEE ’01, pages 372–375.
Duda, R. O., P. E. Hart and D. G. Stork. Pattern Classification. John Wiley & Sons, 2nd edition, 2001.
ISBN 0-471-05669-3.
Friedland, D. J., editor. Evidence-Based Medicine: A Framework for Clinical Practice. Appleton &
Lange, 1998.
Fukumi, M. and N. Akamatsu. Evolutionary approach to rule generation from neural networks. In Proceedings of the 1999 IEEE International Fuzzy Systems Conference, volume III, pages 1388–1393.
FUZZ-IEEE, Aug 1999.
FUZZ-IEEE ’00. Proceedings of the 9th IEEE International Conference on Fuzzy Systems. IEEE, San
Antonio, USA, May 2000. ISBN 0-7803-5877-5. ISSN 1098-7584.
FUZZ-IEEE ’01. Proceedings of the 10th IEEE International Conference on Fuzzy Systems, FUZZ-IEEE’01. IEEE, Melbourne, Australia, 2001. ISBN 0-7803-7293-X. ISSN 1098-7584.
FUZZ-IEEE ’02. Proceedings of the 11th IEEE International Conference on Fuzzy Systems, FUZZ-IEEE’02. IEEE, Honolulu, Hawaii, 2002. ISBN 0-7803-7279-4. ISSN 1098-7584.
FUZZ-IEEE ’99. Proceedings of the 8th IEEE International Conference on Fuzzy Systems. IEEE, Seoul,
Korea, Aug 1999. ISBN 0-7803-5406-0. ISSN 1098-7584.
Gabrys, B. Learning hybrid neuro-fuzzy classifier models from data: To combine or not to combine?
Fuzzy Sets and Systems, 147(1):39–56, 2004.
Galassi, M., J. Davies et al. GNU Scientific Library Reference Manual. Network Theory, 2nd edition,
March 2005. ISBN 0-954161-73-4. Software Library Available Online.
URL http://www.gnu.org/software/gsl
Gay, A. K. Simulating biological variation in MFAPs: Resultant effects on MUAPs and EMG signals.
Graduate research term report, Department of Systems Design Engineering, University of Waterloo,
Waterloo, Ontario, Canada, Dec 1999. SYDE 642: Simulation.
Gelernter, D. Machine Beauty: Elegance and the Heart of Technology. Basic Books (Perseus), New York,
Jan 1998. ISBN 0-465-04516-2.
Gibb, W. J., D. M. Auslander and J. C. Griffin. Adaptive classification of myocardial electrogram waveforms. IEEE Transactions on Biomedical Engineering, 4(8):804–808, Aug 1994.
Ginsberg, M. L. Essentials of Artificial Intelligence. Morgan Kaufman, San Francisco, 1993. ISBN
1-55860-221-6.
Gokhale, D. V. On joint and conditional entropies. Entropy, 1(2):21–24, 1999.
Goldberg, D. E. Genetic Algorithms in Search, Optimization & Learning. Addison-Wesley, 1989.
Goldberg, D. E. and K. Deb. A comparative analysis of selection schemes used in genetic algorithms. In
G. J. E. Rawlins, editor, Foundations of Genetic Algorithms, pages 69–93. Morgan Kaufman Publishers, 1991.
Grossberg, S. Adaptive pattern classification and universal recoding: I. parallel development and coding
of neural feature detectors. Biological Cybernetics, 23:121–134, 1976.
——. Adaptive resonance theory (ART). In M. A. Arbib, editor, Handbook of Brain Theory and Neural
Networks, pages 79–81. MIT Press, Cambridge, MA, June 1995. ISBN 0-262-01148-4.
Grzymala-Busse, J. and W. Ziarko. Discovery through rough set theory. Communications of the ACM,
42:55–57, 1999.
——. Data mining and rough set theory. Communications of the ACM, 43:108–109, 2000.
Gupta, M. M., R. K. Ragade and R. R. Yager, editors. Advances in Fuzzy Set Theory and Applications.
North-Holland, Oxford, 1979. ISBN 0-444-85372-3.
Gurov, S. I. Reliability estimation of classification algorithms I: Introduction to the problem - point
frequency estimates. Computational Mathematics and Modeling, 15(4):365–376, 2004.
——. Reliability estimation of classification algorithms II: Point Bayesian estimates. Computational Mathematics and Modeling, 16(2):169–178, 2005.
Guthrie, G., D. A. Stacey and D. Calvert. Detection of disease outbreaks in pharmaceutical sales: Neural
networks and threshold algorithms. In IJCNN ’05, pages 3138–3143.
Haberman, S. J. The analysis of residuals in cross-classified tables. Biometrics, 29(1):205–220, Mar
1973.
——. Analysis of Qualitative Data, volume 1, pages 78–79,82–83. Academic Press, Toronto, 1979. ISBN
0-12-312502-2.
Harris, E. K. and J. C. Boyd. Statistical Bases of Reference Values in Laboratory Medicine. Marcel
Dekker, Inc., New York – Basel – Hong Kong, 1995.
Hathaway, R. J. and J. C. Bezdek. Clustering incomplete relational data using the non-Euclidean relational
fuzzy c-means algorithm. Pattern Recognition Letters, 23:151–160, 2002.
Heckerman, D. Probabilistic interpretation for MYCIN’s certainty factors. In L. N. Kanal and J. F.
Lemmer, editors, Uncertainty in Artificial Intelligence, pages 167–196. Elsevier/North-Holland, Amsterdam, London, New York, 1986.
Hertz, J., A. Krogh and R. G. Palmer. Introduction to the Theory of Neural Computation. Santa Fe Institute
Studies in the Sciences of Complexity, 1991.
Heske, T. and J. N. Heske. Fuzzy Logic for Real World Design. Annabooks, San Diego, 1999. ISBN
0-929392-24-8.
Hipel, K. W. and Y. Ben-Haim. Decision making in an uncertain world: Information-gap modeling in
water resources management. IEEE Transactions Systems, Man, Cybernetics C, 29(4):506–517, Nov
1999.
Hipel, K. W., L. Fang and D. M. Kilgour. Game theoretic models in engineering decision making. Journal
of Infrastructure Planning and Management, 470(4):1–16, July 1993.
Hipel, K. W., D. M. Kilgour et al. Strategic decision support for the services industry. IEEE Transactions
on Engineering Management, 48(3):358–369, Aug 2001.
Hipel, K. W., X. Yin and D. M. Kilgour. Can a costly reporting system make environmental enforcement
more efficient? Journal of Stochastic Hydrology and Hydraulics, 9(2):151–170, 1995.
Hisdal, E. Possibilistically dependent variables and a general theory of fuzzy sets. In Gupta, Ragade and
Yager (1979), pages 215–234.
Hoffmann, F. Combined boosting and evolutionary algorithms for learning of fuzzy classification rules.
Fuzzy Sets and Systems, 141:47–58, 2004.
Hong, T.-P. and C.-Y. Lee. Induction of fuzzy rules and membership function from training examples.
Fuzzy Sets and Systems, 84:33–47, Nov 1996.
Höppner, F., F. Klawonn et al. Fuzzy Cluster Analysis. John Wiley & Sons, Chichester, England, 1999.
House, W. C., editor. Decision Support Systems: A Data-Based, Model-Oriented, User-Developed Discipline. Petrocelli, New York/Princeton, 1983. ISBN 0-89433-225-2.
IJCNN ’05. Proceedings of the International Joint Conference on Neural Networks (IJCNN). IEEE/INNS,
Montréal, Québec, July 2005. ISBN 0-7803-9049-0.
Innocent, P. R. Fuzzy symptoms and a decision support index for the early diagnosis of confusable
diseases. In Proceedings of the RASC Conference, pages 1–8. Dept. of Computing Science, De Montfort
University, Leicester, UK, Jul 2000a.
URL http://www.cse.dmu.ac.uk/~pri/rasc.pdf
——. A lightweight fuzzy process to support early diagnosis of confusable diseases. In FUZZ-IEEE ’00,
pages 516–521. doi:10.1109/FUZZY.2000.838713.
Ishibuchi, H. and T. Murata. Learning of fuzzy classification rules by a genetic algorithm. Electronics and
Communications in Japan, Part 3, 80(3):37–46, 1997. Translated from Denshi Joho Tsushin Gakkai
Ronbunshi, Vol 79-A, No. 7, 1996, pp.1289–1297.
Ishibuchi, H., K. Nozaki et al. Construction of fuzzy classification systems with rectangular fuzzy rules
using genetic algorithms. Fuzzy Sets and Systems, 65(2/3):237–253, 1994.
Jain, A. K., M. N. Murty and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–
323, September 1999. ISSN 0360-0300. doi:10.1145/331499.331504.
Jensen, R. and Q. Shen. Fuzzy-rough sets for descriptive dimensionality reduction. In FUZZ-IEEE ’02,
pages 29–34.
Joachims, T. Making large-scale SVM learning practical. Technical Report LS8, Universität Dortmund, 1998.
URL http://www.joachims.org/publications/joachims_98c.pdf
——. SVMlight. 2005.
URL http://svmlight.joachims.org
Kapler, T., R. Harper and W. Wright. Correlating events with tracked movements in time and space: a
GeoTime case study. In Proceedings of the 2005 Intelligence Analysis Conference. Jan 2005.
Karnik, N. N. and J. M. Mendel. Applications of type-2 fuzzy logic systems: Handling the uncertainty
associated with surveys. In FUZZ-IEEE ’99, pages 1546–1551. doi:10.1109/FUZZY.1999.790134.
Karnik, N. N., J. M. Mendel and Q. Liang. Type-2 fuzzy logic systems. IEEE Transactions on Fuzzy
Systems, 7(6):643–658, Dec 1999.
Kaynak, O., K. Jezernik and Á. Szeghegti. Complexity reduction of rule based models: A survey. In
FUZZ-IEEE ’02, pages 1216–1220.
Keller, H. and C. Trendelenburg, editors. Data Presentation: Interpretation. Clinical Biochemistry:
Principles, Methods, Applications: 2. Walter de Gruyter, Berlin – New York, 1989.
Kennedy, J. and R. C. Eberhart. Particle swarm optimization. In Proceedings of IEEE International Conference on Neural Networks, pages 1942–1948. IEEE : Institute of Electrical and Electronics Engineers,
Inc., 1995.
Kilgour, D. M., L. Fang and K. W. Hipel. GMCR in negotiations. Negotiation Journal, pages 151–156,
April 1995.
Kim, M. W., J. G. Lee and C. Min. Efficient fuzzy rule generation based on fuzzy decision tree for data
mining. In Proceedings of the 1999 IEEE International Fuzzy Systems Conference, volume III, pages
1223–1228. FUZZ-IEEE, Aug 1999.
Kirkpatrick, S., C. D. J. Gelatt et al. Optimization by simulated annealing. Science, 220(4298):671–680,
May 1983.
Klir, G. J. Generalized information theory: Emerging crossroads of fuzziness and probability. In NAFIPS
’05.
——. Measuring uncertainty associated with convex sets of probability distributions: A new approach. In
NAFIPS ’05.
Knauf, R., A. J. Gonzalez and T. Abel. A framework for validation of rule-based systems. IEEE Transactions Systems, Man, Cybernetics B, 32(3):281–295, June 2002.
Kononenko, I. Machine learning for medical diagnosis: History, state of the art and perspective. Artificial
Intelligence In Medicine, 23:89–109, 2001.
Kononenko, I. and I. Bratko. Information-based evaluation criterion for classifier’s performance. Machine
Learning, 6:67–80, 1991.
Koza, J. R. Genetic Programming : On the Programming of Computers by Means of Natural Selection.
MIT Press, 1992.
Krishnapuram, R. and J. M. Keller. A possibilistic approach to clustering. IEEE Transactions on Fuzzy
Systems, 1(2):98–110, 1993.
——. The possibilistic c-means algorithm: Insights and recommendations. IEEE Transactions on Fuzzy
Systems, 4(3):385–393, 1996.
Kruse, R., J. E. Gebhardt and F. Klawonn. Foundations of Fuzzy Systems. John Wiley & Sons, New York,
1994.
Kukar, M. Transductive reliability estimation for medical diagnosis. Artificial Intelligence In Medicine,
29:81–106, 2003.
Kukar, M., I. Kononenko et al. Analysing and improving the diagnosis of ischaemic heart disease with
machine learning. Artificial Intelligence In Medicine, 16:25–50, 1999.
Kukolj, D. Design of adaptive Takagi-Sugeno-Kang fuzzy models. Applied Soft Computing, 2:89–103,
2002.
Kuncheva, L. I. How good are fuzzy if-then classifiers? IEEE Transactions Systems, Man, Cybernetics A,
30(4):501–509, Aug 2000.
Labbi, A. and E. Gauthier. Combining fuzzy knowledge and data for neuro-fuzzy modeling. Journal of
Intelligent Systems, 6(4), 1997.
Larsson, J. E., B. Hayes-Roth et al. Evaluation of a medical diagnosis system using simulator test scenarios. Artificial Intelligence In Medicine, 11:119–140, 1997.
Lehmann, E. L. and H. J. M. D’Abrera. Nonparametrics: Statistical Methods Based on Ranks. Pearson
Education/Prentice-Hall, revised edition, 1998. ISBN 0-13997-735-X.
León, L. F., D. C. Lam et al. An environmental impact assessment model for water resources screening. In
Denzer et al. (1997). IFIP TC5 WG5.11 International Symposium on Environmental Software Systems
(ISESS ’97).
Levitin, G. Evaluating correct classification probability for weighted voting classifiers with plurality
voting. European Journal of Operations Research, 141:596–607, 2002.
——. Threshold optimization for weighted voting classifiers. Naval Research Logistics, 50(4):322–344,
2003.
Liang, Q. and J. M. Mendel. An introduction to type-2 TSK fuzzy logic systems. In FUZZ-IEEE ’99,
pages 1534–1539. doi:10.1109/FUZZY.1999.790132.
——. Interval type-2 fuzzy logic systems. In FUZZ-IEEE ’00, pages 328–333. doi:10.1109/FUZZY.
2000.838681.
Lin, T. Y. and N. Cercone, editors. Rough Sets and Data Mining. Kluwer Academic Publishers, 1997.
López de Mántaras, R. A distance-based attribute selection measure for decision tree induction. Machine
Learning, 6:81–92, 1991.
Mahfouf, M. and D. A. Linkens. Rule-base generation via symbiotic evolution for a mamdani-type fuzzy
control system. In FUZZ-IEEE ’01, pages 396–399.
Majowicz, S. and D. A. Stacey. The use of clustering to analyze symptom-based case definitions for acute
gastrointestinal illness. In IJCNN ’05, pages 2429–2434.
Mamdani, E. H. Applications of fuzzy algorithms for simple dynamic plant. In Proceedings of the IEEE,
volume 121, pages 1585–1588. IEEE, 1974.
MathWorld. MathWorld, August 2005.
URL http://mathworld.wolfram.com/
Mendel, J. M. Fuzzy logic systems for engineering: a tutorial. Proceedings of the IEEE, 83(2):345–377,
Mar 1995a.
——. Fuzzy logic systems for engineering: a tutorial – errata. Proceedings of the IEEE, 83, Sep 1995b.
——. Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice-Hall, 2001.
ISBN 0-13-040969-3.
Mendel, J. M. and R. I. B. John. Type-2 fuzzy sets made simple. IEEE Transactions on Fuzzy Systems,
10(2):117–127, April 2002.
Metz, C. E. Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8:283–298, 1978.
Minsky, M. L. and S. A. Papert. Perceptrons : An Introduction to Computational Geometry. MIT Press,
2nd edition, 1988. ISBN 0-262-63111-3.
Mitchell, M. An Introduction to Genetic Algorithms. MIT Press, Cambridge, Massachusetts, 1996.
Mitra, S. and Y. Hayashi. Neuro-fuzzy rule generation: Survey in soft computing framework. IEEE
Transactions on Neural Networks, 11(3):748–768, May 2000.
Montani, S., P. Magni et al. Integrating model-based decision support in a multi-modal reasoning system
for managing type 1 diabetic patients. Artificial Intelligence In Medicine, 29:131–151, 2003.
Moon, T. K. Electrical and Computer Engineering 7680: Information Theory Course Notes. Utah State
University, Sept 2000.
URL http://www.engineering.usu.edu/classes/ece/7680/lecture2.pdf
Morton, S. Management Decision Systems: Computer-Based Support for Decision Making. Harvard
University Press, 1971. ISBN 0-87584-090-6.
Muresan, L., K. T. László and K. Hirota. Similarity in hierarchical fuzzy rule-base systems. In FUZZ-IEEE ’02, pages 746–750.
Murphy, A. K. G. Effective Information Display and Interface Design for Decomposition-Based Quantitative Electromyography. Master’s thesis, University of Waterloo, Systems Design Engineering, 2002.
NAFIPS ’05. Proceedings of the 2005 International Joint Conference of the North American Fuzzy Information Processing Society Biannual Conference. NAFIPS, Jun 2005. ISBN 0-7803-9188-8.
Nauck, D., F. Klawonn and R. Kruse. Foundations of Neuro-Fuzzy Systems. John Wiley & Sons, New
York, 1997.
Newman, D. J., S. Hettich et al. UCI repository of machine learning databases. 1998.
URL http://www.ics.uci.edu/~mlearn/MLRepository.html
Norman, D. A. The Design of Everyday Things. Basic Books (Perseus) or MIT press (UK), New York,
1998/2002. ISBN 0-262-64037-6. Formerly The Psychology of Everyday Things.
O’Carroll, P. W., W. A. Yasnoff et al., editors. Public Health Informatics and Information Systems. Health
Informatics. Springer-Verlag, 2002. ISBN 0387954740.
Øhrn, A. Discernability and Rough Sets in Medicine: Tools and Applications. Ph.D. thesis, Department
of Computer and Information Science, Norwegian University of Science and Technology, Trondheim,
Norway, 1999. Institutt for datateknikk og informasjonvitenskap IDI-rapport 1999:14.
Pal, N. R., J. C. Bezdek and R. J. Hathaway. Sequential competitive learning and the fuzzy c-means
clustering algorithms. Neural Networks, 9(5):787–796, 1996.
Pal, N. R., K. Pal and J. C. Bezdek. A mixed c-means clustering model. In Proceedings of the Sixth IEEE
International Conference on Fuzzy Systems, volume 1, pages 11–21. FUZZ-IEEE, Jul 1997.
Pal, S. K. and S. Mitra. Neuro-Fuzzy Pattern Recognition : Methods in Soft Computing. Wiley Series on
Intelligent Systems. Wiley-Interscience, 1999. ISBN 0-471-34844-9.
Pawlak, Z. Rough sets. International Journal of Computing and Information Sciences, 11(5):341–356,
1982.
——. Rough Sets: Theoretical Aspects of Reasoning about Data. Studies in Fuzziness and Soft Computing. Kluwer Academic, Norwell, MA, 1992. ISBN 0-7923-1472-7.
Pedrycz, W. Fuzzy Sets Engineering. CRC Press, 1995. ISBN 0-8493-9402-3.
Penaloza, M. A. and R. M. Welch. Infectious disease and climate change: A fuzzy database management
system approach. In Remote Sensing - A Scientific Vision for Sustainable Development, volume 4, pages
1950–1952. IGARSS: Geoscience and Remote Sensing, Aug 1997.
Pino, L. Discussion on PD conditional probability derivation from WOE measure. Personal Communication, July 2005.
Polkowski, L. and A. Skowron, editors. Rough Sets in Knowledge Discovery 1: Methodology and Applications, volume 18 of Studies in Fuzziness and Soft Computing. Physica-Verlag, 1998. ISBN 3-7908-1119-X.
Press, W. H., S. A. Teukolsky et al. Numerical Recipes in C : The Art of Scientific Computing. Cambridge
University Press, 2nd edition, 1992. ISBN 0-521-43108-5.
Quinlan, J. R. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
——. C4.5 : Programs for Machine Learning. Morgan Kaufman, 1993.
——. Learning first-order definitions of functions. Journal of Artificial Intelligence Research, 5:139–161,
1996.
Rajabi, S., D. M. Kilgour and K. W. Hipel. Modeling action-interdependence in multiple critera decision
making. European Journal of Operations Research, 110(3):490–508, Nov 1998.
Rumelhart, D. E., G. E. Hinton and R. J. Williams. Learning representations by back-propagating errors.
Nature, 323:533–536, 1986.
Russell, S. and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2003. ISBN 0-13-790395-2.
Sage, A. P. Decision Support Systems Engineering. Wiley Series in Systems Engineering. John Wiley &
Sons, 1991.
Sage, A. P. and W. B. Rouse, editors. Handbook of Systems Engineering and Management, chapter
Operations Research and Refinement of Courses of Action. John Wiley & Sons, Apr 1999. ISBN
0-471-15405-9.
Sayood, K. Introduction to Data Compression. Morgan Kaufmann, 2nd edition, 2000.
Schniederjans, M. Case Studies in Decision Support. Petrocelli, Princeton, 1987.
Scott, A. C., J. E. Clayton and E. L. Gibson. A Practical Guide to Knowledge Acquisition. Addison-Wesley,
1991.
Shafer, G. Perspectives on the theory and practice of belief functions. International Journal of Approximate Reasoning, 3:1–40, 1990.
Shafer, G. A Mathematical Theory of Evidence. Princeton University Press, Princeton, New Jersey, 1976.
Shafer, G. and J. Pearl, editors. Readings in Uncertain Reasoning. Morgan Kaufmann, 1990. ISBN
1-55860-125-2.
Shannon, C. E. A mathematical theory of communication: Part 1. Bell Systems Technical Journal, 27:379–
423, July 1948a. Reprinted in Slepian (1974).
URL http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html
——. A mathematical theory of communication: Part 2. Bell Systems Technical Journal, 27:623–656,
October 1948b. Reprinted in Slepian (1974).
URL http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html
Shen, Q. and A. Chouchoulas. A rough-fuzzy approach for generating classification rules. Pattern Recognition, 35:2425–2438, 2002.
Shortliffe, E. H. Computer-Based Medical Consultations: MYCIN. Elsevier/North-Holland, Amsterdam,
London, New York, 1976.
Shortliffe, E. H. and L. E. Perreault, editors. Medical Informatics : Computer Applications in Health Care
and Biomedicine. Springer-Verlag, 2nd edition, 1990.
Silver, M. S. Systems That Support Decision Makers: Description and Analysis. John Wiley Information
Systems Series. John Wiley & Sons, 1991. ISBN 0-471-91968-3.
Simpson, P. K. Artificial Neural Systems. Windcrest/McGraw-Hill, 1991. ISBN 0-07-105355-7.
Slepian, D., editor. Key Papers in the Development of Information Theory. IEEE Press, New York, 1974.
Spiegel, D. and T. Sudkamp. Sparse data in the evolutionary generation of fuzzy models. Fuzzy Sets and
Systems, 138:363–379, 2003.
Sprague, R. and E. Carlsons. Building Effective Decision Support Systems. Prentice-Hall, 1982. ISBN
0-13-086215-0.
Sugeno, M., M. F. Griffin and A. Bastian. Fuzzy hierarchical control of an unmanned helicopter. In
Proceedings of the 5th IFSA World Congress (IFSA ’93). Seoul, 1993.
Sugeno, M. and G. T. Kang. Structure identification of fuzzy model. Fuzzy Sets and Systems, 28:15–33,
1988.
Sundararajan, C. R. Guide to Reliability Engineering: Data Analysis, Applications, Implementation, and
Management. Van Nostrand Reinhold, New York, 1991.
Syswerda, G. A study of reproduction in generational and steady-state genetic algorithms. In G. J. E.
Rawlins, editor, Foundataions of Genetic Algorithms, pages 94–112. Morgan Kaufman Publishers,
1991.
Takagi, T. and M. Sugeno. Fuzzy identification of systems and its application to modeling. IEEE Transactions Systems, Man, Cybernetics, 15:116–132, 1985.
Tape, T. G. Interpreting diagnostic tests. Technical report, University of Nebraska Medical Center, 2005.
URL http://gim.unmc.edu/dxtests
Toscano, R. and P. Lyonnet. Parameterization of a fuzzy classifier for the diagnosis of an industrial process.
Reliability Engineering & System Safety, 77:269–279, 2002.
Tsumoto, S. Statistical evidence for rough set analysis. In FUZZ-IEEE ’02, pages 757–762.
Tufte, E. R. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 2nd edition, 1983. ISBN 0-961-39214-2.
——. Envisioning Information. Graphics Press, Cheshire, Connecticut, 1991. ISBN 0-961-39211-8.
——. Visual Explanations: Images and Quantities, Evidence and Narrative. Graphics Press, Cheshire, Connecticut, 1997. ISBN 0-961-39212-6.
Türkşen, I. B. Interval valued fuzzy sets and fuzzy measures. In Proceedings of the First International
Joint Conference of the North American Fuzzy Information Processing Society Biannual Conference,
volume 3, pages 317–321. NAFIPS, Dec 1994.
——. Knowledge representation and approximate reasoning with type II fuzzy sets. In International Joint
Conference of the Fourth IEEE International Conference on Fuzzy Systems and The Second International Fuzzy Engineering Symposium, volume 4, pages 1911–1917. IEEE : Institute of Electrical and
Electronics Engineers, Inc., March 1995a.
——. Linguistics and uncertainty in intelligent systems. In Proceedings of the International Joint Conference of the Fourth IEEE International Conference on Fuzzy Systems and The Second International
Fuzzy Engineering Symposium, volume 3, pages 2297–2304. FUZZ-IEEE, 1995b.
——. Type I and type II fuzzy system modeling. Fuzzy Sets and Systems, 106(1):11–34, 1999. ISSN
0165–0114.
Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, 1995. ISBN 0-262-01148-4.
Wanas, N. M. and M. S. Kamel. Weighted combination of neural network ensembles. In D. B. Fogel,
M. A. El-Sharkawi, X. Yao, G. Greenwood, H. Iba, P. Marrow and M. Shackleton, editors, Proceedings
of the 2002 World Congress on Computational Intelligence WCCI 2002, pages 1748–1752. IEEE Press,
Honolulu, 2002. ISBN 0-7803-7278-6.
Wang, L. X. Fuzzy systems are universal approximators. In Proc. IEEE Int. Conf. on Fuzzy Systems,
pages 1163–1169. San Diego, 1992.
Wang, X. Z., Y. D. Wang et al. A new approach to fuzzy rule generation: Fuzzy extension matrix. Fuzzy
Sets and Systems, 123:291–306, 2001.
Wang, Y. High Order Pattern Discovery and Analysis of Discrete-Valued Data Sets. Ph.D. thesis, University of Waterloo, Systems Design Engineering, 1997.
Wang, Y. and A. K. C. Wong. From association to classification: Inference using weight of evidence.
IEEE Transactions on Knowledge & Data Engineering, 15(3):764–767, May-June 2003.
Waterman, D. A. and F. Hayes-Roth, editors. Pattern-Directed Inference Systems. Academic Press, New
York, 1978.
Wiener, N. Cybernetics: or Control and Communication in the Animal and the Machine. MIT Press,
1948, 1961. ISBN 0-262-73009-X.
——. The Human Use of Human Beings: Cybernetics and Society. Da Capo Series in Science. Da Capo
Press, 1950. ISBN 0-306-80320-8.
Witten, I. H. and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, chapter WEKA : Machine Learning Algorithms in Java. Morgan Kaufman, 2000.
Wong, A. K. C. and D. Ghahraman. A statistical analysis of interdependence in character sequences.
Information Sciences, 8(2):173–188, April 1975.
Wong, A. K. C., T. S. Liu and C. C. Wang. Statistical analysis of residue variability in cytochrome C.
Journal of Molecular Biology, 102(2):287–295, April 1976.
Wong, A. K. C. and Y. Wang. High-order pattern discovery from discrete-valued data. IEEE Transactions
on Knowledge & Data Engineering, 9(6):877–893, Nov-Dec 1997.
——. Pattern discovery: A data driven approach to decision support. IEEE Transactions Systems, Man,
Cybernetics C, 33(1):114–124, Feb 2003.
Wright, W. Multi-dimensional representations — how many dimensions? In New Paradigms in Information Visualization and Manipulation. ACM, Nov 1997.
Wrigley, N. Categorical Data Analysis for Geographers and Environmental Scientists. Longman, 1985.
Xing, H., S. H. Huang and J. Shi. Rapid development of knowledge-based systems via integrated knowledge acquisition. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 17:221–
234, 2003.
Xu, L., J. Neufeld et al. Maximum margin clustering. In L. K. Saul, Y. Weiss and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1537–1544. Neural Information Processing
Systems Foundation NIPS, MIT Press, Cambridge, MA, 2005.
Yager, R. R., M. Fedrizzi and J. Kacprzyk, editors. Advances in the Dempster-Shafer Theory of Evidence.
Wiley, 1994.
Yager, R. R. and D. P. Filev. Relational partitioning of fuzzy rules. Fuzzy Sets and Systems, 80:57–69,
1996.
Yager, R. R., S. Ovchinnikov et al., editors. Fuzzy Sets and Applications: Selected Papers by L.A. Zadeh.
John Wiley & Sons, 1987.
Young, W. J., D. C. L. Lam et al. Development of an environmental flows decision support system. In
Denzer et al. (1997). IFIP TC5 WG5.11 International Symposium on Environmental Software Systems
(ISESS ’97).
Yurdusev, M. An environmental impact assessment model for water resources screening. In Denzer et al.
(1997). IFIP TC5 WG5.11 International Symposium on Environmental Software Systems (ISESS ’97).
Zadeh, L. Reviews of books: A mathematical theory of evidence. AI Magazine, 5(3), Fall 1984.
Zadeh, L. A. Fuzzy sets. Information and Control, 8:338–353, 1965. Reprinted in Yager et al. (1987).
——. Probability measures of fuzzy events. Journal of Mathematical Analysis and Applications, 23:421–427, 1968. Reprinted in Yager et al. (1987).
——. Outline of a new approach to the analysis of complex systems and decision processes. IEEE
Transactions Systems, Man, Cybernetics, 3:28–44, 1973. Reprinted in Yager et al. (1987).
——. The concept of a linguistic variable and its application to approximate reasoning – part 1. Information Sciences, 8:199–249, 1975a. Reprinted in Yager et al. (1987).
——. The concept of a linguistic variable and its application to approximate reasoning – part 2. Information Sciences, 8:301–357, 1975b. Reprinted in Yager et al. (1987).
——. The concept of a linguistic variable and its application to approximate reasoning – part 3. Information Sciences, 9:43–80, 1975c. Reprinted in Yager et al. (1987).
——. A fuzzy-algorithmic approach to the definition of complex or imprecise concepts. International
Journal of Man-Machine Studies, 8:249–291, 1976. Reprinted in Yager et al. (1987).
——. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1:3–28, 1978. Reprinted
in Yager et al. (1987).
——. The role of fuzzy logic in the management of uncertainty in expert systems. In J. Hayes, D. Michie
and L. I. Mikulich, editors, Machine Intelligence, volume 9, pages 149–194. Halstead Press, New York,
1979. Reprinted in Yager et al. (1987).
——. The role of fuzzy logic in the management of uncertainty in expert systems. Fuzzy Sets and Systems,
11:199–227, 1983. Reprinted in Yager et al. (1987).
Zhang, X., R. Fiedler and M. Popovich. A biointelligence system for identifying potential disease outbreaks. IEEE Engineering in Medicine & Biology Magazine, 23(1):58–64, Jan–Feb 2004.
Ziarko, W. Decision making with probabilistic decision tables. In Proceedings of the 7th International Workshop on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, pages 463–471. RSFDGrC’99, Yamaguchi, Japan, 1999.
——. Optimal decision making with data-acquired decision tables. In Proceedings of the International
Symposium on Intelligent Information Systems, pages 75–85. IIS, Bystra, Poland, 2000.
——. Probabilistic decision tables in the variable precision rough set model. Computational Intelligence,
17(3):593–603, 2001.
——. Acquisition of hierarchy-structured probabilistic decision tables and rules from data. In FUZZ-IEEE ’02, pages 779–784.
Index

adjusted residual, 35, 36, 38, 52, 53, 82
adjusted residual (definition), 35
back-propagation artificial neural network (BP), 60, 62, 63, 65–67, 69, 70, 72–76, 78, 79, 85, 86, 89–91, 94, 95, 153, 155
black box, 3, 6, 9, 85
cell-boundary vagueness, 45
certainty factor, 32, 49
confidence, 1, 2, 5–7, 9, 24, 28, 31, 36, 47, 73, 75, 97, 98, 101–105, 107, 109, 113, 117, 121, 125, 126, 128, 129, 131, 138, 141, 142, 145, 149–154, 160, 163, 166, 171, 172, 174, 180
Confidence: [0 . . . 1] bounded Normalized τ, 102, 104, 108, 117–120, 125, 160, 175–178
Confidence: δ/τ Observed Probability, 102, 103, 108, 113–117, 125, 126, 130, 131, 141, 156, 160, 175–178
Confidence: MICD, 102, 108–113, 117, 160, 175–178
Confidence: PD/WOE Probabilistic, 102, 104, 105, 108, 121–125, 141, 175–178
conflict, 9, 73, 85, 97, 99, 102, 128, 129, 131, 133, 135, 141, 145, 146, 149
contingency table, 7, 12, 15, 16, 25, 26, 70, 82, 93
cost of quantization, 7, 16, 44, 45, 75, 83, 84, 86, 155
crisp, 7
crisp membership functions, 47
crisp quantization, 7, 44, 45, 81, 83–85, 155
crisp set, 28, 78, 80, 81, 83–85, 143
decision support (system), DSS, 1–6, 9, 11, 14, 16, 19, 24, 38, 44, 69, 75, 86, 95–98, 126, 128, 131, 135, 143, 149, 150, 152–154, 157
decision tree, 23, 24
event, 34, 36, 38, 41, 42, 51, 70
FIS(Occurrence/All), 79, 81, 83–86, 90, 91, 94, 95, 98, 99, 145, 153–155
FIS(Occurrence/All,Crisp), 83
FIS(WOE/Ind), 84, 90, 94
hyper-cell, 12, 13, 16, 38, 70, 73, 75, 85
infinite weight rule, 48
marginal maximum entropy (MME), 7, 12, 14, 28, 39–42, 44, 45, 47, 63, 75, 82, 84, 92–94, 109, 125, 129, 134, 155, 156, 161, 174, 179, 180
mean inter-class distance (MICD), 60, 62, 63, 65–67, 70–72, 74, 75, 78, 79, 89, 90, 94, 95, 102, 103, 109, 113, 125, 155, 159–161, 171
occurrence based weighting (definition), 51
PD(Occurrence/All), 63–66, 68, 69, 71–74, 78, 80, 84, 155
PD(WOE/Indep), 63, 65, 69, 71, 72, 74, 78, 80, 84, 90
primary event, 34, 36, 41
probability of success, 98, 102, 103
quantization, 7, 8, 14–16, 28, 38–42, 44, 45, 47, 60, 63, 70, 72, 73, 75, 81–86, 94, 106, 125, 155, 174, 179, 180, 211
ramp, 45, 47, 81, 83–85, 134, 143, 155
Receiver Operating Characteristic (Curve) (ROC), 5, 28–31, 98
reliability, 2, 5, 7, 9, 11, 28, 29, 31–33, 97, 98, 101–103, 109, 113, 125, 126, 146, 154, 172
rule, infinite weight, 48
Spearman rank correlation, 106–108, 113, 117, 172, 174, 179, 180, 182, 183
standardized residual, 36
standardized residual (definition), 35
sub-event, 34
universe of discourse (UOD), 133, 134
vagueness, 40, 45
vagueness, cell-boundary, 45
weight of evidence (WOE), 37, 38, 42, 45, 48–50, 63, 72, 74, 76, 83, 93, 104, 105, 125, 141, 155, 163
weight of evidence (WOE) (definition), 37
weighted performance summary measure (definition), 61
white box, 6, 9, 153
whitening transform, 60, 102, 171