Nahla Ben Amor · Benjamin Quost ·
Martin Theobald (Eds.)
LNAI 11940
Scalable Uncertainty
Management
13th International Conference, SUM 2019
Compiègne, France, December 16–18, 2019
Proceedings
Lecture Notes in Artificial Intelligence 11940
Series Editors
Randy Goebel
University of Alberta, Edmonton, Canada
Yuzuru Tanaka
Hokkaido University, Sapporo, Japan
Wolfgang Wahlster
DFKI and Saarland University, Saarbrücken, Germany
Founding Editor
Jörg Siekmann
DFKI and Saarland University, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/1244
Editors
Nahla Ben Amor
Institut Supérieur de Gestion de Tunis, Bouchoucha, Tunisia
Benjamin Quost
University of Technology of Compiègne, Compiègne, France
Martin Theobald
University of Luxembourg, Esch-sur-Alzette, Luxembourg
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
These are the proceedings of the 13th International Conference on Scalable Uncertainty
Management (SUM 2019) held during December 16–18, 2019, in Compiègne, France.
The SUM conferences are annual events which gather researchers interested in the
management of imperfect information from a wide range of fields, such as artificial
intelligence, databases, information retrieval, machine learning, and risk analysis, and
with the aim of fostering the collaboration and cross-fertilization of ideas from different
communities.
The first SUM conference was held in Washington DC in 2007. Since then, the
SUM conferences have successively taken place in Napoli in 2008, Washington DC in
2009, Toulouse in 2010, Dayton in 2011, Marburg in 2012, Washington DC in 2013,
Oxford in 2014, Québec in 2015, Nice in 2016, Granada in 2017, and Milano in 2018.
The 25 full papers, 4 short papers, 4 tutorial papers, and 2 invited keynote papers
gathered in this volume were selected from a total of 44 submissions (5 of which were
desk-rejected or withdrawn by the authors), after a rigorous peer-review process
involving at least 3 Program Committee members per submission. In addition to the
regular presentations, the technical
program of SUM 2019 also included invited lectures by three outstanding researchers:
Cassio P. de Campos (Eindhoven University of Technology, The Netherlands) on
“Scalable Reliable Machine Learning Using Sum-Product Networks,” Jérôme Lang
(CNRS, Paris, France) on “Computational Social Choice,” and Wolfgang Gatterbauer
(Northeastern University, Boston, USA) on “Algebraic Approximations of the
Probability of Boolean Functions.”
A distinctive feature of the SUM conferences is the large space they dedicate to
invited tutorials on a wide range of topics related to uncertainty management, in
keeping with their aim of facilitating interdisciplinary collaboration and
cross-fertilization of ideas. This edition includes five tutorials; we thank
Christophe Gonzales, Thierry Denœux, Marie-Jeanne Lesot, Maximilian Schleich, and
the Kay R. Amel working group for preparing and presenting them (four of these
tutorials have a companion paper included in this volume).
We would like to thank all of the authors, invited speakers, and tutorial speakers for
their valuable contributions. We are particularly grateful to the members of the
Program Committee and to the external reviewers for their constructive comments on
the submissions. We also extend our appreciation to all participants of SUM 2019
for their contribution to the success of the conference. We are grateful to the
Steering Committee for their suggestions and support, and to the Organization
Committee for the great work accomplished. We are also very grateful to the
Université de Technologie de
MS2T laboratory of excellence for their financial and technical support, and to Springer
for sponsoring the Best Paper Award as well as for the ongoing support of its staff in
publishing this volume.
General Chair
Benjamin Quost Université de Technologie de Compiègne, France
Steering Committee
Didier Dubois IRIT-CNRS, France
Lluis Godo IIIA-CSIC, Spain
Eyke Hüllermeier Universität Paderborn, Germany
Anthony Hunter University College London, UK
Henri Prade IRIT-CNRS, France
Steven Schockaert Cardiff University, UK
V. S. Subrahmanian University of Maryland, USA
Program Committee
Nahla Ben Amor (PC Chair) Institut Supérieur de Gestion de Tunis and LARODEC, Tunisia
Martin Theobald (PC Chair) University of Luxembourg, Luxembourg
Sébastien Destercke CNRS, Heudiasyc, France
Henri Prade CNRS-IRIT, France
John Grant Towson University, USA
Leila Amgoud CNRS-IRIT, France
Benjamin Quost Université de Technologie de Compiègne, Heudiasyc, France
Thomas Lukasiewicz University of Oxford, UK
Pierre Senellart DI, École Normale Supérieure, Université PSL, France
Francesco Parisi DIMES, University of Calabria, Italy
Davide Ciucci Università di Milano-Bicocca, Italy
Fernando Bobillo University of Zaragoza, Spain
Salem Benferhat UMR CNRS 8188, Université d’Artois, France
Silviu Maniu Université Paris-Sud, France
Organization Committee
Yonatan Carlos Carranza Alarcon Université de Technologie de Compiègne, France
Sébastien Destercke CNRS, Université de Technologie de Compiègne, France
Marie-Hélène Masson Université de Picardie Jules Verne, France
Benjamin Quost (General Chair) Université de Technologie de Compiègne, France
David Savourey Université de Technologie de Compiègne, France
Contents
Inconsistency Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Matthias Thimm
Invited Keynotes
An Experimental Study on the Behaviour of Inconsistency Measures
Matthias Thimm
1 Introduction
An inconsistency measure I is a function that assigns to a knowledge base K
(usually assumed to be formalised in propositional logic) a non-negative real
value I(K) such that I(K) = 0 iff K is consistent and larger values of I(K)
indicate “larger” inconsistency in K [3,5,12]. Thus, each inconsistency measure
I formalises a notion of a degree of inconsistency, and many different concrete
approaches have been proposed so far; see [11–13] for some surveys. The quest
for the “right” way to measure inconsistency is still ongoing and many (usually
controversial) rationality postulates to describe the desirable behaviour of an
inconsistency measure have been proposed so far [2,12].
Our study aims at providing a new perspective on the analysis of existing
approaches to inconsistency measurement by experimentally analysing the
behaviour of inconsistency measures. More precisely, our study provides a
quantitative analysis of two aspects of inconsistency measures:
A1 the distribution of inconsistency values on actual knowledge bases, and
A2 the correlation of different inconsistency measures.
Regarding the first item, [11] investigated the theoretical expressivity of
inconsistency measures, i. e., the number of different inconsistency values a measure
attains when some dimension of the knowledge base is bounded (such as the
number of formulas or the size of the signature). One result of [11] is that, e. g.,
the measure Idalal^Σ (see Sect. 3) has maximal expressivity and the number of
different inconsistency values is not bounded if only one of these two dimensions
is bounded. However, [11] does not investigate the distribution of inconsistency
is bounded. However, [11] does not investigate the distribution of inconsistency
values. It may be the case that, although a measure can attain many different
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 1–8, 2019.
https://doi.org/10.1007/978-3-030-35514-2_1
values, most inconsistent knowledge bases are clustered on very few inconsistency
values. Regarding the second item, previous works have shown—see [12] for an
overview—that all inconsistency measures developed so far are “essentially”
different. More precisely, for each pair of measures one can find a property that is
satisfied by one measure but not by the other. Moreover, for each pair of
inconsistency measures one can find knowledge bases that are ordered differently wrt.
their inconsistency. However, until now it has not been investigated how
“significant” the difference between measures actually is. It may be the case that
two measures order all but a very few knowledge bases differently (or the
other way around). In order to analyse these two aspects we applied 19 different
inconsistency measures from the literature on artificially generated knowledge
bases and performed a statistical analysis on the results. After a brief review of
necessary preliminaries in Sect. 2 and the considered inconsistency measures in
Sect. 3, we provide some details on our experiments and our findings in Sect. 4
and conclude in Sect. 5.
2 Preliminaries
Let At be some fixed propositional signature, i. e., a (possibly infinite) set of
propositions, and let L(At) be the corresponding propositional language
constructed using the usual connectives ∧ (conjunction), ∨ (disjunction), and ¬ (negation).
Definition 1. A knowledge base K is a finite set of formulas K ⊆ L(At). Let K
be the set of all knowledge bases.
If X is a formula or a set of formulas we write At(X) to denote the set of
propositions appearing in X. Semantics to a propositional language is given by
interpretations and an interpretation ω on At is a function ω : At → {true, false}.
Let Ω(At) denote the set of all interpretations for At. An interpretation ω satisfies
(or is a model of) an atom a ∈ At, denoted by ω |= a, if and only if ω(a) = true.
The satisfaction relation |= is extended to formulas in the usual way.
For Φ ⊆ L(At) we also define ω |= Φ if and only if ω |= φ for every φ ∈ Φ.
Define furthermore the set of models Mod(X) = {ω ∈ Ω(At) | ω |= X} for every
formula or set of formulas X. By abusing notation, a formula or set of formulas
X1 entails another formula or set of formulas X2 , denoted by X1 |= X2 , if
Mod(X1 ) ⊆ Mod(X2 ). Two formulas or sets of formulas X1 , X2 are equivalent,
denoted by X1 ≡ X2 , if Mod(X1 ) = Mod(X2 ). If Mod(X) = ∅ we also write
X |=⊥ and say that X is inconsistent.
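The semantics just defined can be made executable in a small sketch (illustrative code, not part of the paper; names are ours): interpretations are enumerated over a finite signature, Mod(X) is computed by filtering, and consistency is the test Mod(X) ≠ ∅.

```python
# Minimal sketch of Sect. 2: interpretations as dicts atom -> bool,
# Mod(X) by filtering, consistency as Mod(X) != {}.
from itertools import product

def interpretations(atoms):
    """All functions omega: atoms -> {True, False}."""
    for values in product([True, False], repeat=len(atoms)):
        yield dict(zip(atoms, values))

def models(formulas, atoms):
    """Mod(X): interpretations satisfying every formula in X."""
    return [w for w in interpretations(atoms)
            if all(f(w) for f in formulas)]

def consistent(formulas, atoms):
    return bool(models(formulas, atoms))

# K = {a ∧ b, ¬a} is inconsistent; K' = {a ∨ b} is consistent.
K = [lambda w: w["a"] and w["b"], lambda w: not w["a"]]
K2 = [lambda w: w["a"] or w["b"]]
print(consistent(K, ["a", "b"]))   # False
print(consistent(K2, ["a", "b"]))  # True
```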
3 Inconsistency Measures
Let R^∞_≥0 be the set of non-negative real values including ∞. Inconsistency
measures are functions I : K → R^∞_≥0 that aim at assessing the severity of the
inconsistency in a knowledge base K. The basic idea is that the larger the incon-
sistency in K the larger the value I(K). We refer to [11–13] for surveys.
4 Experiments
In the following, we give some details on our experiments, the evaluation
methodology, and our findings.
Table 1. Entropy values of the investigated measures wrt. K̂⊥ (rounded to two decimals and sorted by increasing entropy).

I        | Id   | ICC  | Idalal^hit | Ic   | Imc  | Iforget | IMI  | Iis  | Idalal^max | ICSP
HK̂⊥(I)  | 0    | 0.08 | 0.09       | 0.12 | 0.13 | 0.18    | 0.24 | 0.28 | 0.29       | 0.29

I        | Ihs  | Iη   | Idalal^Σ   | IMIC | Imv  | Imcsc   | Ip   | Inc  | IDf
HK̂⊥(I)  | 0.29 | 0.33 | 0.36       | 0.37 | 0.45 | 0.48    | 0.51 | 0.52 | 0.78
Table 2. Correlation values CK̂⊥(I1, I2) of the investigated measures (upper triangular part; columns appear in the same order as the rows, each row starting at its diagonal entry).

Id          1 0.69 0.44 0.5 0.86 0.87 0.35 0.52 0.47 0.52 0.9 0.22 0.48 0.33 0.37 0.68 0.76 0.92 0.67
IMI         1 0.54 0.37 0.72 0.74 0.65 0.38 0.41 0.38 0.76 0.28 0.41 0.47 0.52 0.99 0.7 0.75 0.99
IMIC        1 0.72 0.47 0.51 0.53 0.7 0.73 0.7 0.52 0.49 0.41 0.43 0.84 0.55 0.51 0.5 0.55
Iη          1 0.47 0.48 0.36 0.98 0.93 0.98 0.49 0.53 0.39 0.33 0.84 0.37 0.48 0.5 0.37
Ic          1 0.85 0.4 0.49 0.53 0.49 0.88 0.25 0.45 0.38 0.42 0.72 0.88 0.87 0.72
Imc         1 0.45 0.48 0.48 0.48 0.95 0.26 0.45 0.39 0.39 0.75 0.8 0.94 0.75
Ip          1 0.36 0.39 0.36 0.43 0.25 0.32 0.43 0.5 0.64 0.42 0.41 0.64
Ihs         1 0.95 0.99 0.51 0.52 0.4 0.32 0.85 0.38 0.5 0.52 0.38
Idalal^Σ    1 0.95 0.51 0.53 0.4 0.34 0.89 0.42 0.54 0.5 0.42
Idalal^max  1 0.5 0.52 0.4 0.32 0.85 0.38 0.5 0.52 0.38
Idalal^hit  1 0.26 0.46 0.4 0.41 0.77 0.85 0.98 0.77
IDf         1 0.53 0.19 0.56 0.29 0.29 0.26 0.29
Imv         1 0.25 0.39 0.41 0.43 0.46 0.41
Inc         1 0.39 0.47 0.4 0.39 0.47
Imcsc       1 0.53 0.44 0.4 0.53
ICSP        1 0.71 0.76 0.99
Iforget     1 0.82 0.71
ICC         1 0.76
Iis         1
Let 1A be the indicator function, defined as 1A = 1 iff A is true and
1A = 0 otherwise.

Definition 3. Let K be a set of knowledge bases and I1, I2 be two inconsistency
measures. The correlation coefficient CK(I1, I2) of I1 and I2 wrt. K is defined
via

    CK(I1, I2) = ( Σ_{K,K′∈K, K≠K′} 1_{I1 ∼K,K′ I2} ) / ( |K| (|K| − 1) )

In other words, CK(I1, I2) gives the ratio of how much I1 and I2 agree on
the inconsistency order of any pair of knowledge bases from K (see footnote 4).
Observe that CK(I1, I2) = CK(I2, I1).

4. Note that CK is equivalent to Kendall's tau coefficient [8] but scaled onto [0, 1].
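Definition 3 translates directly into code. The sketch below assumes that I1 ∼K,K′ I2 holds when the signs of I1(K) − I1(K′) and I2(K) − I2(K′) coincide (ties counting as agreement), i.e. when the two measures order K and K′ the same way; the function names are ours.

```python
# Agreement ratio C_K(I1, I2) over all ordered pairs of knowledge bases,
# given the measures' values on a common list of knowledge bases.
def sign(x):
    return (x > 0) - (x < 0)

def correlation(values1, values2):
    """values1[i], values2[i]: inconsistency values of two measures on
    the i-th knowledge base. Returns the fraction of ordered pairs
    (i, j), i != j, on which both measures induce the same order."""
    n = len(values1)
    agree = sum(sign(values1[i] - values1[j]) == sign(values2[i] - values2[j])
                for i in range(n) for j in range(n) if i != j)
    return agree / (n * (n - 1))

# Identical rankings agree on every pair:
print(correlation([0, 1, 2], [0, 10, 20]))  # 1.0
# Reversed rankings agree on no pair:
print(correlation([0, 1, 2], [2, 1, 0]))    # 0.0
```

The result is symmetric in its two arguments, matching CK(I1, I2) = CK(I2, I1).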
4.3 Results
Tables 1 and 2 show the results of analysing the considered measures on K̂⊥ wrt.
the two evaluation measures from before (see footnote 5).
Regarding A1, it can be seen that Id has minimal entropy (by definition).
However, the measures Idalal^hit and ICC, and to some extent most of the other
measures, are quite indifferent in assigning their values. For example, out of 61086
inconsistent knowledge bases, ICC assigns to 58523 of them the same value 1.
On the other hand, measure IDf has maximal entropy among the considered
measures.
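The entropy HK̂⊥(I) itself is not defined in this excerpt; a plausible reading, assumed here, is the Shannon entropy of the empirical distribution of inconsistency values, normalised by log n so that a measure assigning one value to all bases yields 0 (as for Id on the inconsistent bases) and pairwise-distinct values yield 1.

```python
# Normalised entropy of a measure's value distribution (an assumed
# reconstruction of H, not the paper's definition).
from collections import Counter
from math import log

def normalised_entropy(values):
    """Shannon entropy of the value distribution, divided by log(n),
    the entropy of n pairwise-distinct values."""
    n = len(values)
    return sum((c / n) * log(n / c) for c in Counter(values).values()) / log(n)

print(normalised_entropy([1, 1, 1, 1]))  # 0.0 (drastic-style measure)
print(normalised_entropy([1, 2]))        # 1.0 (all values distinct)
```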
Regarding A2, we can observe some surprising correlations between measures,
even between those based on different concepts. For example, we have
CK̂⊥(Idalal^max, Ihs) ≈ 0.99, indicating a high correlation between Idalal^max
and Ihs, although Idalal^max is defined using distances and Ihs is defined using
hitting sets. Equally high correlations can be observed between the three measures
IMI, ICSP, and Iis. Further high correlations (e. g. above 0.8) can be observed
between many other measures. On the other hand, the measure IDf has (on average)
the smallest correlation to all other measures, backing up the observation from before.
5 Conclusion
Our experimental analysis showed that many existing measures have low entropy
on the distribution of inconsistency values and correlate significantly in their
ranking of inconsistent knowledge bases. A web application for trying out all
the discussed inconsistency measures can be found on the website of TweetyProject
(see footnote 6), cf. [10]. Most of these measures have been implemented using naive
algorithms, and research on the algorithmic issues of inconsistency measurement
remains desirable future work, see also [13].
References
1. Ammoura, M., Raddaoui, B., Salhi, Y., Oukacha, B.: On measuring inconsistency
using maximal consistent sets. In: Destercke, S., Denoeux, T. (eds.) ECSQARU
2015. LNCS (LNAI), vol. 9161, pp. 267–276. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20807-7_24
2. Besnard, P.: Revisiting postulates for inconsistency measures. In: Fermé, E., Leite,
J. (eds.) JELIA 2014. LNCS (LNAI), vol. 8761, pp. 383–396. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11558-0_27
3. Grant, J., Hunter, A.: Measuring inconsistency in knowledge bases. J. Intell. Inf.
Syst. 27, 159–184 (2006)
5. We only considered the inconsistent knowledge bases from K̂ as all measures assign degree 0 to the consistent ones anyway.
6. http://tweetyproject.org/w/incmes/.
4. Grant, J., Hunter, A.: Distance-based measures of inconsistency. In: van der Gaag,
L.C. (ed.) ECSQARU 2013. LNCS (LNAI), vol. 7958, pp. 230–241. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39091-3_20
5. Hunter, A., Konieczny, S.: Approaches to measuring inconsistent information. In:
Bertossi, L., Hunter, A., Schaub, T. (eds.) Inconsistency Tolerance. LNCS, vol.
3300, pp. 191–236. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-30597-2_7
6. Jabbour, S., Ma, Y., Raddaoui, B.: Inconsistency measurement thanks to MUS
decomposition. In: Scerri, L., Huhns, B. (eds.) Proceedings of the 13th International
Conference on Autonomous Agents and Multiagent Systems (AAMAS 2014), pp.
877–884 (2014)
7. Jabbour, S., Ma, Y., Raddaoui, B., Sais, L., Salhi, Y.: On structure-based inconsis-
tency measures and their computations via closed set packing. In: Proceedings of
the 14th International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2015), pp. 1749–1750 (2015)
8. Kendall, M.: A new measure of rank correlation. Biometrika 30(1–2), 81–89 (1938)
9. Mu, K., Liu, W., Jin, Z., Bell, D.: A syntax-based approach to measuring the degree
of inconsistency for belief bases. Int. J. Approximate Reasoning 52(7), 978–999
(2011)
10. Thimm, M.: Tweety - a comprehensive collection of Java Libraries for logical
aspects of artificial intelligence and knowledge representation. In: Proceedings of
the 14th International Conference on Principles of Knowledge Representation and
Reasoning (KR 2014), pp. 528–537, July 2014
11. Thimm, M.: On the expressivity of inconsistency measures. Artif. Intell. 234, 120–
151 (2016)
12. Thimm, M.: On the compliance of rationality postulates for inconsistency mea-
sures: a more or less complete picture. Künstliche Intell. 31(1), 31–39 (2017)
13. Thimm, M., Wallner, J.P.: Some complexity results on inconsistency measurement.
In: Proceedings of the 15th International Conference on Principles of Knowledge
Representation and Reasoning (KR 2016), pp. 114–123, April 2016
Inconsistency Measurement
Matthias Thimm
1 Introduction
Inconsistency is a ubiquitous phenomenon whenever knowledge (see footnote 1) is compiled in
some formal language. The notion of inconsistency refers (usually) to multiple
pieces of information and represents a conflict between those, i. e., they cannot
hold at the same time. The two statements “It is sunny outside” and “It is not
sunny outside” represent inconsistent information, and in order to draw meaningful
conclusions from a knowledge base containing these statements, this conflict
has to be resolved somehow. In applications such as decision-support systems,
a knowledge base is usually compiled by merging the formalised knowledge of
many different experts. It is unavoidable that different experts contradict each
other and that the merged knowledge base becomes inconsistent. The field of
Knowledge Representation and Reasoning (KR) [7] is the subfield of Artificial
Intelligence (AI) that deals with the issues of logical formalisations of information
and the modelling of rational reasoning behaviour, in particular in light
of inconsistent or uncertain information. One paradigm to deal with inconsistent
information is to abandon classical inference and define new ways of reasoning.
Some examples of such formalisms are, e. g., paraconsistent logics [6],
default logic [34], answer set programming [15], and, more recently, computational
models of argumentation [1]. Moreover, the fields of belief revision [21] and
belief merging [10,28] deal with the particular case of inconsistencies in dynamic
settings.
The field of Inconsistency Measurement—see the seminal work [20] and the
recent book [19]—provides an analytical perspective on the issue of inconsistency.
Its aim is to quantitatively assess the severity of inconsistency in order
1. We use the term knowledge to refer to subjective knowledge or beliefs, i. e., pieces of information that may not necessarily be true in the real world but are only assumed to be true for the agent(s) under consideration.
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 9–23, 2019.
https://doi.org/10.1007/978-3-030-35514-2_2
Again both K3 and K4 are inconsistent, but which one is more inconsistent
than the other? Our reasoning from above cannot be applied here in the same
fashion. The knowledge base K3 contains an apparent contradiction ({a, ¬a})
but also a formula not participating in the inconsistency ({b}). The knowledge
base K4 contains a “hidden” conflict as four formulas are necessary to produce a
contradiction, but all formulas of K4 are participating in this. In this case, it is
not clear how to assess the inconsistency of these knowledge bases and different
measures may order these knowledge bases differently. More generally speaking,
it is not universally agreed upon which so-called rationality postulates should
be satisfied by a reasonable account of inconsistency measurement, see [3,5,41]
for a discussion. Besides concrete approaches to inconsistency measurement the
community has also proposed a series of those rationality postulates in order
to describe general desirable behaviour and the classification of inconsistency
measures by the postulates they satisfy is still one of the most important ways to
evaluate the quality of a measure, even if the set of desirable postulates is not
universally accepted. For example, one of the most popular rationality postulates
is monotony, which states that for any K ⊆ K′, the knowledge base K cannot
be regarded as more inconsistent than K′. The justification for this demand is
that inconsistency cannot be resolved when adding new information but only
increased (see footnote 2). While this is usually regarded as a reasonable demand, there are also
situations where monotony may be seen as counterintuitive, even in monotonic
logics. Consider the next two knowledge bases
2 Preliminaries
Let At be some fixed set of propositions and let L(At) be the corresponding
propositional language constructed using the usual connectives ∧ (conjunction),
∨ (disjunction), → (implication), and ¬ (negation).
Definition 1. A knowledge base K is a finite set of formulas K ⊆ L(At). Let K
be the set of all knowledge bases.
If X is a formula or a set of formulas we write At(X) to denote the set of
propositions appearing in X.
Semantics for a propositional language is given by interpretations where an
interpretation ω on At is a function ω : At → {true, false}. Let Ω(At) denote
the set of all interpretations for At. An interpretation ω satisfies (or is a model
of) a proposition a ∈ At, denoted by ω |= a, if and only if ω(a) = true. The
satisfaction relation |= is extended to formulas in the usual way.
2. At least in monotonic logics; for a discussion about inconsistency measurement in non-monotonic logics see [9,43] and Sect. 5.3.
3 Measuring Inconsistency
Let R^∞_≥0 be the set of non-negative real values including infinity. The most general
form of an inconsistency measure is as follows.
Definition 2. An inconsistency measure I is any function I : K → R^∞_≥0.
The above definition is, of course, under-constrained for the purpose of providing
a quantitative means to measure inconsistency. The intuition we intend to
be behind any concrete approach to inconsistency measurement I is that a larger
value I(K) for a knowledge base K indicates more severe inconsistency in K
than lower values. Moreover, we wish to reserve the minimal value (0) to indicate
the complete absence of inconsistency. This is captured by the following
postulate [23]:
Consistency: I(K) = 0 iff K is consistent.
Normalisation: 0 ≤ I(K) ≤ 1.
Monotony: If K ⊆ K′ then I(K) ≤ I(K′).
Free-formula independence: If α ∈ Free(K) then I(K) = I(K \ {α}).
Dominance: If α ̸|= ⊥ and α |= β then I(K ∪ {α}) ≥ I(K ∪ {β}).
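The postulates lend themselves to executable checks. The following sketch (illustrative only; the toy knowledge base is ours) verifies Consistency and Monotony by brute force for the baseline drastic measure Id discussed in Sect. 4.

```python
# Checking Consistency and Monotony for the drastic measure I_d
# on a toy signature, by brute force over all subsets.
from itertools import product, combinations

ATOMS = ["a", "b"]

def consistent(formulas):
    """Classical satisfiability by brute force over interpretations."""
    return any(all(f(dict(zip(ATOMS, w))) for f in formulas)
               for w in product([True, False], repeat=len(ATOMS)))

def I_d(formulas):
    """Drastic measure: 0 iff consistent, 1 otherwise."""
    return 0 if consistent(formulas) else 1

FORMULAS = [lambda w: w["a"], lambda w: not w["a"], lambda w: w["b"]]

# Consistency: I_d(K) = 0 iff K is consistent.
assert I_d([]) == 0 and I_d(FORMULAS) == 1

# Monotony: K subset of K' implies I_d(K) <= I_d(K').
subsets = [list(s) for r in range(len(FORMULAS) + 1)
           for s in combinations(FORMULAS, r)]
assert all(I_d(K) <= I_d(K2)
           for K in subsets for K2 in subsets
           if all(f in K2 for f in K))
print("Consistency and Monotony hold for I_d on this toy base")
```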
K7 = {a ∧ c, b ∧ ¬c, ¬a ∨ ¬b}
4 Approaches
There is a wide variety of inconsistency measures in the literature; the work [41]
alone lists 22 measures in 2018, and more have been proposed since then (see footnote 3). In this
paper we consider only a few to illustrate the main concepts.
The measure Id is usually referred to as a baseline for inconsistency measures
as it only distinguishes between consistent and inconsistent knowledge bases.
3. Implementations of most of these measures can also be found in the Tweety Libraries for Artificial Intelligence [40] and an online interface is available at http://tweetyproject.org/w/incmes.
The drastic inconsistency measure Id is defined as Id(K) = 1 if K |= ⊥ and
Id(K) = 0 otherwise, for K ∈ K.
4. And in this author's opinion also a bit mislabelled.
The MIc-inconsistency measure also takes the sizes of the individual minimal
inconsistent subsets into account. The intuition here is that larger minimal
inconsistent subsets represent less inconsistency (as the conflict is more “hidden”)
and smaller minimal inconsistent subsets represent more inconsistency (as it is
more “apparent”).
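Both MI-based measures can be computed by brute force on small knowledge bases. The sketch below assumes the standard readings IMI(K) = |MI(K)| and IMIC(K) = Σ_{M∈MI(K)} 1/|M|, which match the intuition above; the encoding of formulas as Python predicates is ours.

```python
# Brute-force computation of minimal inconsistent subsets, I_MI and I_MIC.
from itertools import product, combinations

ATOMS = ["a", "b"]

def consistent(formulas):
    return any(all(f(dict(zip(ATOMS, w))) for f in formulas)
               for w in product([True, False], repeat=len(ATOMS)))

def minimal_inconsistent_subsets(K):
    """Index sets of inconsistent subsets with no inconsistent proper subset."""
    mis = []
    for r in range(1, len(K) + 1):
        for M in combinations(range(len(K)), r):
            sub = [K[i] for i in M]
            if not consistent(sub) and not any(set(m) < set(M) for m in mis):
                mis.append(M)
    return mis

def I_MI(K):  return len(minimal_inconsistent_subsets(K))
def I_MIC(K): return sum(1 / len(M) for M in minimal_inconsistent_subsets(K))

# K = {a, ¬a, b, ¬b} has exactly the minimal inconsistent subsets
# {a, ¬a} and {b, ¬b}:
K = [lambda w: w["a"], lambda w: not w["a"],
     lambda w: w["b"], lambda w: not w["b"]]
print(I_MI(K))   # 2
print(I_MIC(K))  # 1.0
```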
Example 2. Consider again knowledge bases K1 and K2 from before defined via
Here we have
Observe that, while IMI and IMIC disagree on the exact values of the inconsistency
in K1 and K2 they do agree on their order (K1 is more inconsistent than K2 ).
This is not generally true, consider
K8 = {a, ¬a}
K9 = {a1, ¬a1 ∨ b1, ¬b1 ∨ c1, ¬c1 ∨ d1, ¬d1 ∨ ¬a1,
a2, ¬a2 ∨ b2, ¬b2 ∨ c2, ¬c2 ∨ d2, ¬d2 ∨ ¬a2}
where K8 is less inconsistent than K9 according to IMI and the other way around
for IMIC .
true and false, respectively. The additional truth value B stands for both and is
meant to represent a conflicting truth value for a proposition. The function υ is
extended to arbitrary formulas as shown in Table 1. An interpretation υ satisfies
a formula α, denoted by υ |=3 α if either υ(α) = T or υ(α) = B. Define υ |=3 K
for a knowledge base K accordingly. Now inconsistency can be measured by
seeking an interpretation υ that assigns B to a minimal number of propositions.
The contension inconsistency measure Ic is then defined as
Ic(K) = min{|{a ∈ At | υ(a) = B}| | υ |=3 K} for K ∈ K.
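A brute-force sketch of the contension idea (illustrative, not the paper's implementation): enumerate all three-valued interpretations, keep those satisfying K, and minimise the number of propositions assigned the conflicting value B.

```python
# Contension measure I_c via brute force over three-valued (Priest)
# interpretations; formulas are syntax trees built from nested tuples.
from itertools import product

T, B, F = 2, 1, 0  # truth order F < B < T; and = min, or = max, not flips T/F

def val(formula, v):
    op = formula[0]
    if op == "atom": return v[formula[1]]
    if op == "not":  return {T: F, B: B, F: T}[val(formula[1], v)]
    if op == "and":  return min(val(formula[1], v), val(formula[2], v))
    if op == "or":   return max(val(formula[1], v), val(formula[2], v))

def I_c(K, atoms):
    best = None
    for values in product([T, B, F], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if all(val(f, v) in (T, B) for f in K):  # v |=3 K
            n_b = sum(1 for x in values if x == B)
            best = n_b if best is None else min(best, n_b)
    return best

# K = {a, ¬a, b}: assigning B to 'a' alone suffices, so I_c(K) = 1.
K = [("atom", "a"), ("not", ("atom", "a")), ("atom", "b")]
print(I_c(K, ["a", "b"]))  # 1
```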
The η-inconsistency measure Iη is defined as
Iη(K) = 1 − max{ξ | ∃P : P(α) ≥ ξ for all α ∈ K} for K ∈ K,
where P ranges over probability functions on Ω(At) and P(α) = Σ_{ω|=α} P(ω).
The measure Iη looks for a probability function P that maximises the minimum
probability of all formulas in K. The larger this probability the less inconsistent
K is assessed (if there is a probability function assigning 1 to all formulas then
K is obviously consistent).
Example 3. Consider again knowledge bases K1 and K2 from before defined via
Here we have
Ic(K1) = 2    Ic(K2) = 1
Iη(K1) = 0.5   Iη(K2) = 1/3
where, in particular, Ic also agrees with IMI (see Example 2). Consider now
where
Ic(K1) = 1    Ic(K2) = 3
Iη(K1) = 0.5   Iη(K2) = 0.5
IMI(K1) = 1    IMI(K2) = 1
There are further ways to define inconsistency measures that do not fall strictly
in one of the two paradigms above. We have a look at some now.
A simple approach to obtain a more proposition-centric measure (like Ic) while
still relying on minimal inconsistent sets is the following measure.
The measure Imv is defined as Imv(K) = |At(MI(K))| / |At(K)| for K ∈ K,
where MI(K) denotes the set of minimal inconsistent subsets of K.
In other words, Imv (K) is the ratio of the number of propositions that appear
in at least one minimal inconsistent set and the number of all propositions.
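Under this reading, Imv can be sketched as follows (illustrative; each formula carries its own atom set so that At(·) is explicit, and minimality is checked by testing all one-smaller subsets).

```python
# I_mv: fraction of propositions occurring in some minimal inconsistent
# subset. Formulas are pairs (predicate, atom set).
from itertools import product, combinations

def consistent(K, atoms):
    return any(all(f(dict(zip(atoms, w))) for f, _ in K)
               for w in product([True, False], repeat=len(atoms)))

def I_mv(K, atoms):
    touched = set()
    for r in range(1, len(K) + 1):
        for M in combinations(K, r):
            if not consistent(list(M), atoms) and \
               all(consistent(list(N), atoms) for N in combinations(M, r - 1)):
                for _, fa in M:        # collect At(M) for each minimal M
                    touched |= fa
    return len(touched) / len(atoms)

# K = {a, ¬a, b}: only 'a' occurs in a minimal inconsistent subset.
K = [(lambda w: w["a"], {"a"}),
     (lambda w: not w["a"], {"a"}),
     (lambda w: w["b"], {"b"})]
print(I_mv(K, ["a", "b"]))  # 0.5
```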
Another approach that makes no use of either minimal inconsistent sets or
non-classical semantics is the following one. A subset H ⊆ Ω(At) is called a
hitting set of K if for every φ ∈ K there is ω ∈ H with ω |= φ.
Definition 11 ([37]). The hitting-set inconsistency measure Ihs : K → R^∞_≥0 is
defined as Ihs(K) = min{|H| | H is a hitting set of K} − 1 for K ∈ K, with min ∅ = ∞.
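A brute-force sketch of Definition 11, assuming the common reading Ihs(K) = min{|H| | H is a hitting set of K} − 1, with value ∞ when no hitting set exists (some formula is unsatisfiable); the encoding is ours.

```python
# Hitting-set measure I_hs: smallest set H of interpretations such that
# every formula of K has a model in H, minus one.
from itertools import product, combinations

def I_hs(K, atoms):
    worlds = [dict(zip(atoms, w))
              for w in product([True, False], repeat=len(atoms))]
    for size in range(1, len(worlds) + 1):
        for H in combinations(worlds, size):
            if all(any(f(w) for w in H) for f in K):  # H hits every formula
                return size - 1
    return float("inf")

# K = {a, ¬a} needs two interpretations, so I_hs(K) = 1;
# a consistent base needs only one, so I_hs = 0.
print(I_hs([lambda w: w["a"], lambda w: not w["a"]], ["a"]))  # 1
print(I_hs([lambda w: w["a"]], ["a"]))                        # 0
```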
Moreover, Grant and Hunter [18] define new families of inconsistency mea-
sures based on distances of classical interpretations to being models of a knowl-
edge base. Besnard [4] counts how many propositions have to be forgotten—i. e.
removed from the underlying signature of the knowledge base—to turn an incon-
sistent knowledge base into a consistent one.
The table below summarises compliance with the postulates (CO = Consistency,
NO = Normalisation, MO = Monotony, IN = Free-formula independence, DO = Dominance):
I CO NO MO IN DO
Id ✓ ✓ ✓ ✓ ✓
IMI ✓ ✗ ✓ ✓ ✗
IMIC ✓ ✗ ✓ ✓ ✗
Ic ✓ ✗ ✓ ✓ ✓
Iη ✓ ✓ ✓ ✓ ✓
Imv ✓ ✓ ✗ ✗ ✗
Ihs ✓ ✗ ✓ ✓ ✓
In [16], first-order logic is considered as the base logic. Allowing for objects and
quantification brings new challenges to measuring inconsistency as one should
distinguish in a more fine-grained manner how much certain formulas contribute
to inconsistency. For example, a formula ∀X : bird(X) → flies(X)—which models
that all birds fly—is probably the culprit of some inconsistency in any knowledge
base. However, depending on how many objects actually satisfy/violate
the implication, the severity of the inconsistency of the overall knowledge base
may differ (compare having a knowledge base with 10 flying birds and 1 non-
flying bird to a knowledge base with 1000 flying birds and 1 non-flying bird).
[16] address this challenge by proposing some new inconsistency measures for
first-order logic.
There are also several works—see e. g. [29,45]—that deal with measuring
inconsistency in ontologies formalised in certain description logics.
References
1. Baroni, P., Gabbay, D., Giacomin, M., van der Torre, L. (eds.): Handbook of Formal
Argumentation. College Publications, London (2018)
2. Bertossi, L.: Repair-based degrees of database inconsistency. In: Balduccini, M.,
Lierler, Y., Woltran, S. (eds.) LPNMR 2019. LNCS, vol. 11481, pp. 195–209. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20528-7_15
3. Besnard, P.: Revisiting postulates for inconsistency measures. In: Fermé, E., Leite,
J. (eds.) JELIA 2014. LNCS (LNAI), vol. 8761, pp. 383–396. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11558-0_27
4. Besnard, P.: Forgetting-based inconsistency measure. In: Schockaert, S., Senellart,
P. (eds.) SUM 2016. LNCS (LNAI), vol. 9858, pp. 331–337. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45856-4_23
5. Besnard, P.: Basic postulates for inconsistency measures. In: Hameurlain, A.,
Küng, J., Wagner, R., Decker, H. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIV. LNCS, vol. 10620, pp. 1–12. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-662-55947-5_1
6. Béziau, J.-Y., Carnielli, W., Gabbay, D. (eds.): Handbook of Paraconsistency.
College Publications, London (2007)
7. Brachman, R.J., Levesque, H.J.: Knowledge Representation and Reasoning.
Morgan Kaufmann Publishers, Massachusetts (2004)
8. Brewka, G., Eiter, T., Truszczynski, M.: Answer set programming at a glance.
Commun. ACM 54(12), 92–103 (2011)
9. Brewka, G., Thimm, M., Ulbricht, M.: Strong inconsistency. Artif. Intell. 267,
78–117 (2019)
10. Cholvy, L., Hunter, A.: Information fusion in logic: a brief overview. In: Gabbay,
D.M., Kruse, R., Nonnengart, A., Ohlbach, H.J. (eds.) ECSQARU/FAPR-1997.
LNCS, vol. 1244, pp. 86–95. Springer, Heidelberg (1997). https://doi.org/10.1007/BFb0035614
11. Cholvy, L., Perrussel, L., Thevenin, J.M.: Using inconsistency measures for esti-
mating reliability. Int. J. Approximate Reasoning 89, 41–57 (2017)
5. See http://tweetyproject.org/w/incmes.
12. De Bona, G., Finger, M., Potyka, N., Thimm, M.: Inconsistency measurement in
probabilistic logic. In: Measuring Inconsistency in Information, College Publica-
tions (2018)
13. De Bona, G., Grant, J., Hunter, A., Konieczny, S.: Towards a unified framework
for syntactic inconsistency measures. In: Proceedings of AAAI 2018 (2018)
14. Decker, H., Misra, S.: Database inconsistency measures and their applications. In:
Damaševičius, R., Mikašytė, V. (eds.) ICIST 2017. CCIS, vol. 756, pp. 254–265. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67642-5_21
15. Gelfond, M., Leone, N.: Logic programming and knowledge representation - the
a-prolog perspective. Artif. Intell. 138(1–2), 3–38 (2002)
16. Grant, J., Hunter, A.: Analysing inconsistent first-order knowledgebases. Artif.
Intell. 172(8–9), 1064–1093 (2008)
17. Grant, J., Hunter, A.: Measuring consistency gain and information loss in stepwise
inconsistency resolution. In: Liu, W. (ed.) ECSQARU 2011. LNCS (LNAI), vol.
6717, pp. 362–373. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22152-1_31
18. Grant, J., Hunter, A.: Analysing inconsistent information using distance-based
measures. Int. J. Approximate Reasoning 89, 3–26 (2017)
19. Grant, J., Martinez, M.V. (eds.): Measuring Inconsistency in Information. College
Publications, London (2018)
20. Grant, J.: Classifications for inconsistent theories. Notre Dame J. Form. Log. 19(3),
435–444 (1978)
21. Hansson, S.O.: A Textbook of Belief Dynamics. Kluwer Academic Publishers,
Dordrecht (2001)
22. Hunter, A., Konieczny, S.: Approaches to measuring inconsistent information. In: Bertossi, L., Hunter, A., Schaub, T. (eds.) Inconsistency Tolerance. LNCS, vol. 3300, pp. 191–236. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-30597-2_7
23. Hunter, A., Konieczny, S.: Shapley inconsistency values. In: Proceedings of KR
2006, pp. 249–259 (2006)
24. Hunter, A., Konieczny, S.: Measuring inconsistency through minimal inconsistent
sets. In: Proceedings of KR 2008, pp. 358–366 (2008)
25. Jabbour, S., Ma, Y., Raddaoui, B.: Inconsistency measurement thanks to MUS
decomposition. In: Proceedings of AAMAS 2014, pp. 877–884 (2014)
26. Jabbour, S.: On inconsistency measuring and resolving. In: Proceedings of ECAI
2016, pp. 1676–1677 (2016)
27. Knight, K.M.: Measuring inconsistency. J. Philos. Log. 31, 77–98 (2001)
28. Konieczny, S., Pino Pérez, R.: On the logic of merging. In: Proceedings of KR 1998
(1998)
29. Ma, Y., Hitzler, P.: Distance-based measures of inconsistency and incoherency for
description logics. In: Proceedings of DL 2010 (2010)
30. Ma, Y., Qi, G., Xiao, G., Hitzler, P., Lin, Z.: An anytime algorithm for computing inconsistency measurement. In: Karagiannis, D., Jin, Z. (eds.) KSEM 2009. LNCS (LNAI), vol. 5914, pp. 29–40. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-10488-6_7
31. Ma, Y., Qi, G., Xiao, G., Hitzler, P., Lin, Z.: Computational complexity and anytime algorithm for inconsistency measurement. Int. J. Softw. Inform. 4(1), 3–21 (2010)
32. McAreavey, K., Liu, W., Miller, P.: Computational approaches to finding and measuring inconsistency in arbitrary knowledge bases. Int. J. Approximate Reasoning 55, 1659–1693 (2014)
Inconsistency Measurement 23
33. Potyka, N., Thimm, M.: Inconsistency-tolerant reasoning over linear probabilistic
knowledge bases. Int. J. Approximate Reasoning 88, 209–236 (2017)
34. Reiter, R.: A logic for default reasoning. Artif. Intell. 13, 81–132 (1980)
35. Thimm, M., Wallner, J.P.: Some complexity results on inconsistency measurement.
In: Proceedings of KR 2016, pp. 114–123 (2016)
36. Thimm, M.: On the expressivity of inconsistency measures. Artif. Intell. 234, 120–
151 (2016)
37. Thimm, M.: Stream-based inconsistency measurement. Int. J. Approximate Reasoning 68, 68–87 (2016)
38. Thimm, M.: Measuring inconsistency with many-valued logics. Int. J. Approximate
Reasoning 86, 1–23 (2017)
39. Thimm, M.: On the compliance of rationality postulates for inconsistency measures: a more or less complete picture. Künstliche Intell. 31(1), 31–39 (2017)
40. Thimm, M.: The tweety library collection for logical aspects of artificial intelligence
and knowledge representation. Künstliche Intell. 31(1), 93–97 (2017)
41. Thimm, M.: On the evaluation of inconsistency measures. In: Measuring Inconsistency in Information. College Publications (2018)
42. Ulbricht, M., Thimm, M., Brewka, G.: Inconsistency measures for disjunctive logic
programs under answer set semantics. In: Measuring Inconsistency in Information.
College Publications (2018)
43. Ulbricht, M., Thimm, M., Brewka, G.: Measuring strong inconsistency. In: Proceedings of AAAI 2018, pp. 1989–1996 (2018)
44. Xiao, G., Ma, Y.: Inconsistency measurement based on variables in minimal unsatisfiable subsets. In: Proceedings of ECAI 2012 (2012)
45. Zhou, L., Huang, H., Qi, G., Ma, Y., Huang, Z., Qu, Y.: Measuring inconsistency
in DL-lite ontologies. In: Proceedings of WI-IAT 2009, pp. 349–356 (2009)
Using Graph Convolutional Networks for Approximate Reasoning with Abstract Argumentation Frameworks: A Feasibility Study
1 Introduction
2 Preliminaries
In the following, we recall basic definitions of abstract argumentation and artificial neural networks.
with neurons as nodes and their connections as edges. For training neural networks, the back-propagation algorithm is used in most cases. Back-propagation is a supervised learning method, meaning that the output corresponding to the current input must be known at all times during training. The goal is to find the most exact mapping of the input vectors to their output vectors. This is realised by adjusting the weights on the edges of the graph; see [16] for details.
In the context of graph theory, Kipf et al. [17] introduce graph convolutional
networks that are able to directly use graphs as input instead of a vector of reals.
More precisely, they introduce a layer-wise propagation rule for neural networks
that operates directly on graphs. It is formulated as follows:
H^(l+1) = σ( D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l) )    (1)
gθ ∗ x = U gθ Uᵀ x.    (2)
Stacking multiple convolutional layers in the form of Eq. (4) (each layer followed by a point-wise non-linearity) leads to a neural network model that can directly process graphs.
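As an illustration of the propagation rule in Eq. (1), the following sketch implements one graph-convolutional layer in NumPy. This is our own minimal example, not the authors' implementation; we assume ReLU as the point-wise non-linearity and obtain Ã = A + I by adding self-loops, as in [17].

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step of Eq. (1): H' = sigma(D~^(-1/2) A~ D~^(-1/2) H W)."""
    A_tilde = A + np.eye(A.shape[0])          # A~ = A + I: add self-loops
    d = A_tilde.sum(axis=1)                   # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D~^(-1/2)
    H_next = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)            # ReLU as the point-wise non-linearity

# Two stacked layers processing a 3-node path graph directly:
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H0 = np.eye(3)                                # identity features (no-feature setting)
rng = np.random.default_rng(0)
H1 = gcn_layer(A, H0, rng.normal(size=(3, 4)))
H2 = gcn_layer(A, H1, rng.normal(size=(4, 2)))
```

Stacking the two calls mirrors the layer composition described above; the weight shapes determine the per-node feature dimensions.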
4 Experimental Evaluation
The framework for graph convolutional networks (GCNs) offered by Kipf et al.
[17], which is realised with the aid of Google’s TensorFlow [1], is designed to
find labels for certain nodes of a given graph and is thus a reasonable starting
point for examining if it is possible to decide whether an argument is credulously
accepted wrt. preferred semantics by the use of neural networks.
2 Note that implementation-wise this is not completely true, as the size of the output vector has to be fixed.
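For intuition about the labels the network is trained to predict: an argument is credulously accepted wrt. preferred semantics exactly when it belongs to some admissible set, so ground truth for tiny frameworks can be computed by brute force. The sketch below is our own illustration (the function name is hypothetical, and the enumeration is exponential, so it is only usable for very small frameworks).

```python
from itertools import combinations

def credulously_accepted(args, attacks, a):
    """Brute-force check whether argument a lies in some preferred extension.

    Credulous acceptance wrt. preferred semantics coincides with membership
    in some admissible set, so enumerating admissible sets suffices.
    attacks is a set of (attacker, target) pairs.
    """
    def conflict_free(S):
        return not any((x, y) in attacks for x in S for y in S)

    def defended(S, x):
        # every attacker y of x must itself be attacked by some z in S
        return all(any((z, y) in attacks for z in S)
                   for (y, t) in attacks if t == x)

    def admissible(S):
        return conflict_free(S) and all(defended(S, x) for x in S)

    return any(a in S and admissible(set(S))
               for r in range(len(args) + 1)
               for S in combinations(args, r))

# a -> b -> c: {a, c} is admissible, so a and c are credulously accepted, b is not
args, attacks = {"a", "b", "c"}, {("a", "b"), ("b", "c")}
```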
Using Graph Convolutional Networks for Abstract Argumentation 29
4.1 Datasets
3 https://sourceforge.net/projects/probo/.
4 https://sourceforge.net/p/afbenchgen/wiki/Home/.
30 I. Kuhlmann and M. Thimm
4.3 Results
When dealing with artificial neural networks, quite a few parameters can influence the outcome of the training process. The following section describes various
experimental results in which the impact of different factors on the quality of
the classification process is examined. Those factors include, for instance, the
size and nature of the training set, the learning rate, and the number of epochs
being used to train the neural network model. Finally, we report on some runtime
comparison with a sound and complete solver.
Feature Matrix. As explained in Sect. 4.2, there are two different types of feature matrix that may be used in the training process. While training with the feature matrix that does not contain any features (henceforth referred to as fm1) always results in an accuracy of 77.0%, training with the matrix that encodes incoming and outgoing attacks as features (henceforth referred to as fm2) offers slightly better results (up to 80.3%). Accuracy is measured by dividing the number of correct predictions by the total number of predictions. The
              fm1                          fm2
Accuracy Yes   Accuracy No   Accuracy Yes   Accuracy No
0.0000         1.0000        0.1499         0.9846
0.0000         1.0000        0.2025         0.9810
0.0000         1.0000        0.2083         0.9803
Table 3. Training results for individual graph types and parameter settings for training. Additional parameters were set as follows: number of epochs: 500, learning rate: 0.001, dropout: 0.05.
accuracy value for class Yes can also be viewed as the recall value, which is calculated by dividing the number of true positives by the sum of true positives and false negatives. Moreover, by calculating the precision (true positives divided by the sum of true positives and false positives), the F1 score can be obtained as follows:
F1 = 2 · (Precision · Recall) / (Precision + Recall)    (5)
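The computation of Eq. (5) from raw confusion counts can be sketched in a few lines (a minimal illustration of ours; the function name is hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall (= class-specific accuracy for Yes), and the F1 of Eq. (5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 30 true positives, 10 false positives, 20 false negatives
p, r, f1 = precision_recall_f1(30, 10, 20)
```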
Moreover, because it seems unusual that multiple different training setups
all return the same value, it is important to also look into the class-specific
accuracies. Table 2 reveals that the network only learned to classify all nodes as
No when trained with fm1. Incorporating fm2 into the training process leads
to an accuracy of class Yes of up to 20.8%. Whereas this result still needs
optimisation, it shows that using fm2 is the more promising approach. In all
following experiments, fm2 is used.
Table 4. Classification results after training with different-sized training sets. Parameter settings: epochs: 500, learning rate: 0.01, dropout: 0.05. However, a difference in training set size might require different settings. For example, a larger dataset might need more epochs to converge than a smaller one.
process. Increasing the number of epochs to 1000 yields exactly the same accuracies for Barabási-Albert, Erdős-Rényi, Scc, and Watts-Strogatz graphs, but
improves the values for Grounded and Stable. This leads to the assumption that
the graph types are of different difficulty for the network to learn. The fact that
98.86% (Scc) or even 99.89% (Watts-Strogatz) of the graphs’ nodes belong to
one class supports this assumption. Classifying such unevenly distributed classes
is quite a difficult task for a neural network.
Another observation is that the set of Barabási-Albert graphs is the only one where the majority of instances is in the class Yes. This might help in creating a dataset with more evenly distributed classes. Generally, it is certainly helpful
to have some graphs with more Yes instances in a dataset in order to generate
more diversity. Having a diverse dataset is a vital aspect when training neural
networks. Otherwise, the network might overfit to irrelevant features or might
not work for some application scenarios.
Dataset Size. Besides the influence of a dataset’s diversity, the amount of data
also has an impact on the training process. Table 4 shows some classification
results for the different datasets described in Sect. 4.1. As expected, it indicates
that bigger training sets have a greater potential to improve classification results.
Nonetheless, utilizing more training data does not automatically mean better
results. As displayed in Table 4, adding more than 50 graphs of each type does
not yield a significant increase in accuracy. The values for overall accuracy and
accuracy for class No do not change much at all (both less than 3.5%) when
adding more training data. It is, however, crucial to look into the accuracy of class Yes as well as the F1 scores, because they indicate whether the network actually learned some features of a preferred extension, instead of guessing No for all
instances. Training with 25 graphs per type (150 in total) already results in
20.25% accuracy of class Yes—only 1.85% less than a training with a total of 600
graphs yields. Training with 50 graphs per type increases the accuracy for Yes
by another 1.45%, which may still be regarded as significant when considering
that the difference to the next bigger training set is merely 0.04%. In summary,
Table 5. Classification results after training with a more balanced dataset in regard
to instances per class.
the increase in accuracy for class Yes rather quickly starts stagnating when
more data is added.
[Table 5 body only partially recovered: Accuracy total 0.62 / 0.63; Accuracy No 0.97 / 0.93.]
Training with a more balanced dataset (in respect of instances per class) also leads to more balanced results. Since the test set consists of 77.0% instances of class No, the total accuracy does not increase, though.
Competition Data. In order to get a sense of how the training results transfer to other data, two differently trained models are tested on the competition data (see Sect. 4.1). The first model is trained with the 50-of-each dataset. The learning rate is set to 0.01, dropout to 0.05, and the number of epochs to 500. The second model uses the same settings, but is trained with the more balanced dataset containing 127 Barabási-Albert graphs and 100 others as illustrated above. Figure 2 displays a comparison of the results. The overall accuracy is very similar for both training sets: about 17% lower than for the regular test set, and the class-specific accuracy values are lower, too. This might be due to the benchmark dataset containing graphs that are smaller or larger than the ones in the training set. Also, additional types of graphs are included in the benchmark dataset.
or not. While the lowest value is at 0.002 s, the highest one is at 19.27 s—which is about 8674 times as much. It is also worth noting that, since evaluating the whole test set takes the GCN 0.22 s, classifying a single argument takes an average of 7·10⁻⁶ s = 0.000007 s. That means that the minimal amount of time CoQuiAAS needed to evaluate an argument is still 317 times as much as the average amount of time the GCN takes. We only report the mean runtime for the GCN approach, as its classification time is independent of the instance and only polynomial in the size of the trained network; the GCN approach thus has constant runtime wrt. the size of the instance.
Of course, one needs to consider that a neural network also needs time for
training and possibly for preprocessing. Using the GCN framework, the training
process took approximately between 20 min and two hours—depending on the
dataset size and the parameter settings such as number of epochs or learning
rate. For other network models and frameworks, training might take a lot longer.
Nonetheless, once sufficient data is provided and the network is trained, it can
be used for any test set and it is extremely fast.
5 Conclusion
All in all, the success of training a graph convolutional network on abstract argumentation frameworks in order to decide whether an argument is included in a preferred extension or not was rather moderate. The overall accuracy never exceeded 80.5% under any circumstances. When testing with benchmark data, it was even lower (63%). However, extending the diversity of the training set, for instance by adding different-sized graphs or by adding new types of graphs, might improve this result.
Furthermore, training a neural network model involves adjusting a great number of parameters, some of which depend on each other. Considering that training a neural network requires careful adaptation of the training data, the parameter settings, and the network architecture itself, and that some aspects also affect others, examining all reasonable possibilities exceeds the scope of this work.
The training results are moderate: on the one hand, the overall classification accuracy does not exceed 80.5%, which is not good enough for practical applications; on the other hand, it shows that the network learned at least some rudimentary features of a preferred extension. The fact that instances from
both classes can be classified correctly reinforces this statement. The accuracy
for class Yes is far lower (<30%) than the accuracy for class No (>90%) in all
training procedures. A reason for this effect may be that the majority of the
training data is not included in an extension and thus labelled as No. Using a training set in which the distribution of instances per class is more balanced counteracts this effect to some degree. Using benchmark data for testing leads
to an overall accuracy of about 63%. The decrease in accuracy in comparison to
the specifically generated test set might be due to graph sizes and types that are
unknown to the network model, as they were not included in the training data.
Moreover, a GCN’s classification process is very time efficient: the entire test
set (30,603 arguments) is classified in <0.5 s. For comparison: the SAT solver
CoQuiAAS takes about an hour for the same dataset.
Generally, neural networks seem to be suited to perform the task of classi-
fying arguments as “included in a preferred extension” or “not included in a
preferred extension”. After all, it did work to a certain degree. Nevertheless,
the chosen network architecture seems to be inadequate for the task of abstract
argumentation. It is quite possible that a different network architecture leads
to better results. For example, an increased number of layers in a network or more neurons per layer may increase the network's ability to learn more complex features. The results gathered in this paper show signs of underfitting, so a deeper network would be a plausible strategy. Besides, GCNs were originally constructed to process undirected graphs, yet argumentation frameworks are represented as directed graphs. If a better suited neural network is found, the next step could be to expand the classification problem to a regression problem by training the network to predict entire extensions, or even all possible extensions of an argumentation framework.
References
1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. OSDI 16,
265–283 (2016)
2. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod.
Phys. 74(1), 47 (2002)
3. Atkinson, K., et al.: Toward artificial argumentation. AI Mag. 38(3), 25–36 (2017)
4. Besnard, P., Hunter, A.: Constructing argument graphs with deductive arguments:
a tutorial. Argum. Comput. 5(1), 5–30 (2014)
5. Cerutti, F., Gaggl, S.A., Thimm, M., Wallner, J.P.: Foundations of implementations for formal argumentation. In: Baroni, P., Gabbay, D., Giacomin, M., van der Torre, L. (eds.) Handbook of Formal Argumentation, chap. 15. College Publications, London (2018)
6. Cerutti, F., Giacomin, M., Vallati, M.: Generating challenging benchmark AFs. In:
COMMA, vol. 14, pp. 457–458 (2014)
7. Cerutti, F., Oren, N., Strass, H., Thimm, M., Vallati, M.: A benchmark framework
for a computational argumentation competition. In: COMMA, pp. 459–460 (2014)
1 Introduction
The use of causal interaction models has become popular as a technique for
simplifying probability acquisition upon building Bayesian networks for real-
world applications. These interaction models essentially impose specific patterns
of interaction among the causal influences on an effect variable, by means of a
parameterised conditional probability table for the latter variable. The number
of parameters involved in this table is typically linear in the number of causes involved, whereas the full table itself is exponentially large in this number. Various causal interaction models have been designed for use in Bayesian
networks, the best known among which are the (leaky) noisy-or model and its
generalisations (see for example [4,11,17]).
While a causal interaction model describes a conditional probability table for the effect variable in a causal mechanism by a linear number of parameters, most software packages for inference with the embedding Bayesian network
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 38–51, 2019.
https://doi.org/10.1007/978-3-030-35514-2_4
The Hidden Elegance of Causal Interaction Models 39
require the fully specified table. This full probability table is then generated
from the parameters and the definition of the interaction model used, prior to
the inference. Using fully expanded probability tables is associated with two
serious disadvantages, however. Firstly, the size of the full table is exponential
in the number of cause variables involved in a causal mechanism, which causes both the specification size of the network and the runtime complexity of inference to increase substantially. Secondly, using full tables has the engineering
disadvantage that the modelling decision to impose a specific pattern of causal
interaction is no longer explicit in the representation, as a consequence of which
the intricate dependencies between the cells of the table are effectively hidden.
For richly-connected Bayesian networks with large numbers of cause variables
per effect variable, as found for example from probabilistic relational models [7],
inference scales poorly and quickly becomes infeasible. Over the last decades
therefore, researchers have addressed ways to ameliorate the representational
and inferential complexity of using fully expanded probability tables with causal
interaction models. One such approach has focused on the design of tailored
inference algorithms for noisy-or Bayesian networks, which trade off general
applicability and runtime efficiency; these algorithms in essence exploit the structured specification of the noisy-or model for all variables upon inference (see
for example [5,6,8,12,15]). While experimental results underline their scalability
for noisy-or networks, these tailored algorithms are not easily integrated with
current algorithms for probabilistic inference in general. Another approach to
tackling the representational and inferential complexity of using fully expanded
probability tables for causal interaction models, has focused on the design of
more concise representations of causal mechanisms; these alternative representations in essence are distilled automatically from the interaction models at hand
and allow use of general inference algorithms (see for example [9,10,16,18,19]).
In this paper we reconsider and integrate some of the early work in which causal mechanisms with interaction models are represented by alternative graphical structures and probability tables. We demonstrate that interaction models with specific decomposition properties can be represented efficiently by an alternative structure with associated small tables that have an intuitively appealing semantics. This alternative structure can be readily embedded in a general Bayesian network and thereby allows for inference without the necessity of preprocessing tables or using tailored algorithms. We further argue that this alternative representation induces elegant properties from an engineering perspective which allow more ready maintenance and safer fine-tuning of parameters than the use of fully expanded probability tables in causal mechanisms.
The paper is organised as follows. In Sect. 2, we briefly review causal interaction models, and the (leaky) noisy-or model more specifically. In Sect. 3, we
reconsider the partition of causal interaction models into a deterministic function
and associated independent noise variables, and demonstrate when and how the
underlying deterministic function can be decomposed. Based on these insights,
we derive our alternative cascading representation and study its properties in
Sect. 4. We conclude the paper in Sect. 5.
40 S. Renooij and L. C. van der Gaag
Fig. 1. A causal mechanism M(n) with n cause variables Ci and the effect variable E
(left); a conditional probability table imposed by the noisy-or model, for n = 3 (right).
2 Preliminaries
We briefly review causal interaction models for Bayesian networks and thereby
introduce our notational conventions. In this paper, we focus on binary random
variables, which are denoted by (possibly indexed) capital letters X. The values
of such a variable X are denoted by small letters; more specifically, we write x̄ and x to denote absence and presence, respectively, of the concept modelled by
X. (Sub)sets of variables are denoted by bold-face capital letters X and their
joint value combinations by bold-face small letters x; Ω(X) is used to denote
the domain of all value combinations of X. We further consider joint probability
distributions Pr over sets of variables, represented by a Bayesian network.
Within Bayesian networks, we consider causal 1 mechanisms M(n) composed
of a single effect variable E and one or more cause variables Ci , i = 1, . . . , n, with
arcs pointing to E; Fig. 1 (left) illustrates the basic idea of such a mechanism.
For the effect variable E of a causal mechanism, a conditional probability table is
specified, with distributions Pr(E | C) over E for each joint value combination c
for its set C of cause variables; this table thus specifies a number of distributions
that is exponential in the number of cause variables involved.
A causal interaction model for a causal mechanism M(n) takes the form of
a parameterised probability table for the effect variable involved. The noisy-or
model [17], which is the best known among these interaction models, defines the
conditional probability table for the effect variable E of M(n) through
– the conditional probability Pr(e | c̄1 , . . . , c̄n ) = 0;
– the parameters pi = Pr(e | c̄1 , . . . , c̄i−1 , ci , c̄i+1 , . . . , c̄n ), for all i = 1, . . . , n;
– the definitional rule Pr(e | c) = 1 − Π_{i∈Ic} (1 − pi ) for the probabilities given the remaining value combinations c involving the presence of two or more causes, where Ic is the set of indices of the present causes ci in c.
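The three clauses above can be turned into a small routine that expands the noisy-or parameters into the full conditional probability table. This is an illustrative sketch of ours, not part of the original text; the function name is hypothetical.

```python
from itertools import product

def noisy_or_cpt(p):
    """Expand noisy-or parameters p = (p1, ..., pn) into the full table Pr(e | c).

    Follows the three clauses: Pr(e | all causes absent) = 0,
    Pr(e | only ci present) = pi, and in general
    Pr(e | c) = 1 - prod_{i in Ic} (1 - pi), Ic indexing the present causes.
    """
    table = {}
    for c in product([False, True], repeat=len(p)):  # False = absent, True = present
        absent_effect = 1.0
        for present, pi in zip(c, p):
            if present:
                absent_effect *= 1.0 - pi
        table[c] = 1.0 - absent_effect
    return table

cpt = noisy_or_cpt([0.8, 0.5, 0.4])                  # n = 3 causes, as in Fig. 1 (right)
```

Note that the table has 2^n entries, while only the n parameters are free, which is exactly the redundancy the cascading representation of Sect. 4 removes.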
Figure 1 (right) illustrates the parameterised table of the noisy-or model for a
mechanism with three cause variables. For a causal mechanism M(n), the model
1 Although we do not make any claim with respect to causal interpretation, we adopt the terminology commonly used.
Fig. 2. Partition of a causal interaction model into a probabilistic noise part and a deterministic functional part (left); a chain decomposition for a commutative and associative deterministic function (right).
– En has the noise variable Zn for its single parent and encodes the function
application En = f (Zn , I), where the variable I captures identity under f ;
– for all i = 1, . . . , n − 1, the variable Ei has Zi and Ei+1 for its parents and
encodes Ei = f (Zi , Ei+1 ).
Fig. 3. The cascading representation of a causal interaction model, which results from
marginalising out the noise variables Zi from its chain decomposition.
equals 4·n − 2, however, instead of the 2^n probabilities required for the effect variable E in the original partition. For an interaction model with a leak variable, the number of required probabilities for the effect variable(s) is reduced from 2^(n+1) to 4·n. We will return to these observations in further detail in Sect. 4.
While the original motivation for partitioning causal interaction models was to underline their induced ease of knowledge acquisition, Heckerman noted that the introduction of the hidden noise variables Zi in fact made probability elicitation harder rather than easier, as “assessments are easier to elicit (and presumably more reliable) when a person makes them in terms of observable variables” [9]. Following this insight, he proposed a temporal interpretation of independence of causal influences for causal interaction models in which a cause Ci is assumed to occur (or not) at time i and has associated its own effect variable Ei indicating the effect after the presence or absence of the first i causes have been observed. With this temporal interpretation, the hidden noise variables are no longer required and the effect variables Ei have in fact become observable variables with a clear semantics supporting probability elicitation. As noted already
by Heckerman himself, this temporal interpretation for causal interaction models
has reduced applicability as its main drawback [9,10].
Pr(ē1 | c) = Σ_{e− ∈ Ω(E−)} Pr(ē1 | c1 , e2 ) · Π_{k=2}^{n−1} Pr(ek | ck , ek+1 ) · Pr(en | cn )    (4)
where Ω(E− ) is the domain of the variable set E− = {E2 , . . . , En }, and where
ek ∈ Ω(Ek ), k = 2, . . . , n, is consistent with e− and ck ∈ Ω(Ck ), k = 1, . . . , n,
is consistent with c. We emphasize that we focus on the value ē1 of the variable E1 rather than on the value e1 , to simplify our arguments in the sequel.
We now illustrate the derivation of the probability tables for the cascading
representations of the noisy-or and leaky noisy-or models, and demonstrate
their equivalence to the standard causal-mechanism representation.
Pr(ei | c̄i , ēi+1 ) = 1 · 0 + 0 · 1 = 0
Pr(ei | ci , ēi+1 ) = 1 · pi + 0 · (1 − pi ) = pi
Pr(ei | c̄i , ei+1 ) = 1 · 0 + 1 · 1 = 1
Pr(ei | ci , ei+1 ) = 1 · pi + 1 · (1 − pi ) = 1
Pr(en | c̄n ) = 1 · 0 + 0 · 1 = 0
Pr(en | cn ) = 1 · pn + 0 · (1 − pn ) = pn
Pr(ē1 | c) = Π_{i=1}^{n−1} Pr(ēi | ci , ēi+1 ) · Pr(ēn | cn )    (5)
Table 1. For the two representations of the noisy-or model for a causal mechanism
M(n): the number of variables (#variables), the number of non-redundant probabilities
for the effect variable(s) (#probabilities), and of those, the number of free parameters
to be acquired (#free) and the number of zeroes and ones (#0/1).
– Where the noisy-or model has Pr(e | c) = pi for c including the single present cause ci , we have in the cascading representation that the product term contributed for the variable Ei has the probability Pr(ēi | ci , ēi+1 ) = 1 − pi or, in case i = n, Pr(ēn | cn ) = 1 − pn . As all other terms in the product of Eq. 5 equal 1, we find that Pr(ē1 | c) = 1 − pi and, hence, Pr(e1 | c) = pi .
– For any value combination c including multiple present causes, with their indices in Ic , the noisy-or model has Pr(e | c) = 1 − Π_{i∈Ic} (1 − pi ). In the cascading representation, the product term contributed by any Ej with j ∉ Ic equals 1 and the term by any Ei with i ∈ Ic is 1 − pi . We thus find that Pr(ē1 | c) = Π_{i∈Ic} (1 − pi ) and, hence, Pr(e1 | c) = 1 − Π_{i∈Ic} (1 − pi ).
From the three cases above, we conclude that the cascading representation indeed
correctly captures the noisy-or model and, hence, that the cascading representa-
tion is equivalent with the fully expanded probability table for the effect variable
E in a causal mechanism with a noisy-or model.
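The claimed equivalence can also be checked numerically: the sketch below computes Pr(e1 | c) once from the noisy-or definition and once via the product of Eq. (5), for every value combination c. This is our own verification code, not from the paper; the parameter values are hypothetical.

```python
from itertools import product

def noisy_or(p, c):
    """Pr(e | c) from the fully expanded noisy-or table."""
    prob_abs = 1.0
    for present, pi in zip(c, p):
        if present:
            prob_abs *= 1.0 - pi
    return 1.0 - prob_abs

def cascade(p, c):
    """Pr(e1 | c) via the product of Eq. (5) over the cascading variables Ei."""
    n = len(p)
    prob_bar = (1.0 - p[n - 1]) if c[n - 1] else 1.0   # Pr(en-bar | cn)
    for i in range(n - 1):
        prob_bar *= (1.0 - p[i]) if c[i] else 1.0      # Pr(ei-bar | ci, e(i+1)-bar)
    return 1.0 - prob_bar

p = [0.8, 0.5, 0.4, 0.3]
agree = all(abs(noisy_or(p, c) - cascade(p, c)) < 1e-12
            for c in product([False, True], repeat=len(p)))
```

The flag `agree` confirms that both computations coincide on all 2^n value combinations.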
The cascading representation of the noisy-or model is a more efficient representation than a causal mechanism M(n) with a full probability table for the effect variable E, despite the increase in number of variables to 2·n compared to the n + 1 variables in the standard representation. More specifically, the cascading representation requires 4·(n − 1) + 2 conditional probability distributions in total for the variables Ei , of which 3·(n − 1) + 1 are degenerate. For ease of reference, Table 1 summarises a comparison of the size of the cascading representation with that of the standard representation. We note that the cascading representation is more concise when a causal mechanism would include n ≥ 4 cause variables for the effect variable of interest.
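The size comparison above can be verified with a few lines (a trivial check of the stated counts; the function names are ours):

```python
def full_table_size(n):
    return 2 ** n                # non-redundant probabilities in the full table

def cascade_size(n):
    return 4 * (n - 1) + 2       # non-redundant probabilities in the cascading form

# smallest n for which the cascading representation is strictly more concise
crossover = min(n for n in range(1, 20) if cascade_size(n) < full_table_size(n))
```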
The cascading leaky noisy-or. We now briefly address the cascading representation of the noisy-or model in the presence of a leak probability, which differs from that of the standard noisy-or model only in the specification of the probability table for the variable En , which is derived from Eq. 2 as

Pr(en | c̄n ) = pL
Pr(en | cn ) = pL + pn · (1 − pL ) = 1 − (1 − pL ) · (1 − pn )
captures the leaky noisy-or model, we use Eq. 5 again, now for the different cases distinguished by the leaky noisy-or model. We observe that, while with the noisy-or model, the variable En would contribute to the product either Pr(ēn | c̄n ) = 1 or Pr(ēn | cn ) = 1 − pn , it contributes either Pr(ēn | c̄n ) = 1 − pL or Pr(ēn | cn ) = (1 − pL ) · (1 − pn ) in the cascading representation of the leaky noisy-or model. As a consequence
Pr(ei | c) = Σ_{e− ∈ Ω(E−)} Pr(ei | ci , ei+1 ) · Π_{k=i+1}^{n−1} Pr(ek | ck , ek+1 ) · Pr(en | cn )
where Ω(E− ) now is the domain of E− = {Ei+1 , . . . , En }, and ek , ck are defined
as before. As each variable Ei in the cascading representation represents the
effect variable in a (leaky) noisy-or model with the cause variables Ci , . . . , Cn , it
has an intuitive meaning that allows for explicit embedding of the representation
in a network without hampering interpretation and probability elicitation.
[Pr(e | ci , ck )] (x) = Σ_{c− ∈ Ω(C−)} [Pr(e | ci , ck , c− )] (x) · Pr(c− | ci , ck )    (6)
[Pr(ē1 | ci , ck , c− )] (x) = [ (1 − pi ) · (1 − pk ) · Π_{j∈Ic−} (1 − pj ) ] (x)
                            = (1 − x) · (1 − pk ) · Π_{j∈Ic−} (1 − pj )
where Ic− indexes all present causes in C− and, for ease of exposition, we again focus on the value ē1 for variable E1 . As a result, we find that
[Pr(ē1 | ci , ck )] (x) = (1 − x) · (1 − pk ) · Σ_{c− ∈ Ω(C−)} Π_{j∈Ic−} (1 − pj ) · Pr(c− | ci , ck )
and conclude that the function [Pr(e1 | ci , ck )] (x) is in fact a linear function of the form a · x + b with constants a, b, where

a = (1 − pk ) · Σ_{c− ∈ Ω(C−)} Pr(c− | ci , ck ) · Π_{j∈Ic−} (1 − pj )
b = 1 − a
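Under the additional assumption that the cause variables in C− are mutually independent with prior presence probabilities qj (so that the sum over Ω(C−) factorises), the linearity and the property b = 1 − a can be checked numerically. The sketch below is our own illustration; the parameter and prior values are hypothetical.

```python
def pr_e1_given_ci_ck(x, p, q, i, k):
    """Pr(e1 | ci, ck) as a function of x = pi, assuming the remaining causes
    C- are mutually independent with prior presence probabilities q[j].
    The sum over Omega(C-) then factorises: each cause j contributes
    (1 - q[j]) * 1 + q[j] * (1 - p[j]) = 1 - q[j] * p[j]."""
    rest = 1.0
    for j in range(len(p)):
        if j not in (i, k):
            rest *= 1.0 - q[j] * p[j]
    return 1.0 - (1.0 - x) * (1.0 - p[k]) * rest   # 1 - Pr(e1-bar | ci, ck)

p = [0.8, 0.5, 0.4, 0.3]      # noisy-or parameters
q = [0.2, 0.6, 0.5, 0.1]      # hypothetical prior presence probabilities
f0 = pr_e1_given_ci_ck(0.0, p, q, 0, 1)
f1 = pr_e1_given_ci_ck(1.0, p, q, 0, 1)
fh = pr_e1_given_ci_ck(0.5, p, q, 0, 1)
a, b = f1 - f0, f0            # slope and intercept of the sensitivity function
```

Evaluating the function at three points suffices to confirm both linearity (the midpoint lies on the line through the endpoints) and the relation b = 1 − a derived above.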
References
1. Castillo, E., Gutiérrez, J.M., Hadi, A.S.: Sensitivity analysis in discrete Bayesian
networks. IEEE Trans. Syst. Man Cybern. 27, 412–423 (1997)
2. Chan, H., Darwiche, A.: Sensitivity analysis in Bayesian networks: From single
to multiple parameters. In: Halpern, J., Meek, C. (eds.) Proceedings of the 20th
Conference on Uncertainty in Artificial Intelligence, pp. 67–75 (2004)
The Hidden Elegance of Causal Interaction Models 51
3. Coupé, V.M.H., van der Gaag, L.C.: Properties of sensitivity analysis of Bayesian
belief networks. Ann. Math. Artif. Intell. 36, 323–356 (2002)
4. Dı́ez, F.J., Druzdzel, M.J.: Canonical Probabilistic Models for Knowledge Engi-
neering. Technical Report CISIAD-06-01 (2007)
5. Dı́ez, F.J., Galán, S.F.: Efficient computation for the noisy max. Int. J. Intell. Syst.
18, 165–177 (2003)
6. Frey, B.J., Patrascu, R., Jaakkola, T., Moran, J.: Sequentially fitting inclusive trees
for inference in noisy-or networks. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.)
Advances in Neural Information Processing Systems 13, pp. 493–499. MIT Press,
Cambridge (2001)
7. Getoor, L.: Learning Statistical Models from Relational Data. PhD Thesis.
Stanford University (2001)
8. Heckerman, D.: A tractable inference algorithm for diagnosing multiple diseases.
In: Henrion, M., Kanal, L., Lemmer, J., Shachter, R. (eds.) Proceedings of the 5th
Conference on Uncertainty in Artificial Intelligence, pp. 163–172 (1989)
9. Heckerman, D.: Causal independence for knowledge acquisition and inference. In:
Heckerman, D., Mamdani, E. (eds.) Proceedings of the 9th Conference on Uncer-
tainty in Artificial Intelligence, pp. 122–127 (1993)
10. Heckerman, D., Breese, J.: Causal independence for probability assessment and
inference using Bayesian networks. IEEE Trans. Syst. Man Cybern. 26, 826–831
(1996)
11. Henrion, M.: Some practical issues in constructing belief networks. In: Kanal, L.N.,
Levitt, T.S., Lemmer, J.F. (eds.) Uncertainty in Artificial Intelligence 3, pp. 161–
173. Elsevier (1989)
12. Huang, K., Henrion, M.: Efficient search-based inference for noisy-or belief net-
works: TopEpsilon. In: Horvitz, E., Jensen, F. (eds.) Proceedings of the 12th Con-
ference on Uncertainty in Artificial Intelligence, pp. 325–331 (1996)
13. Jesus, P., Baquero, C., Almeida, P.S.: A survey of distributed data aggregation
algorithms. IEEE Commun. Surv. Tutorials 17, 381–404 (2011)
14. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Tech-
niques. The MIT Press, Cambridge (2009)
15. Li, W., Poupart, P., van Beek, P.: Exploiting structure in weighted model counting
approaches to probabilistic inference. J. Artif. Intell. Res. 40, 729–765 (2011)
16. Olesen, K.G., et al.: A MUNIN network for the median nerve: a case study on
loops. Appl. Artif. Intell. 3, 385–403 (1989)
17. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, Burlington (1988)
18. del Sagrado, J., Salmerón, A.: Representing canonical models as probability trees.
In: Conejo, R., Urretavizcaya, M., Pérez-de-la-Cruz, J.-L. (eds.) CAEPIA/TTIA
-2003. LNCS (LNAI), vol. 3040, pp. 478–487. Springer, Heidelberg (2004). https://
doi.org/10.1007/978-3-540-25945-9 47
19. Zhang, N.L., Yan, L.: Independence of causal influence and clique tree propagation.
Int. J. Approximate Reasoning 19, 335–349 (1998)
Computational Models for Cumulative
Prospect Theory: Application
to the Knapsack Problem Under Risk
1 Introduction
This leads us to propose a MIP formulation for the Knapsack problem under
risk. This model is tested on families of instances of different sizes. In Sect. 5 we
consider a special case where the probability weighting functions used in CPT
are piecewise linear with a bounded number of pieces. Under this assumption,
we propose another MIP formulation, more compact and easier to solve, for the
same problem.
2 Related Work
CPT has already been used in AI, e.g., to develop a risk-sensitive reinforcement
learning approach in a traffic signal control application [16]. CPT has also been used in a
number of decision support applications. For example, an application of CPT to
the multi-objective optimization of a bus network is proposed in [9]. However, in
this case study, the set of alternatives is explicitly defined and does not require
optimization techniques.
The Knapsack Problem (KP) under consideration in this paper consists in
selecting a subset of items under a budget constraint. This problem has some
links with the portfolio selection problem, which can be seen as the continuous
relaxation of the KP under risk. The application of CPT to portfolio selection and
insurance demand has been studied in finance (see e.g. [3]), with a computa-
tional model solvable under some specific assumptions (S-shaped functions, risk-
free reference point and/or linear utility functions). Besides CPT, several LP-
computable measures of dispersion have been introduced to control the risk attached
to portfolios: let us mention the mean absolute deviation and Gini's mean dif-
ference (GMD) as basic LP-computable risk measures, and the worst realization
(Minimax) and the Conditional Value-at-Risk (CVaR) as basic LP-computable
safety measures [10,11]. Moreover, in the latter reference, computational issues
related to the solution of portfolio models with integrity constraints are investi-
gated and a matheuristic called Kernel Search is proposed. These contributions
do not consider the use of bipolar valuation scales as in CPT.
In multicriteria analysis, there is also an increasing interest in modeling dif-
ferent attitudes in the aggregation depending on whether evaluations lie on the
positive or the negative side. For example, the Choquet integral has been extended
to the bipolar case in [2,8], but the optimization aspects attached to general bipo-
lar Choquet integrals have not been investigated. Very recently, some LP-solvable
models have been proposed [12] for a subclass of bipolar Choquet integrals named
biOWA (for bipolar ordered weighted averages). However, biOWA are symmetric
functions of their arguments and cannot account for decision under risk
when scenarios have different probabilities. Finally, an LP-solvable model was
proposed for a weighted extension of OWA operators [13], but it does not consider
the case of bipolar scales. In this paper, we introduce computational
models solvable by mixed-integer linear programming to determine CPT-optimal
solutions in implicit decision spaces.
Computational Models for CPT 55
This model clearly generalizes the Expected Utility (EU) model, which is
obtained for ϕ(p) = p for all p ∈ [0, 1]. Moreover, it also includes the dual
model of EU known as Yaari's model [25] as a special case (when u is linear).
Nonetheless, this model is not always sufficient to account for decision behaviors
observed when decision makers think of outcomes relative to a certain reference
point. The utility scale is treated as an interval scale and preferences are not
impacted by positive affine transformations. Thus, 0 has no specific status in
the valuation scale, nor does any other constant. This may prevent accounting for
some sophisticated decision behaviors, as illustrated in the following example:
56 H. Martin and P. Perny

Example 2. We look for an optimal path from a source node to a sink node in a
network represented by a directed graph. The arcs of the graph are endowed with
vectors representing the algebraic payoff attached to the arc (which can represent
a gain or a loss) under two possible scenarios of equal probability. For example, the
valuation (−2, 3) means that the outcome will be a loss of 2 in scenario 1 and a
gain of 3 in scenario 2. Outcomes are assumed to be additive along a path and we
assume that u(z) = z. This problem can represent several situations (e.g., a path
planning problem or investment planning problem, both under uncertainty).
Let us consider two different instances of this problem, characterized by two
different graphs with nodes {s, a, b, t} and {s', c, d, t'} respectively. The graphs
are presented below (Fig. 1).
On the left-hand side, the upper and lower s-t-paths have utilities (9, 3) and
(5, 5) respectively. We assume here that the DM prefers the former path because
she maximizes the expected outcome when all evaluations are positive. In the
instance given on the right-hand side, the upper and lower s'-t'-paths respectively
have utilities (−1, −7) and (−5, −5). Here the DM may exhibit a more cautious
attitude towards risk due to the presence of negative outcomes. Let us assume
that she prefers the latter solution because the outcome in the worst-case
scenario is better. Hence, to model these preferences with RDU we must ful-
fill the following constraints: fϕ(9, 3) > fϕ(5, 5) and fϕ(−7, −1) < fϕ(−5, −5).
The former inequality implies that 3 + ϕ(1/2) × (9 − 3) > 5 and therefore ϕ(1/2) > 1/3.
Moreover, the latter inequality implies −7 + ϕ(1/2) × (−1 + 7) < −5 and therefore
ϕ(1/2) < 1/3, which yields a contradiction. Hence RDU is not able to represent the
observed preferences.
To overcome the descriptive limitations illustrated in the above example,
we consider now the Cumulative Prospect Theory model (CPT for short), first
introduced in [7].
Definition 2. Let x ∈ Rⁿ be the outcome vector such that x_(1) ≤ . . . ≤ x_(j−1) <
0 ≤ x_(j) ≤ . . . ≤ x_(n) with j ∈ {0, . . . , n}; the Cumulative Prospect Theory model is
characterized by the following evaluation function:

g^u_{ϕ,ψ}(x) = Σ_{i=1}^{n} w_i u(x_(i)),   with
w_i = ϕ(Σ_{k=i}^{n} p_(k)) − ϕ(Σ_{k=i+1}^{n} p_(k))   if i ≥ j,
w_i = ψ(Σ_{k=1}^{i} p_(k)) − ψ(Σ_{k=1}^{i−1} p_(k))   if i < j.   (3)
where ϕ and ψ are two real-valued increasing functions from [0, 1] to [0, 1] that
assign 0 to 0 and 1 to 1, and u is a continuous and increasing real-valued utility
function such that u(0) = 0 (hence u(x) and x have the same sign).
It can easily be checked that whenever ϕ(p) = 1 − ψ(1 − p) for all p ∈ [0, 1]
(duality), CPT boils down to RDU. The use of non-dual probability weight-
ing functions ϕ and ψ, depending on the sign of the outcomes under consideration,
enables modeling shifts of behavior relative to the reference point (here 0). Let
us come back to Example 2 under the assumption that u(z) = z for all z ∈ R.
We have: gϕ,ψ(9, 3) = [ϕ(1) − ϕ(1/2)]3 + [ϕ(1/2) − ϕ(0)]9 = 3 + 6ϕ(1/2) since ϕ(0) = 0
and ϕ(1) = 1. Similarly gϕ,ψ(5, 5) = [ϕ(1) − ϕ(1/2)]5 + [ϕ(1/2) − ϕ(0)]5 = 5. Hence
gϕ,ψ(9, 3) > gϕ,ψ(5, 5) implies ϕ(1/2) > 1/3 (*).
On the other hand, we have gϕ,ψ(−7, −1) = [ψ(1/2) − ψ(0)](−7) + [ψ(1) −
ψ(1/2)](−1) = −1 − 6ψ(1/2) since ψ(0) = 0 and ψ(1) = 1. Similarly gϕ,ψ(−5, −5) =
−5. Hence gϕ,ψ(−7, −1) < gϕ,ψ(−5, −5) implies ψ(1/2) > 2/3, which does not yield
any contradiction. Thus, the DM's preferences can be modeled with gϕ,ψ.
As CPT boils down to RDU when ϕ(p) = 1 − ψ(1 − p) for all p ∈ [0, 1], it
is interesting to note that under this additional constraint ψ(1/2) > 2/3 implies
ϕ(1/2) < 1/3, which is incompatible with the constraint denoted (*) above, derived
from gϕ,ψ(9, 3) > gϕ,ψ(5, 5). This again illustrates the fact that RDU is not able
to describe such preferences.
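The computations of this example are easy to replay in code. The sketch below is our own helper (not from the paper), evaluating gϕ,ψ per Definition 2 with u(z) = z by default; with, e.g., the convex ϕ(p) = p^1.5 and the concave ψ(p) = √p we get ϕ(1/2) ≈ 0.354 > 1/3 and ψ(1/2) ≈ 0.707 > 2/3, so both preferences of Example 2 are satisfied simultaneously:

```python
def cpt_value(outcomes, probs, phi, psi, u=lambda z: z):
    """g_{phi,psi}(x) of Definition 2: rank outcomes, weight gains with phi on
    cumulative tail probabilities and losses with psi on cumulative head ones."""
    ranked = sorted(zip(outcomes, probs))  # x_(1) <= ... <= x_(n)
    x = [o for o, _ in ranked]
    p = [q for _, q in ranked]
    total = 0.0
    for i in range(len(x)):
        if x[i] >= 0:  # gain: w_i = phi(sum_{k>=i} p_(k)) - phi(sum_{k>i} p_(k))
            w = phi(sum(p[i:])) - phi(sum(p[i + 1:]))
        else:          # loss: w_i = psi(sum_{k<=i} p_(k)) - psi(sum_{k<i} p_(k))
            w = psi(sum(p[:i + 1])) - psi(sum(p[:i]))
        total += w * u(x[i])
    return total

phi = lambda q: q ** 1.5  # convex, phi(1/2) ~ 0.354 > 1/3
psi = lambda q: q ** 0.5  # concave, psi(1/2) ~ 0.707 > 2/3
```

With these choices, cpt_value([9, 3], [0.5, 0.5], phi, psi) ≈ 5.12 > 5 and cpt_value([-7, -1], [0.5, 0.5], phi, psi) ≈ −5.24 < −5, matching the preferences that RDU could not represent.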
Strong Risk Aversion in CPT. In many situations decision makers are risk-
averse. It is therefore useful to further specify CPT for risk-averse agents. We
consider here strong risk aversion, which is standardly defined from second-order
stochastic dominance. For any random variable X, let G_X be the tail distribution
defined by G_X(x) = P(X > x), with P a probability function. Let X, Y be two
random variables; X stochastically dominates Y at the second order if and only
if ∫_{−∞}^{x} G_X(t)dt ≥ ∫_{−∞}^{x} G_Y(t)dt for all x. From this dominance relation, the
concept of mean-preserving spread, standardly used to define risk aversion, can
be introduced as follows: Y is said to derive from X using a mean-preserving
spread if and only if E(X) = E(Y) and X stochastically dominates Y at the
second order. We then have the following definition of strong risk aversion [18]:
Definition 3. Let ≽ be a preference relation. Strong risk aversion holds for ≽
if and only if X ≽ Y for all X and Y such that Y derives from X using a mean-
preserving spread.
We recall now the set of conditions that CPT must fulfill to model strong
risk aversion. These conditions were first established in [21].
Theorem 1. Strong risk aversion holds in CPT if and only if ϕ is convex, ψ
is concave, u is concave for losses and also concave for gains, and the following
condition is satisfied:

[u(x) − u(x − δ/q)] · [ψ(q + s) − ψ(s)] ≥ [u(y + δ/p) − u(y)] · [ϕ(p + r) − ϕ(r)]   (4)

for all x ≥ 0 ≥ y, δ ≥ 0, and p, q, r, s such that p + q + r + s ≤ 1, p, q > 0 and r, s ≥ 0.
We remark that, when u(z) = z for all z, condition (4) can be rewritten in
the following simpler form: [ψ(q + s) − ψ(s)]/q ≥ [ϕ(p + r) − ϕ(r)]/p for all p, q, r, s
such that p + q + r + s ≤ 1, p, q > 0 and r, s ≥ 0. In terms of derivatives, this means that
ψ′(s) ≥ ϕ′(r) for all r, s ≥ 0 such that r + s ≤ 1.
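This pointwise condition is straightforward to check numerically for given weighting functions. The following sketch (grid resolution, tolerances, and the function name are our choices) tests the simplified u(z) = z form of condition (4) on a discrete grid:

```python
from itertools import product

def strong_risk_aversion_linear_u(phi, psi, grid=20):
    """Grid check of [psi(q+s)-psi(s)]/q >= [phi(p+r)-phi(r)]/p
    for p + q + r + s <= 1 with p, q > 0 and r, s >= 0 (u(z) = z case)."""
    steps = [i / grid for i in range(grid + 1)]
    for p, q, r, s in product(steps, repeat=4):
        if p <= 0 or q <= 0 or p + q + r + s > 1 + 1e-9:
            continue
        lhs = (psi(q + s) - psi(s)) / q
        rhs = (phi(p + r) - phi(r)) / p
        if lhs < rhs - 1e-9:
            return False  # found a violating (p, q, r, s)
    return True
```

For instance, the identity pair ϕ(p) = ψ(p) = p passes, while ϕ(p) = p², ψ(p) = √p fails: ψ′(s) ≥ ϕ′(r) is violated for, e.g., s = 0.5 and r = 0.4.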
The above characterization of admissible forms of CPT for a risk-averse deci-
sion maker will be used in the next section to propose computational models for
the determination of CPT-optimal solutions on implicit sets. We conclude the
present section by making explicit a link between CPT and RDU model.
Linking RDU and CPT. Interestingly, CPT can be expressed as a difference
of two RDU values respectively applied to the positive and negative parts of
the outcome vector x, using the two distinct probability weighting functions ϕ
and ψ. This reformulation is well known in the literature on rank-dependent
aggregation functions (see e.g., [2]) and reads as follows:

g^u_{ϕ,ψ}(x) = f^{u⁺}_ϕ(x⁺) − f^{u⁻}_ψ(x⁻)   (5)

where fϕ denotes Yaari's model, obtained from f^u_ϕ when u(z) = z for all z. Similarly,
for a concave weighting function ψ, the dual function defined by ψ̄(p) = 1 − ψ(1 − p) for
all p ∈ [0, 1] is convex and has a non-empty core. Hence Proposition 1 can be
used again to establish the following result:
Proposition 2. If ψ is concave, we have fψ(x) = max_{λ ∈ core(ψ̄∘P)} λ·x.
Using Propositions 1 and 2 and Eq. (5), we obtain a new formulation of CPT
when ϕ and ψ are convex and concave, respectively.

Proposition 3. Let x ∈ Rⁿ. If ϕ is convex and ψ is concave, then we have:

gϕ,ψ(x) = min_{λ ∈ core(ϕ∘P)} λ·x⁺ − max_{μ ∈ core(ψ̄∘P)} μ·x⁻
Now, let us show that this new formulation can be used to optimize gϕ,ψ(x)
using linear programming. From Propositions 1 and 2, the values of fϕ(x) and
fψ(x) for any outcome vector x ∈ Rⁿ can be obtained as the solutions of the two
following linear programs, respectively:

min Σ_{i=1}^{n} λ_i x_i                          max Σ_{i=1}^{n} λ_i x_i
s.t. Σ_{i∈A} λ_i ≥ ϕ(P(A))   ∀A ⊆ N              s.t. Σ_{i∈A} λ_i ≤ ψ(P(A))   ∀A ⊆ N
     λ_i ≥ 0, i = 1, .., n                            λ_i ≥ 0, i = 1, .., n
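These LPs are small enough to solve directly for illustration. The sketch below is our own construction (not the paper's code), uses scipy, and assumes x ≥ 0 so that the minimum is bounded; it builds the left LP by enumerating all nonempty subsets A ⊆ N:

```python
from itertools import combinations
from scipy.optimize import linprog

def f_phi(x, p, phi):
    """f_phi(x) = min over lambda of lambda . x subject to
    sum_{i in A} lambda_i >= phi(P(A)) for every nonempty A (left LP above)."""
    n = len(x)
    A_ub, b_ub = [], []
    for size in range(1, n + 1):
        for A in combinations(range(n), size):
            # -sum_{i in A} lambda_i <= -phi(P(A))
            A_ub.append([-1.0 if i in A else 0.0 for i in range(n)])
            b_ub.append(-phi(sum(p[i] for i in A)))
    res = linprog(c=x, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n)
    return res.fun
```

For x = (9, 3), p = (1/2, 1/2) and the convex ϕ(p) = p², this returns 3 + 6ϕ(1/2) = 4.5, the rank-dependent value computed on the ranked outcomes.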
The left LP given above directly derives from Proposition 1. The right LP
given above derives from Proposition 2, after observing that the constraints
Σ_{i∈B} λ_i ≥ ψ̄(P(B)) for all B ⊆ N are equivalent to Σ_{i∈A} λ_i ≤ ψ(P(A))
for all A ⊆ N (by setting A = N \ B). Now, if we consider x as a variable vector,
we turn to the dual formulations of the above LPs to get rid of the quadratic terms:
max Σ_{A⊆N} ϕ(P(A)) × d_A                        min Σ_{A⊆N} ψ(P(A)) × d_A
s.t. Σ_{A⊆N: i∈A} d_A ≤ x_i, i = 1, .., n        s.t. Σ_{A⊆N: i∈A} d_A ≥ x_i, i = 1, .., n
     d_A ≥ 0   ∀A ⊆ N                                 d_A ≥ 0   ∀A ⊆ N
Finally, we obtain program P1 given below to optimize gϕ,ψ, with the assump-
tions that ϕ is convex, ψ is concave and that u(x) = x for all x ∈ Rⁿ.

(P1)  max Σ_{A⊆N} ϕ(P(A)) × d⁺_A − Σ_{A⊆N} ψ(P(A)) × d⁻_A
      s.t.  Σ_{A⊆N: i∈A} d⁺_A ≤ x⁺_i          i = 1, . . . , n
            Σ_{A⊆N: i∈A} d⁻_A ≥ x⁻_i          i = 1, . . . , n
            x_i = x⁺_i − x⁻_i                 i = 1, . . . , n
            0 ≤ x⁺_i ≤ z_i × M                i = 1, . . . , n
            0 ≤ x⁻_i ≤ (1 − z_i) × M          i = 1, . . . , n
            x ∈ X
            x⁺_i, x⁻_i, d⁺_A, d⁻_A ≥ 0        i = 1, .., n, ∀A ⊆ N
            z_i ∈ {0, 1}                      i = 1, . . . , n
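Before trusting a MIP encoding such as P1, it helps to have a ground-truth oracle on small instances. The sketch below is entirely ours (instance encoding, names, and the u(z) = z evaluator are assumptions, not the paper's code); it enumerates all feasible item subsets and returns the CPT-optimal one:

```python
from itertools import combinations

def cpt(x, p, phi, psi):
    """g_{phi,psi}(x) with u(z) = z: phi weights gains, psi weights losses."""
    xs, ps = zip(*sorted(zip(x, p)))
    v = 0.0
    for i, xi in enumerate(xs):
        if xi >= 0:
            v += xi * (phi(sum(ps[i:])) - phi(sum(ps[i + 1:])))
        else:
            v += xi * (psi(sum(ps[:i + 1])) - psi(sum(ps[:i])))
    return v

def cpt_knapsack_bruteforce(weights, payoffs, capacity, p, phi, psi):
    """payoffs[i][s]: payoff of item i under scenario s; p[s]: scenario prob."""
    n, S = len(weights), len(p)
    best_val, best_sel = float("-inf"), ()
    for size in range(n + 1):
        for sel in combinations(range(n), size):
            if sum(weights[i] for i in sel) > capacity:
                continue  # budget constraint violated
            outcome = [sum(payoffs[i][s] for i in sel) for s in range(S)]
            val = cpt(outcome, p, phi, psi)
            if val > best_val:
                best_val, best_sel = val, sel
    return best_sel, best_val
```

On the two-path instance of Example 2 recast as a knapsack (two mutually exclusive items of weight 1, capacity 1), this oracle picks the item with payoffs (9, 3) over (5, 5) when ϕ(p) = p^1.5.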
To overcome this limitation, we will now present a second computational model, with a
polynomial number of variables and constraints, which optimizes gϕ,ψ(x) under
some additional assumptions concerning ϕ and ψ (Table 1).
Then F_x^{(−1)}(u) = inf{y : F_x(y) ≥ u} returns the minimum perfor-
mance y such that the probability of scenarios whose performance is lower than
or equal to y is greater than or equal to u. Then, we define the tail function G_x,
for all α ∈ [0, 1], by:

G_x(α) = Σ_{i=1}^{n} p_i δ_i(α)   with   δ_i(α) = 1 if x_i > α, and 0 otherwise,

and G_x^{(−1)}(u) = inf{y : G_x(y) ≤ u} returns the minimum performance y such
that the probability of scenarios whose performance level is greater than y is
lower than or equal to u. First, we observe that the following relation holds
between G_x^{(−1)} and F_x^{(−1)}.
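For a discrete outcome distribution, both inverses can be computed in a single pass over the sorted outcomes. A minimal sketch (helper names are ours; ties at breakpoints are resolved with a small tolerance) assuming finitely many scenarios:

```python
def F_inv(x, p, u):
    """F_x^{(-1)}(u) = inf{y : F_x(y) >= u}: smallest outcome whose cumulative
    probability reaches u."""
    acc = 0.0
    for y, pi in sorted(zip(x, p)):
        acc += pi
        if acc >= u - 1e-12:
            return y
    return max(x)

def G_inv(x, p, u):
    """G_x^{(-1)}(u) = inf{y : G_x(y) <= u}, where G_x(y) = P(X > y): smallest
    outcome at which the strict tail probability drops to u or below."""
    tail = 1.0
    for y, pi in sorted(zip(x, p)):
        tail -= pi  # P(X > y) once y is passed
        if tail <= u + 1e-12:
            return y
    return max(x)
```

Away from breakpoints these satisfy the announced relation G_x^{(−1)}(u) = F_x^{(−1)}(1 − u), e.g. G_inv([3, 9], [0.5, 0.5], 0.25) = 9 = F_inv([3, 9], [0.5, 0.5], 0.75).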
Proof. Let (·) be a permutation of scenarios such that x_(1) ≤ x_(2) ≤ . . . ≤ x_(n)
and π_i = Σ_{k=i}^{n} p_(k). Let E(x) = ∫_0^1 [G_{x⁺}^{(−1)}(u)ϕ′(u) − G_{x⁻}^{(−1)}(u)ψ′(u)] du. First,
we notice that G_{x⁺}^{(−1)}(u) = x⁺_(i) for all u ∈ [π_{i+1}, π_i], with π_{n+1} = 0. We have:

E(x) = Σ_{i=1}^{n} x⁺_(i) ∫_{π_{i+1}}^{π_i} ϕ′(u)du − Σ_{i=1}^{n} x⁻_(i) ∫_{π_{i+1}}^{π_i} ψ′(u)du
     = Σ_{i=1}^{n} x⁺_(i) [ϕ(Σ_{k=i}^{n} p_(k)) − ϕ(Σ_{k=i+1}^{n} p_(k))] − Σ_{i=1}^{n} x⁻_(i) [ψ(Σ_{k=i}^{n} p_(k)) − ψ(Σ_{k=i+1}^{n} p_(k))]
     = gϕ,ψ(x)
Then, the desired result can be obtained from another formulation of E(x):

E(x) = ∫_0^1 [G_{x⁺}^{(−1)}(u)ϕ′(u) − G_{x⁻}^{(−1)}(u)ψ′(u)] du
     = Σ_{i=1}^{t} [∫_{α_{i−1}}^{α_i} G_{x⁺}^{(−1)}(u)ϕ′(u)du − ∫_{β_{i−1}}^{β_i} G_{x⁻}^{(−1)}(u)ψ′(u)du]

We recall that ϕ′(u) = d⁺_i for all u ∈ [α_{i−1}, α_i] (and d⁺_{t+1} = 0 for convenience)
and ψ′(u) = d⁻_i for all u ∈ [β_{i−1}, β_i] (and d⁻_{t+1} = 0 for convenience). We have:

E(x) = Σ_{i=1}^{t} [d⁺_i ∫_{α_{i−1}}^{α_i} G_{x⁺}^{(−1)}(u)du − d⁻_i ∫_{β_{i−1}}^{β_i} G_{x⁻}^{(−1)}(u)du]
     = Σ_{i=1}^{t} [d⁺_i ∫_{α_{i−1}}^{α_i} F_{x⁺}^{(−1)}(1 − u)du − d⁻_i ∫_{β_{i−1}}^{β_i} G_{x⁻}^{(−1)}(u)du]   (see Prop. 4)
     = Σ_{i=1}^{t} [d⁺_i ∫_{1−α_i}^{1−α_{i−1}} F_{x⁺}^{(−1)}(v)dv − d⁻_i ∫_{β_{i−1}}^{β_i} G_{x⁻}^{(−1)}(u)du]   (with v = 1 − u)
     = Σ_{i=1}^{t} [d⁺_i (∫_0^{1−α_{i−1}} F_{x⁺}^{(−1)}(v)dv − ∫_0^{1−α_i} F_{x⁺}^{(−1)}(v)dv) − d⁻_i ∫_{β_{i−1}}^{β_i} G_{x⁻}^{(−1)}(u)du]
     = Σ_{i=1}^{t} [(d⁺_{i+1} − d⁺_i) ∫_0^{1−α_i} F_{x⁺}^{(−1)}(v)dv − d⁻_i (∫_0^{β_i} G_{x⁻}^{(−1)}(u)du − ∫_0^{β_{i−1}} G_{x⁻}^{(−1)}(u)du)]
     = Σ_{i=1}^{t} [(d⁺_{i+1} − d⁺_i) ∫_0^{1−α_i} F_{x⁺}^{(−1)}(v)dv − (d⁻_i − d⁻_{i+1}) ∫_0^{β_i} G_{x⁻}^{(−1)}(v)dv]
It remains to linearize integrals of the form ∫_0^{1−α_k} F_x^{(−1)}(v)dv and ∫_0^{α_k} G_x^{(−1)}(v)dv,
for a fixed x and k. The linearization of ∫_0^{p} F_x^{(−1)}(v)dv was first proposed in [13]
and is here extended to ∫_0^{p} G_x^{(−1)}(v)dv:
min Σ_{i=1}^{n} x_i m_i                    max Σ_{i=1}^{n} x_i m_i
s.t. Σ_{i=1}^{n} m_i = 1 − α_k             s.t. Σ_{i=1}^{n} m_i = α_k
     m_i ≤ p_i, i = 1, . . . , n                m_i ≤ p_i, i = 1, . . . , n
     m_i ≥ 0, i = 1, . . . , n                  m_i ≥ 0, i = 1, . . . , n
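The optima of these two LPs have a simple greedy form, which gives an easy cross-check of any LP implementation: the minimization fills probability mass from the smallest outcomes, the maximization from the largest. A sketch under that reading (function names are ours):

```python
def lower_integral(x, p, beta):
    """min sum x_i m_i s.t. sum m_i = beta, 0 <= m_i <= p_i
    = integral of F_x^{(-1)} over [0, beta]: take mass from smallest outcomes."""
    remaining, total = beta, 0.0
    for xi, pi in sorted(zip(x, p)):
        take = min(pi, remaining)
        total += xi * take
        remaining -= take
    return total

def upper_integral(x, p, alpha):
    """max sum x_i m_i s.t. sum m_i = alpha, 0 <= m_i <= p_i
    = integral of G_x^{(-1)} over [0, alpha]: take mass from largest outcomes."""
    remaining, total = alpha, 0.0
    for xi, pi in sorted(zip(x, p), reverse=True):
        take = min(pi, remaining)
        total += xi * take
        remaining -= take
    return total
```

For x = (3, 9) with equal probabilities, lower_integral(x, p, 0.75) = 3·0.5 + 9·0.25 = 3.75 and upper_integral(x, p, 0.25) = 9·0.25 = 2.25.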
(P2)  max Σ_{k=1}^{t} d̃⁺_k ((1 − α_k) × r⁺_k − Σ_{l=1}^{n} p_l b⁺_{lk}) − Σ_{k=1}^{t} d̃⁻_k (α_k × r⁻_k + Σ_{l=1}^{n} p_l b⁻_{lk})
      s.t.  r⁺_k − b⁺_{ik} ≤ x⁺_i           i = 1, . . . , n, k = 1, . . . , t
            r⁻_k + b⁻_{ik} ≥ x⁻_i           i = 1, . . . , n, k = 1, . . . , t
            x_i = x⁺_i − x⁻_i               i = 1, . . . , n
            0 ≤ x⁺_i ≤ z_i × M              i = 1, . . . , n
            0 ≤ x⁻_i ≤ (1 − z_i) × M        i = 1, . . . , n
            x ∈ X
            x⁺_i, x⁻_i, b⁺_{ik}, b⁻_{ik} ≥ 0,   i = 1, . . . , n, k = 1, . . . , t
            z_i ∈ {0, 1},                   i = 1, . . . , n

with d̃⁺_k = d⁺_{k+1} − d⁺_k and d̃⁻_k = d⁻_k − d⁻_{k+1} for all k = 1, . . . , t. The integer
variables z_i play the same role as in P1. Table 2 gives the results obtained for the CPT-optimal knapsack problem.
Functions ϕ and ψ are chosen piecewise linear with n breakpoints; these functions
are randomly drawn to satisfy the conditions of Proposition 1. Average times
given in Table 2 are computed over 20 runs, with a timeout set to 1200 s.
m n = 3 n = 5 n = 7 n = 10
100 0.01 0.03 0.07 0.12
500 0.04 0.13 0.19 28.22
750 0.03 0.18 2.76 107.36
1000 0.04 0.27 9.027 191.84
The linearization presented here for the case where u(z) = z for all z can
easily be extended to deal with piecewise linear concave utility functions u for
gains and for losses (admitting a bounded number of pieces). In this case, the
utility function can indeed be defined on gains as the minimum of a finite set of
linear utilities which enables a linear reformulation (the same holds for losses).
Note also that having a concave utility over gains and over losses is consistent
with the risk-averse attitude under consideration in the paper.
6 Conclusion
CPT is a well known model in the context of decision making under risk used
to overcome some descriptive limitations of both EU and RDU. In this paper,
we have proposed two mixed integer programs for the search of CPT-optimal
solutions on implicit sets of alternatives. We tested these computational mod-
els on randomly generated instances of the Knapsack problem involving up to
1000 objects and 10 scenarios. The second MIP formulation proposed performs
significantly better due to the additional restriction to piecewise linear utility
functions.
A natural extension of this work could be to address the exponential aspect of
our first formulation with a Branch&Price approach. Another natural extension
of this work could be to propose a similar approach for a general bipolar Choquet
integral where the capacity is not necessarily defined as a weighted probability.
It can easily be shown that the first linearization proposed in this paper still
applies to bipolar Choquet integrals.
References
1. Chateauneuf, A.: On the use of capacities in modeling uncertainty aversion and
risk aversion. J. Math. Econ. 20(4), 343–369 (1991)
2. Grabisch, M., Marichal, J.L., Mesiar, R., Pap, E.: Aggregation Functions, vol. 127.
Cambridge University Press, Cambridge (2009)
3. He, X.D., Zhou, X.Y.: Portfolio choice under cumulative prospect theory: an ana-
lytical treatment. Manag. Sci. 57(2), 315–331 (2011)
4. Hines, G., Larson, K.: Preference elicitation for risky prospects. In: Proceedings of
the 9th International Conference on Autonomous Agents and Multiagent Systems:
volume 1, vol. 1, pp. 889–896. International Foundation for Autonomous Agents
and Multiagent Systems (2010)
5. Jaffray, J., Nielsen, T.: An operational approach to rational decision making based
on rank dependent utility. Eur. J. Oper. Res. 169(1), 226–246 (2006)
6. Jeantet, G., Spanjaard, O.: Computing rank dependent utility in graphical models
for sequential decision problems. Artif. Intell. 175(7–8), 1366–1389 (2011)
7. Kahneman, D., Tversky, A.: Prospect theory: an analysis of decision under risk.
Econometrica 47(2), 263–292 (1979)
8. Labreuche, C., Grabisch, M.: Generalized Choquet-like aggregation functions for
handling bipolar scales. Eur. J. Oper. Res. 172(3), 931–955 (2006)
9. Li, X., Wang, W., Xu, C., Li, Z., Wang, B.: Multi-objective optimization of urban
bus network using cumulative prospect theory. J. Syst. Sci. Complex. 28(3), 661–
678 (2015)
10. Mansini, R., Ogryczak, W., Speranza, M.G.: Twenty years of linear programming
based portfolio optimization. Eur. J. Oper. Res. 234(2), 518–535 (2014)
11. Mansini, R., Ogryczak, W., Speranza, M.G.: Linear and Mixed Integer Program-
ming for Portfolio Optimization. EATOR. Springer, Cham (2015). https://doi.org/
10.1007/978-3-319-18482-1
12. Martin, H., Perny, P.: BiOWA for preference aggregation with bipolar scales: appli-
cation to fair optimization in combinatorial domains. In: IJCAI (2019)
13. Ogryczak, W., Śliwiński, T.: On efficient WOWA optimization for decision support
under risk. Int. J. Approximate Reasoning 50(6), 915–928 (2009)
14. Perny, P., Spanjaard, O., Storme, L.X.: State space search for risk-averse agents.
In: IJCAI, pp. 2353–2358 (2007)
15. Perny, P., Viappiani, P., Boukhatem, A.: Incremental preference elicitation for
decision making under risk with the rank-dependent utility model. In: Proceedings
of Uncertainty in Artificial Intelligence (2016)
16. Prashanth, L., Jie, C., Fu, M., Marcus, S., Szepesvári, C.: Cumulative prospect
theory meets reinforcement learning: prediction and control. In: International Con-
ference on Machine Learning, pp. 1406–1415 (2016)
17. Quiggin, J.: Generalized Expected Utility Theory - The Rank-dependent Model.
Kluwer Academic Publisher, Dordrecht (1993)
18. Rothschild, M., Stiglitz, J.E.: Increasing risk: I. A definition. J. Econ. Theory 2(3),
225–243 (1970)
19. Savage, L.J.: The Foundations of Statistics. J. Wiley and Sons, New-York (1954)
20. Schmeidler, D.: Integral representation without additivity. Proc. Am. Math. Soc.
97(2), 255–261 (1986)
21. Schmidt, U., Zank, H.: Risk aversion in cumulative prospect theory. Manag. Sci.
54(1), 208–216 (2008)
22. Shapley, L.: Cores of convex games. Int. J. Game Theory 1, 11–22 (1971)
23. Tversky, A., Kahneman, D.: Advances in prospect theory: cumulative representa-
tion of uncertainty. J. Risk Uncertainty 5(4), 297–323 (1992)
24. Von Neumann, J., Morgenstern, O.: Theory of Games and Economic Behavior,
2nd edn. Princeton University Press, Princeton (1947)
25. Yaari, M.: The dual theory of choice under risk. Econometrica 55, 95–115 (1987)
On a New Evidential C-Means Algorithm
with Instance-Level Constraints
1 Introduction
2 Background
subject to

Σ_{A_k ⊆ Ω, A_k ≠ ∅} m_{ik} + m_{i∅} = 1   and   m_{ik} ≥ 0   ∀i ∈ {1, . . . , n}.   (4)
The LPECM Semi-clustering Algorithm 69

Several evidential C-Means based algorithms have already been proposed [1–
4,8,13] to deal with background knowledge. For each of them, constraints are
expressed in the framework of belief functions, and a term penalizing constraint
violations is incorporated into the objective function of the ECM algorithm.
In [2,3], labeled data constraints are introduced in the algorithms, i.e. the
expert can express uncertainty about the label of an object by assigning it to
a subset. The objective functions of the algorithms are written in such a way that any
mass function which partially or fully respects a constraint on a specific subset
gives a high weighted plausibility to a singleton included in the subset.
T_{ij} = T_i(A_j) = Σ_{A_j ∩ A_l ≠ ∅} (|A_j ∩ A_l|^r / |A_l|^r) · m_{il},   ∀i ∈ {1 . . . n}, A_l ⊆ Ω,   (6)

where r ≥ 0 is a fixed parameter. Notice that if r = 0, then |A_j ∩ A_l|^r / |A_l|^r = 1, which
implies that T_{ij} is identical to the plausibility pl_{ij}.
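As a sketch of this measure for a singleton A_j = {ω} (the dict-based encoding of a mass function is our own assumption; Eq. (6) also covers non-singleton A_j):

```python
def weighted_plausibility(m, omega, r):
    """T_i({omega}) = sum over focal sets A_l containing omega of
    (|{omega} ∩ A_l| / |A_l|)^r * m(A_l); for r = 0 this is pl_i({omega})."""
    total = 0.0
    for A_l, mass in m.items():  # m: dict mapping frozenset -> mass
        if omega in A_l:
            total += (1.0 / len(A_l)) ** r * mass
    return total
```

With m({a}) = 0.5, m({a, b}) = 0.3, m({b}) = 0.2 we get T = pl = 0.8 for ω = a at r = 0, and T = 0.65 at r = 1, since non-singleton focal sets are discounted.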
In [4], the authors assume that pairwise constraints (i.e. must-link and cannot-
link constraints) are available. A plausibility for two objects to belong or not to the same class
is then defined. This plausibility makes it possible to add a penalty term taking high
values when there is a high plausibility that two objects are (respectively, are
not) in the same cluster although they have a must-link constraint (respectively,
a cannot-link constraint).
pl_{i×j}(θ) = Σ_{{A_l × A_k ⊆ Ω² | (A_l × A_k) ∩ θ ≠ ∅}} m_{i×j}(A_l × A_k)
            = Σ_{A_l ∩ A_k ≠ ∅} m_i(A_l) m_j(A_k),   (7)

where θ denotes the event that objects x_i and x_j belong to the same class,
which corresponds to the subset {(ω_1, ω_1), (ω_2, ω_2), . . . , (ω_k, ω_k)} within Ω², whereas
θ̄ denotes the event that objects x_i and x_j do not belong to the same class,
which corresponds to its complement.
JLP ECM (M, V, S) = ξJECM (M, V, S) + γJM (M) + ηJC (M) + δJL (M), (9)
70 J. Xie and V. Antoine
where b_{ik} denotes whether the ith instance is constrained to the subset A_k or not:

b_{ik} = 1 if x_i is constrained to subset A_k, and 0 otherwise.   (13)
It should be emphasized that in this study, unlike [2], each labeled object
is constrained to only one subset. Indeed, this makes the set of constraints
retrieved from the background knowledge more coherent. Constraints are gathered
in three different sets, such that M corresponds to the set of must-link con-
straints, C to the set of cannot-link constraints, and L denotes the set of labeled data
constraints. The J_M function returns the sum of the plausibilities that must-
link constrained objects belong to the same class. Similarly, J_C returns the
sum of the plausibilities that cannot-link constrained objects are not in the same
class. The J_L term calculates, for each labeled object, a weighted plausibility to
belong to its label.
3.2 Optimization
The objective function is minimized as in the ECM algorithm, i.e. by carrying out
an iterative scheme where first V and S are fixed to optimize M, then M and
S are fixed to optimize V, and finally M and V are fixed to optimize S.
Centroids Optimization. It can be observed from (9) that the three penalty
terms included in the objective function of the LPECM algorithm do not depend
on the cluster centroids. Hence, the update scheme of V is identical to the ECM
algorithm [14].
where n_M denotes the number of must-link constraints, F_M is a vector of size
2^c, and Δ_M = (δ^M_{kl}) corresponds to a (2^c × 2^c) matrix such that:

F^T_M = [−1, 0, . . . , 0] (of size 2^c)   and   δ^M_{kl} =  1 if A_k = ∅ or A_l = ∅,
                                                          −1 if A_k = A_l and |A_k| = |A_l| = 1,
                                                           0 otherwise.   (17)
The penalty term associated with cannot-link constraints is:

J_C(M) = Σ_{(x_i, x_j) ∈ C} m_i^T Δ_C m_j,   (18)

where Δ_C = (δ^C_{kl}) is a (2^c × 2^c) matrix such that:

δ^C_{kl} = 1 if A_k ∩ A_l ≠ ∅, and 0 otherwise.   (19)
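Both penalty matrices are defined over the power set of Ω, so they are easy to materialize for small c. A sketch of Δ_M from (17) and Δ_C from (19) (the subset ordering and function names are our choices):

```python
from itertools import combinations

def powerset(c):
    """All 2^c subsets of {0, ..., c-1}, empty set first."""
    elems = range(c)
    return [frozenset(s) for k in range(c + 1) for s in combinations(elems, k)]

def delta_matrices(c):
    subsets = powerset(c)
    m = len(subsets)
    dM = [[0] * m for _ in range(m)]
    dC = [[0] * m for _ in range(m)]
    for k, Ak in enumerate(subsets):
        for l, Al in enumerate(subsets):
            if not Ak or not Al:
                dM[k][l] = 1              # empty set involved (Eq. 17)
            elif Ak == Al and len(Ak) == 1:
                dM[k][l] = -1             # same singleton (Eq. 17)
            if Ak & Al:
                dC[k][l] = 1              # overlapping subsets (Eq. 19)
    return subsets, dM, dC
```

For c = 2 the ordering is [∅, {0}, {1}, {0,1}]; e.g. δ^M is −1 for the pair ({0},{0}) and δ^C is 0 for the disjoint pair ({0},{1}).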
Finally, the penalty term for the labeled data constraints is written as follows:

J_L(M) = n_L − Σ_{i=1}^{n} F^T_L m_i,   (20)
where the expression (x_i, A_k) ∈ L means that the labeled data constraint on object
i is the subset A_k. The function v_{ikl} ∈ {0, 1} equals 1 for subsets A_l that have
a non-empty intersection with A_k, knowing the constraint x_i ∈ A_k.
Now, let us define m^T = [m^T_1, . . . , m^T_n], the vector of size n2^c containing the
masses for each object and each subset, H a matrix of size (n2^c × n2^c), and F a
vector of size n2^c such that:

H = ( Φ^1    Δ^{12}  · · ·  Δ^{1n}
      Δ^{21} Φ^2     · · ·
      ...            . . .
      Δ^{n1} · · ·          Φ^n ),   where Δ^{ij} = Δ_M if (x_i, x_j) ∈ M,
                                                  Δ_C if (x_i, x_j) ∈ C,
                                                  0 otherwise.   (24)

F^T = [F_1 · · · F_i · · · F_n],   where F_i = t_i F_M − b_i F_L,   (25)

t_i = 1 if x_i ∈ M and 0 otherwise;   b_i = 1 if x_i ∈ L and 0 otherwise.   (26)
Finally, the objective function (9) can be rewritten as a quadratic function of m using H and F.
4 Experiments
4.1 Experimental Protocols
Performances and time consumption of the LPECM algorithm have been tested
on a toy data set and several classical data sets from UCI Machine Learning
Repository [9]. For the Letters data set, we kept only the three letters {I,J,L}
as done in [6]. As in [14], fixed parameters associated to the ECM algorithm
Fig. 1. Hard credal partition obtained on the Toy data set with the ECM algorithm
Fig. 2. Hard credal partition obtained on the Toy data set with the LPECM algorithm

Figure 2 presents the hard credal partition obtained. The magenta dashed lines
describe cannot-link constraints, the light green solid lines represent must-link
constraints, and the circled points correspond to the labeled data constraints.
Figure 3 illustrates, for this execution of the LPECM algorithm, the mass
distribution over singletons with respect to the point numbers, allowing a more
distinct view of the mass allocations. Table 1 displays the accuracy as well as
the time consumption for the ECM algorithm and the LPECM algorithm when, first,
only the cannot-link constraints are incorporated, second, when the cannot-link
and must-link constraints are introduced (Cannot-Must-Link line in Table 1),
and finally when all constraints are added (Cannot-Must-Labeled line in Table 1).
Our results demonstrate that the combination of pairwise constraints and labeled
data constraints improves the performance of the semi-clustering algorithm with
tolerable time consumption. As expected, the more constraints are added, the
better the performance.
The LPECM algorithm has been tested on three known data sets from the
UCI Machine Learning Repository namely Iris, Glass, and Wdbc and a derived
Letters data set from UCI. Table 2 indicates for each data set its number of
objects, its number of attributes and its number of classes.
For each data set, we randomly created 5%, 8%, and 10% of each type of
constraint out of the whole set of objects, leading to a total of 15%, 24%, and 30%
of constraints. As an example, Fig. 4 shows the hard credal partition obtained
with the Iris data set after executing the LPECM algorithm with a Mahalanobis
distance and 24% of constraints in total. As can be observed, all the constrained
objects are clustered with certainty in a singleton. Ellipses represent the covari-
ance matrices obtained for each cluster.
Table 1. Accuracy (ARI) and time consumption on the Toy data set.

                      ARI    Time (s)
ECM                   0.60   0.07
LPECM-Cannot          0.68   0.41
LPECM-C-Must          0.85   0.29
LPECM-C-M-labeled     1.00   0.22

Fig. 3. Mass curve obtained on the Toy data set with the LPECM algorithm
Tables 3 and 4 show, for all data sets, the accuracy results obtained with a
Euclidean and a Mahalanobis distance respectively, for the different percentages
of constraints. Mean and standard deviation are computed over 20 simulations.
As can be observed, incorporating constraints leads most of the time to a
significant improvement of the clustering solution. Using a Mahalanobis distance
in particular helps to achieve better accuracy than using a Euclidean distance.
Indeed, the Mahalanobis distance corresponds to an adaptive metric, giving
more freedom than a Euclidean distance to respect the constraints while
finding a coherent data structure.
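The adaptive nature of the Mahalanobis metric can be illustrated with a small sketch (illustrative code, not tied to the LPECM implementation):

```python
import numpy as np

def mahalanobis_sq(x, center, cov):
    """Squared Mahalanobis distance of x to a cluster center under the
    cluster's covariance matrix; the identity covariance recovers the
    squared Euclidean distance."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(diff @ np.linalg.inv(cov) @ diff)

center = np.zeros(2)
x = np.array([2.0, 0.0])
# A covariance elongated along the first axis shrinks distances along it,
# so the metric adapts to the cluster's shape:
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
mahalanobis_sq(x, center, cov)          # 1.0
mahalanobis_sq(x, center, np.eye(2))    # 4.0 (squared Euclidean)
```

With one such covariance matrix per cluster (as in the Gustafson-Kessel approach [11]), each cluster stretches its own metric, which is the extra freedom mentioned above.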
Regarding time consumption, as can be observed from Fig. 5: (1) adding
constraints yields higher computation times than using no constraints; (2) most
of the time, the more constraints are added, the less time is needed to finish
the computation.
76 J. Xie and V. Antoine
Table 3. ARI (mean ± standard deviation) with a Euclidean distance.

              ECM           LPECM 5%      LPECM 8%      LPECM 10%
    Iris      0.59 ± 0.00   0.70 ± 0.01   0.71 ± 0.00   0.70 ± 0.01
    Letters   0.04 ± 0.01   0.09 ± 0.03   0.09 ± 0.04   0.10 ± 0.02
    Wdbc      0.67 ± 0.00   0.71 ± 0.00   0.71 ± 0.01   0.71 ± 0.00
    Glass     0.59 ± 0.07   0.60 ± 0.07   0.62 ± 0.06   0.65 ± 0.08
Table 4. ARI (mean ± standard deviation) with a Mahalanobis distance.

              ECM           LPECM 5%      LPECM 8%      LPECM 10%
    Iris      0.67 ± 0.01   0.71 ± 0.05   0.82 ± 0.01   0.83 ± 0.04
    Letters   0.08 ± 0.01   0.45 ± 0.03   0.47 ± 0.02   0.60 ± 0.05
    Wdbc      0.73 ± 0.02   0.74 ± 0.03   0.75 ± 0.02   0.77 ± 0.05
    Glass     0.56 ± 0.03   0.60 ± 0.03   0.65 ± 0.02   0.65 ± 0.03
Fig. 5. Time consumption (CPU) of the LPECM algorithm with Euclidean distance
5 Conclusion
the uncertainties about the class memberships of the objects. Experiments show
that the LPECM algorithm does obtain better accuracy with the introduction
of constraints, particularly with a Mahalanobis distance. Further investigations
have to be performed to fine-tune the parameters and to study the influence of
the constraints on the clustering solution. The LPECM algorithm could also be
applied to a real-world application to demonstrate the benefit of gathering various
types of constraints. In this framework, active learning schemes, which automatically
retrieve a few informative constraints with the help of an expert, are interesting
to study. Finally, in order to scale up and speed up the LPECM algorithm, a new
minimization process could be developed by relaxing some optimization constraints.
References
1. Antoine, V., Quost, B., Masson, M.H., Denœux, T.: CEVCLUS: evidential clus-
tering with instance-level constraints for relational data. Soft Comput. - Fusion
Found. Methodol. Appl. 18(7), 1321–1335 (2014)
2. Antoine, V., Gravouil, K., Labroche, N.: On evidential clustering with partial
supervision. In: Destercke, S., Denoeux, T., Cuzzolin, F., Martin, A. (eds.) BELIEF
2018. LNCS (LNAI), vol. 11069, pp. 14–21. Springer, Cham (2018). https://doi.
org/10.1007/978-3-319-99383-6 3
3. Antoine, V., Labroche, N., Vu, V.V.: Evidential seed-based semi-supervised clus-
tering. In: International Symposium on Soft Computing & Intelligent Systems,
Kitakyushu, Japan (2014)
4. Antoine, V., Quost, B., Masson, M.H., Denœux, T.: CECM: constrained evidential
C-means algorithm. Comput. Stat. Data Anal. 56(4), 894–914 (2012)
5. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy C-means clustering algorithm.
Comput. Geosci. 10(2–3), 191–203 (1984)
6. Bilenko, M., Basu, S., Mooney, R.: Integrating constraints and metric learning
in semi-supervised clustering. In: Proceedings of the Twenty-First International
Conference on Machine Learning. ACM New York, NY, USA (2004)
7. Coleman, T.F., Li, Y.: A reflective Newton method for minimizing a quadratic
function subject to bounds on some of the variables. SIAM J. Optim. 6(4), 1040–
1058 (1996)
8. Denœux, T.: Evidential clustering of large dissimilarity data. Knowl.-Based Syst.
106(C), 179–195 (2016)
9. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.
edu/ml
10. Grira, N., Crucianu, M., Boujemaa, N.: Active semi-supervised fuzzy clustering.
Pattern Recogn. 41(5), 1834–1844 (2008)
11. Gustafson, D.E., Kessel, W.C.: Fuzzy clustering with a fuzzy covariance matrix.
In: IEEE Conference on Decision and Control Including the Symposium on Adaptive
Processes, New Orleans, LA (1979)
12. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
13. Li, F., Li, S., Denoeux, T.: k-CEVCLUS: constrained evidential clustering of large
dissimilarity data. Knowl.-Based Syst. 142, 29–44 (2018)
14. Masson, M.H., Denœux, T.: ECM: an evidential version of the fuzzy C-means
algorithm. Pattern Recogn. 41(4), 1384–1397 (2008)
15. Shafer, G.: A Mathematical Theory of Evidence, vol. 42. Princeton University
Press, Princeton (1976)
16. Smets, P., Kennes, R.: The transferable belief model. Artif. Intell. 66, 191–234
(1994)
17. Vu, V.V., Do, H.Q., Dang, V.T., Do, N.T.: An efficient density-based clustering
with side information and active learning: a case study for facial expression recog-
nition task. Intell. Data Anal. 23(1), 227–240 (2019)
18. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means cluster-
ing with background knowledge. In: Proceedings of the Eighteenth International
Conference on Machine Learning (ICML), Williamstown, MA, USA, vol. 1, pp.
577–584 (2001)
19. Zhang, H., Lu, J.: Semi-supervised fuzzy clustering: a kernel-based approach.
Knowl.-Based Syst. 22(6), 477–481 (2009)
Hybrid Reasoning on a Bipolar
Argumentation Framework
1 Introduction
An argumentation framework (AF) is a powerful tool in the context of incon-
sistent knowledge [15,21]. There are several possible application areas of AFs,
including law [4,20]. To date, research on applications has focused principally
on AF updating to yield an acceptable set of facts when a new argument is
presented, and on strategies to win the argumentation when all of the dialog paths
are known. However, in real legal cases, an AF representing a law in its entirety
is usually incompletely grasped at the initial stage. Thus, it is more realistic to
construct the AF incrementally; recognized facts are added in combination with
AF reasoning.
For example, consider a case in which a person leased her house to another
person, and the lessee then sub-leased a room to his sister; the lessor now wants
to cancel the contract. (This is a simplified version of the case discussed in
Satoh et al. [23].) The lessor decides to prosecute the lessee. The lessor knows
that there was a lease, that she handed over the house to the lessee, and that
the room was handed over by the lessee to the sublessee. However, if the lessor
is not familiar with the law, she does not know what law might be applicable to
her circumstances or what additional facts should be proven to make it effective.
In addition, laws commonly include exceptions; that is, a law is effective if certain
conditions are satisfied provided there is no exception.
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 79–92, 2019.
https://doi.org/10.1007/978-3-030-35514-2_7
80 T. Kawasaki et al.
A BAF can be regarded as a directed graph where the nodes and edges
correspond to the arguments and the relations, respectively. Below, we represent
a BAF graphically; a simple solid arrow indicates a support relation, and a
straight arrow with a cutting edge indicates an attack relation. The dashed
rectangle shows a set of arguments supporting a certain argument; it is sometimes
omitted if the supporting set is a singleton.
For a BAF ⟨AR, ATT, SUP⟩, let → be a binary relation over AR as follows:
We define semantics for the BAF based on labeling [9]. Usually, labeling is
a function from a set of arguments to {in, out, undec}, but undec is unneces-
sary here because we consider only acyclic BAFs. An argument labeled in is
considered an acceptable argument.
Definition 5 (complete labeling). For a BAF baf = ⟨AR, ATT, SUP⟩, a label-
ing L is complete iff the following conditions are satisfied: for any argument
A ∈ AR, (i) L(A) = in if A is a leaf or (∀B ∈ AR; (B, A) ∈ ATT ⇒ L(B) =
out) ∧ (∃𝒜 ∈ 2^AR; (𝒜, A) ∈ SUP ∧ L(𝒜) = in), (ii) L(A) = out otherwise.
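Since the BAF is acyclic, Definition 5 yields a direct recursive procedure. A minimal sketch (the encoding of ATT and SUP as Python collections is our own assumption):

```python
from functools import lru_cache

def complete_labeling(arguments, attacks, supports):
    """Complete labeling of an acyclic BAF <AR, ATT, SUP>.
    `attacks` is a set of pairs (B, A); `supports` is a list of pairs
    (S, A) where S is a frozenset of supporting arguments."""
    attackers = {a: [b for (b, t) in attacks if t == a] for a in arguments}
    supp_sets = {a: [s for (s, t) in supports if t == a] for a in arguments}

    @lru_cache(maxsize=None)
    def label(a):
        if not attackers[a] and not supp_sets[a]:
            return "in"  # a leaf is labeled in
        attackers_out = all(label(b) == "out" for b in attackers[a])
        supported = any(all(label(c) == "in" for c in s) for s in supp_sets[a])
        return "in" if attackers_out and supported else "out"

    return {a: label(a) for a in arguments}
```

For instance, with b supporting a and c attacking a (b and c being leaves), a is labeled out because its attacker c is in; removing the attack makes a labeled in.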
The resulting set of conclusions is the set of arguments that are acceptable,
and no more conclusions can be drawn from the currently known facts.
Example 4 (Cont’d). Let Ex be {ex(a1), ex(b1), ex(c1), ex(d1)}, and ubaf be a
BAF in Fig. 3. Then, the BAF can be constructed using the process shown in
Fig. 4(a) and (b); finally, baf 1 is obtained, and Concl (Ex) = {a, e} is derived
as the set of conclusions. The complete labeling of the BAF baf 1 is shown in
Fig. 4(c).
Example 5 (Cont’d). For baf 1, we find the differential support pair ({f, g}, l),
because {e, f, g} ∩ AR = {e} ≠ ∅ and ({e, f, g}, l) ∈ USUP (Fig. 5).
For a BAF baf = AR, ATT , SUP and an argument A ∈ AR, we detect a set
of facts that satisfies L(A) = in. For an argument A, we check the conditions for
labeling of the arguments that attack A and the sets of arguments that support
A. This is achieved by repeatedly applying the following two algorithms: PC (A)
and NC (A), which are shown in Algorithms 2 and 3, respectively. Note that
there is no argument that both lacks support and is attacked.
Then, discovery of the required facts proceeds using the algorithm shown in
Algorithm 4.
As a result, a set of ex/ab arguments is generated. An existence argument
ex(A) shows that the corresponding fact is required if L(A) = in is to hold, whereas
an absence argument ab(A) shows that the corresponding evidence is an obstacle
to proving L(A) = in.
Example 7 (Cont’d). For a set of required facts {ex(f ), ex(h1), ab(j1)}, assume
that a user has confirmed the existence of f and h1, and the absence of j1.
Then, we construct a new BAF baf 2 in a bottom-up manner from this set. Part
of the labeling of baf 2 is shown in Fig. 10. Finally, we obtain a new conclusion
set Concl = {a, i, l}.
The hybrid algorithm is nondeterministic at several steps and there are mul-
tiple possible solutions.
4.5 Correctness
Definition 9. For the acyclic universal BAF ubaf , the height of an argument
A is defined as follows:
It is easy to show that the heights of arguments are definable when ubaf is
acyclic.
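Although the precise height definition is elided above, a standard choice on acyclic graphs makes the claim concrete: leaves have height 0, and every other argument sits one level above its highest attacker or supporter. This reconstruction is our own assumption, sketched below:

```python
def heights(arguments, attacks, supports):
    """Heights in an acyclic BAF: 0 for leaves, otherwise one more than the
    maximal height among attackers and members of supporting sets.
    Acyclicity guarantees the recursion terminates, which is why the
    heights are definable when the universal BAF is acyclic."""
    parents = {a: set() for a in arguments}
    for b, t in attacks:
        parents[t].add(b)
    for s, t in supports:
        parents[t] |= set(s)

    memo = {}
    def h(a):
        if a not in memo:
            memo[a] = 0 if not parents[a] else 1 + max(h(p) for p in parents[a])
        return memo[a]

    return {a: h(a) for a in arguments}
```

On a cyclic graph the recursion would never bottom out, which mirrors the remark that heights are definable only in the acyclic case.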
Here, we prove two specifications, one for a BUP, and the other for a TDN.
For a BUP, the built BAF includes arguments pertaining to the evidential facts
that the user recognizes. Notably, the acceptability of such arguments is the
same as that of the universal BAF.
Theorem 1. Assume that ubaf is acyclic. Let baf be built by BUP from Ex.
When UEx is defined as {ex(A) | ex(A) ∈ Ex} ∪ {ab(A) | A is a leaf of ubaf ∧
ex(A) ∉ Ex}, and LU is a complete labeling for ⟨UAR ∪ UEx, UATT ∪
{(ab(A), A) | ab(A) ∈ UEx}, USUP ∪ {({ex(A)}, A) | ex(A) ∈ UEx}⟩, then for any
argument A ∈ UAR, either A ∈ AR ∧ L(A) = LU(A), or A ∉ AR ∧ LU(A) = out.
Fig. 10. The BAF obtained after the second round of bottom-up reasoning: baf 2 .
then A ∈ AR but L(A) = out, and otherwise A ∉ AR. Both cases satisfy the
proposition.
Assume that A is not a leaf. If LU(A) = in, then there is some support
(𝒜, A) ∈ USUP such that LU(𝒜) = in, and for any attack (B, A) ∈ UATT,
LU(B) = out. From the induction hypothesis, for any C ∈ 𝒜, C ∈ AR and
L(C) = in; and for any attacker B of A, L(B) = out or B ∉ AR. The definition
of BUP immediately shows that A ∈ AR, and therefore L(A) = in = LU(A).
Assume that LU(A) = out. If A ∉ AR, the proposition is satisfied. Otherwise,
A ∈ AR, and from the definition of BUP, there is some support (𝒜, A) such
that 𝒜 ⊆ AR, so A is not a leaf of baf. From LU(A) = out, there is some attack
(B, A) ∈ UATT such that LU(B) = in, or for every support (𝒜, A) ∈ USUP,
LU(𝒜) = out (i.e., there exists C ∈ 𝒜 such that LU(C) = out). From the
induction hypothesis, there is some attack (B, A) ∈ UATT such that L(B) =
in, or for every support (𝒜, A) ∈ USUP, there exists C ∈ 𝒜 such that C ∉ AR or
L(C) = out. In the former case, (B, A) ∈ ATT, and therefore L(A) = out. In
the latter case, for any (𝒜, A) ∈ SUP, L(𝒜) = out, and therefore L(A) = out.
For a TDN, the facts found by PC (A) make the argument A acceptable.
Proof. We prove this by induction on the height of A. For the former case, assume
that Ex ∪ PC(A) is consistent. When A is a leaf (thus of height 0), PC(A) =
{ex(A)} (i.e., baf includes A and ex(A)), and therefore L(A) = in. Other-
wise, for some 𝒜 satisfying (𝒜, A) ∈ USUP, PC(A) = ⋃_{(B,A)∈UATT} NC(B) ∪
⋃_{C∈𝒜} PC(C). For each B such that (B, A) ∈ UATT, NC(B) ⊆ PC(A), and
Ex ∪ NC(B) is thus consistent. As the height of B is less than that of A, from
the induction hypothesis, B ∉ AR, or B ∈ AR but L(B) = out. In a similar
fashion, for each C ∈ 𝒜, C ∈ AR and L(C) = in, and therefore L(A) = in.
From the definitions of BUP and complete labeling, A ∈ AR and L(A) = in.
The proof for the case of NC(A) is the same.
5 Related Works
Support relations play important roles in our approach. Such relations can be
interpreted in several ways [12]. Cayrol et al. defined several types of indirect
attacks by combining attacks with supports, and defined several types of exten-
sions in BAF [10]. Boella et al. revised the semantics by introducing different
meta-arguments and meta-supports [6]. Nouioua et al. developed a BAF that
considers a support relation to be a “necessity” relation [18]. Čyras et al. consid-
ered that several semantics of a BAF could be captured using assumption-based
argumentation [13]. Brewka et al. developed an abstract dialectical framework
(ADF) as a generalization of Dung’s AF [7,8]; a BAF was represented using an
ADF. These works focus on acceptance of arguments. Here, we define a support
relation and develop semantics that can represent a law.
Several authors have studied changes in AFs when arguments are added or
deleted [14]. Cayrol et al. investigated changes in acceptable arguments when an
argument was added to a current AF [11]. Baumann et al. developed a strat-
egy for AF diagnosis and repair, and explored the computational complexity
thereof [3]. Most research has focused on semantics, and changes in acceptable
sets when arguments are added/deleted. The computational complexity associ-
ated with AF updating via argument addition/deletion is a significant issue [1].
Here, we propose reasoning based on an incrementally constructed BAF,
potentially broadening the applications of such frameworks. Complexity is not
a concern here: we do not need to consider all possibilities, since solutions can be
derived from a given universal BAF. Moreover, efficient computational methods
can be used when executing our algorithm.
Hybrid Reasoning on a Bipolar Argumentation Framework 91
6 Conclusion
References
1. Alfano, G., Greco, S., Parisi, F.: A meta-argumentation approach for the efficient
computation of stable and preferred extensions in dynamic bipolar argumentation
frameworks. Intelligenza Artificiale 12(2), 193–211 (2018)
2. Amgoud, L., Cayrol, C., Lagasquie-Schiex, M.C., Livet, P.: On bipolarity in argu-
mentation frameworks. Int. J. Intell. Syst. 23(10), 1062–1093 (2008)
Active Preference Elicitation by Bayesian Updating
on Optimality Polyhedra
1 Introduction
specify the parameters of the aggregation function. Given a decision model, exact
choices can often be derived from a partial specification of weighting parame-
ters. Dealing with partially specified parameters requires the development of
solution methods that can determine an optimal or near optimal solution with
such partial information. This is the aim of incremental preference elicitation,
that consists on interleaving the elicitation with the exploration of the set of
alternatives to adapt the elicitation process to the considered instance and to
the DM’s answers. Thus, the elicitation effort is focused on the useful part of
the preference information. The purpose of incremental elicitation is not to learn
precisely the values of the parameters of the aggregation function but to specify
them sufficiently to be able to determine a relevant recommendation.
Incremental preference elicitation is the subject of several contributions in
various contexts, see e.g. [3,4,7,16]. Starting from the entire set of possible
parameter values, incremental elicitation methods are based on the reduction
of the uncertainty about the parameter values by iteratively asking the DM to
provide new preference information (e.g., with pairwise comparisons between
alternatives). Any new information is translated into a hard constraint that
reduces the parameter space. In this way, preference data are collected
until a necessarily optimal or near optimal solution can be determined, i.e., a
solution that is optimal or near optimal for all the possible parameter values.
These methods are very efficient because they allow a fast reduction of the
parameter space. Nevertheless, they are very sensitive to possible mistakes of
the DM in her answers. Indeed, in case of a wrong answer, the definitive reduc-
tion of the parameter space will exclude the wrong part of the set of possible
parameter values, which is likely to exclude the optimal solution from the set of
possibly optimal solutions (i.e., solutions that are optimal for at least one possible
parameter value). Consequently, the relevance of the recommendation may be
significantly impacted if there is no possible backtrack. A way to overcome this
drawback is to use probabilistic approaches that model the uncertainty
about the DM’s answers, thus giving her the opportunity to contradict
herself without overly impacting the quality of the recommendation. In such
methods, the parameter space remains unchanged throughout the algorithm and
the uncertainty about the real parameter values (which characterize the DM’s
preferences) is represented by a probability density function that is updated
when new preference statements are collected.
This idea has been developed in the literature. In the context of incremental
elicitation of utility values, Chajewska et al. [8] proposed to update a proba-
bility distribution over the DM’s utility function to represent the belief about
the utility value. The probability distribution is incrementally adjusted until the
expected loss of the recommendation is sufficiently small. This method does not
apply in our setting because we consider that the utility values of the alternatives
on every criterion are known and that we elicit the values of the weighting coef-
ficients of the aggregation function. Sauré and Vielma [15] introduced a method
based on maintaining a confidence ellipsoid region using a multivariate Gaussian
distribution over the parameter space. They use mixed integer programming to
Active Preference Elicitation by Bayesian Updating on Optimality Polyhedra 95
select a preference query that is the most likely to reduce the volume of the
confidence region. In a recent work [5], the uncertainty about the parameter
values is represented by a Gaussian distribution over the parameter space of
rank-dependent aggregation functions. Preference queries are selected by min-
imizing expected regrets to update the density function using Bayesian linear
regression. As the updating of a continuous density function is computationally
cumbersome (especially when no analytical form of the posterior density
function exists), data augmentation and sampling techniques
are used to approximate the posterior density function. These methods are time
consuming and require a tradeoff between computation time and accuracy of the
approximation. In addition, the information provided by a continuous
density function may be much richer than the information really needed by the
algorithm to conclude. Indeed, it is generally sufficient to know that the true
parameter values belong to a given restricted area of the parameter space to
be able to identify an optimal solution without ambiguity. Thus, we introduce
in this paper a new model-based incremental elicitation algorithm based on a
discretization of the parameter space. We partition the parameter space into
optimality polyhedra and we define a probability distribution over the partition.
After each query, this distribution is updated using Bayes’ rule.
The paper is organised as follows. Section 2 recalls some background on
weighted sums and ordered weighted averages. We also introduce the optimality
polyhedra we use in our method and we discuss our contribution with regard to
related works relying on the optimality polyhedra. We present our incremental
elicitation method in Sect. 3. Finally, some numerical tests showing the interest
of the proposed approach are provided in Sect. 4.
Example 1. Let x = (14, 9, 10), y = (10, 12, 10) and z = (9, 16, 6) be three perfor-
mance vectors to compare, and assume that the weighting vector is w = (1/4, 1/2, 1/4).
Applying Eq. (2), we obtain: OWAw (x) = 10.75 > OWAw (y) = 10.5 >
OWAw (z) = 10.
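Example 1 can be checked numerically. In this sketch performances are sorted increasingly, so the first weight bears on the worst component; since the example's weighting vector is symmetric, the sort direction does not affect the result here:

```python
def owa(x, w):
    """Ordered weighted average: the weights apply to the sorted
    performances (sorted increasingly here, so w[0] weights the worst)."""
    return sum(wi * xi for wi, xi in zip(w, sorted(x)))

w = (0.25, 0.5, 0.25)
owa((14, 9, 10), w)   # 10.75
owa((10, 12, 10), w)  # 10.5
owa((9, 16, 6), w)    # 10.0
```

Note how y, the most balanced vector, loses less mass to the extreme weights than z does, which is the behaviour favoured by concave OWA functions.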
Note that OWA includes the minimum (w1 = 1 and wi = 0 for all i ∈ {2, . . . , p}),
the maximum (wp = 1 and wi = 0 for all i ∈ {1, . . . , p − 1}), the arithmetic mean
(wi = 1/p for all i ∈ {1, . . . , p}) and all other order statistics as special cases.
If w is chosen with decreasing components (i.e., the greatest weight is
assigned to the worst performance), the OWA function is concave and well-
balanced performance vectors are favoured. We indeed have, for all x ∈ X ,
OWAw ((x1 , . . . , xi − ε, . . . , xj + ε, . . . , xp )) ≥ OWAw (x) for all i, j and ε > 0
such that xi − xj ≥ ε. Depending on the choice of the weighting vector w, a
concave OWA function allows defining a wide range of mean-type aggregation
operators between the minimum and the arithmetic mean. In the remainder of
the paper, we only consider concave OWA functions. For the sake of brevity, we
will say OWA for concave OWA.
Related Works. The idea of partitioning the parameter space is closely related
to Stochastic Multiobjective Acceptability Analysis (SMAA for short). The
SMAA methodology has been introduced by Charnetski and Soland under the
name of multiple attribute decision making with partial information [9]. Given
a set of utility vectors and a set of linear constraints characterizing the feasible
parameter space for a weighted sum (partial information elicited from the DM),
they assume that the probability of optimality for each alternative is proportional
to the hypervolume of its optimality polyhedron (the hypervolume reflects how
likely an alternative is to be optimal). Lahdelma et al. [12] developed this idea
in the case of imprecision or uncertainty in the input data (utilities of the alter-
natives according to the different criteria) by considering the criteria values as
probability distributions. They defined the acceptability index of an alternative,
which measures the variety of valuations for which that alternative is optimal,
and is proportional to the expected volume of its optimality polyhedron. They
also introduced a confidence factor, which measures whether the input data are
accurate enough for making an informed decision. The methodology has
been adapted to the 2-additive Choquet integral model by Angilella et al. [2].
These works consider that the uncertainty comes from the criterion values or
98 N. Bourdache et al.
from the variation in the answers provided by different DMs. They also consider
that some prior preference information is given and that there is no opportunity
to ask the DM for new preference statements. Our work differs from these
works in the following respects:
– the criterion values are accurately known and only the parameter values of
the aggregation function must be elicited;
– the uncertainty comes from possible errors in the DM’s answers to preference
queries;
– the elicitation process is incremental.
In other words, the PER defines the expected worst utility loss incurred by
recommending an alternative x instead of an alternative y, and PMR(x, y, W)
is the worst utility loss in recommending alternative x instead of alternative y
given that w belongs to W. The use of the PMR within a polyhedron is justified
by the complete ignorance of the probability distribution inside the polyhedron;
the worst case is therefore considered.
In other words, the MER value defines the worst utility loss incurred by
recommending an alternative x ∈ X and the MMER value defines the minimal
MER value over X .
The notion of regret expresses a measure of the interest of an alternative.
At any step of the algorithm, the solution achieving the MMER value is a rel-
evant recommendation because it minimizes the expected loss in the current
state of knowledge. It also makes it possible to determine an informative query to ask.
Various query selection strategies based on regrets and expected regrets have
indeed been introduced in the literature, see e.g. [6] in a deterministic con-
text (current solution strategy) and [11] in a probabilistic context (a probabil-
ity distribution is used to model the uncertainty about the parameter values).
Adapting the current solution strategy to our probabilistic setting, we propose
here a strategy that consists in asking the DM to compare the current rec-
ommendation x∗ = arg minx∈X MER(x, X , P ) to its best challenger defined by
y ∗ = arg maxy∈X PER(x∗ , y, P ). The current probability distribution is then
updated according to the DM’s answer, as explained hereafter. The procedure
can be iterated until the MMER value drops below a predefined threshold ε.
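The regret-based machinery can be sketched with finite weight samples standing in for the optimality polyhedra. This is an approximation of the PMR/PER/MER quantities defined earlier, and the function names are ours:

```python
def pmr(x, y, region, f):
    """Pairwise max regret of recommending x instead of y over a finite
    sample `region` of a weight polyhedron."""
    return max(f(y, w) - f(x, w) for w in region)

def per(x, y, regions, probs, f):
    """Probabilistic expected regret: probability-weighted PMR over the
    optimality regions."""
    return sum(p * pmr(x, y, r, f) for r, p in zip(regions, probs))

def mer(x, alternatives, regions, probs, f):
    """Max expected regret of recommending x against any challenger."""
    return max(per(x, y, regions, probs, f) for y in alternatives if y != x)

def mmer_recommendation(alternatives, regions, probs, f):
    """Current recommendation: the alternative minimizing the MER; its best
    challenger would be the alternative maximizing PER against it."""
    return min(alternatives, key=lambda x: mer(x, alternatives, regions, probs, f))
```

With a weighted sum as f and two equiprobable regions each favouring one extreme alternative, a balanced alternative achieves the MMER value, illustrating why the strategy hedges against parameter uncertainty.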
The approach proposed in this paper consists in interleaving preference
queries and Bayesian updating of the probability distribution based on the DM’s
answers. The elicitation procedure is detailed in Algorithm 1. At each step i of
the algorithm, we ask the DM to compare two alternatives x(i) and y (i) . The
answer is denoted by ai , where ai = 1 if x(i) is preferred to y (i) and ai = 0 other-
wise. From each answer ai , the conditional probability P (.|a1 , . . . , ai−1 ) over the
set of alternatives is updated in a Bayesian manner (Line 13 of Algorithm 1).
The likelihood function P (ai |z) is the conditional probability that the answer is
ai given that z is optimal. Let us denote by Wx(i) y(i) the subset of W containing
all vectors w such that fw (x(i) ) ≥ fw (y (i) ); the likelihood function is defined as:
             ⎧ δ            if Wz ⊆ Wx(i) y(i)
P(ai = 1|z) = ⎨ 1 − δ        if Wz ∩ Wx(i) y(i) = ∅
             ⎩ P(ai = 1)    otherwise
Fig. 2. The polyhedron is Wz . The non-hatched area is the half-space Wx(i) y(i) .
ratio ρ
We now show that ρ ≥ 1 for δ > 1/2. Let us denote by Yδ the subset of alternatives
y ∈ Y such that P(ai = α|y) = δ, and by Y1−δ the subset such that
P(ai = α|y) = 1 − δ. We have:
Σ_{y∈Y} P(ai = α|y) P(y|a1, . . . , ai−1)
    = δ Σ_{y∈Yδ} P(y|a1, . . . , ai−1) + (1 − δ) Σ_{y∈Y1−δ} P(y|a1, . . . , ai−1)
    ≤ δ Σ_{y∈Yδ} P(y|a1, . . . , ai−1) + δ Σ_{y∈Y1−δ} P(y|a1, . . . , ai−1)
      (because δ > 1/2; equality holds only when Y1−δ = ∅)
    = δ Σ_{y∈Y} P(y|a1, . . . , ai−1)
4 Experimental Results
Algorithm 1 has been implemented in Python using the polytope library to man-
age optimality polyhedra, and tested on randomly generated instances. We per-
formed the tests on an Intel(R) Core(TM) i7-4790 CPU with 15 GB of RAM.
Simulation of the Interactions with the DM. To simulate the DM’s answer
to query i, we represent the intensity of preference between alternatives x(i) and
y (i) by the variable u(i) = fw (x(i) ) − fw (y (i) ) + ε(i) , where ε(i) ∼ N (0, σ 2 ) is
a Gaussian noise modelling the DM’s possible error, with σ determining how
wrong the DM can be. The DM states that x(i) ≻ y (i) if and only if u(i) ≥ 0.
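This noisy-DM model is a one-liner; a sketch with illustrative names (f is the aggregation function, e.g. a weighted sum):

```python
import random

def simulate_answer(x, y, w, f, sigma, rng=random):
    """Simulated noisy DM: preference intensity u = f_w(x) - f_w(y) + eps
    with eps ~ N(0, sigma^2); the DM states 'x preferred to y' iff u >= 0."""
    u = f(x, w) - f(y, w) + rng.gauss(0.0, sigma)
    return u >= 0.0
```

With sigma = 0 the DM is error-free; larger sigma makes close comparisons increasingly likely to be answered wrongly, which is exactly the failure mode the Bayesian approach is designed to absorb.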
Fig. 5. Mean rank vs. number of queries (WS)
Fig. 6. Mean rank vs. number of queries (OWA)
Fig. 7. Rank vs. error rate (WS)
Fig. 8. Rank vs. error rate (OWA)
worst case, alternatives ranked around 90, while it is less than 40 for
Algorithm 1. More precisely, when σ = 0.3 (for both WS and OWA), in more
than 75% of instances, Algorithm 1 recommends an alternative with a better
rank than the mean rank obtained in the deterministic case.
5 Conclusion
of Chebyshev balls. The answer is not straightforward because, on the one hand,
the use of ellipsoids indeed refines the approximation of the polyhedra, but on
the other hand, this requires the use of matrix calculations to establish whether
or not an ellipsoid is cut by the constraint induced by a preference statement.
Another natural research direction is to extend our approach to more sophisti-
cated aggregation functions admitting a linear representation, such as Weighted
OWAs and other Choquet integrals, to improve our descriptive possibilities.
References
1. Albert, J.H., Chib, S.: Bayesian analysis of binary and polychotomous response
data. J. Am. Stat. Assoc. 88(422), 669–679 (1993)
2. Angilella, S., Corrente, S., Greco, S.: Stochastic multiobjective acceptability anal-
ysis for the Choquet integral preference model and the scale construction problem.
Eur. J. Oper. Res. 240(1), 172–182 (2015)
3. Benabbou, N., Perny, P.: Incremental weight elicitation for multiobjective state
space search. In: AAAI-15, pp. 1093–1099 (2015)
4. Bourdache, N., Perny, P.: Active preference elicitation based on generalized gini
functions: application to the multiagent knapsack problem. In: AAAI 2019, pp.
7741–7748 (2019)
5. Bourdache, N., Perny, P., Spanjaard, O.: Incremental elicitation of rank-dependent
aggregation functions based on Bayesian linear regression. In: IJCAI 2019, pp.
2023–2029 (2019)
6. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Constraint-based opti-
mization and utility elicitation using the minimax decision criterion. Artif. Intell.
170(8–9), 686–713 (2006)
7. Braziunas, D., Boutilier, C.: Minimax regret based elicitation of generalized addi-
tive utilities. In: Proceedings of UAI-07, pp. 25–32 (2007)
8. Chajewska, U., Koller, D., Parr, R.: Making rational decisions using adaptive utility
elicitation. In: Proceedings of AAAI-00, pp. 363–369 (2000)
9. Charnetski, J.R., Soland, R.M.: Multiple-attribute decision making with partial
information: the comparative hypervolume criterion. Nav. Res. Logist. Q. 25(2),
279–288 (1978)
10. Grabisch, M., Labreuche, C.: A decade of application of the Choquet and Sugeno
integrals in multi-criteria decision aid. Ann. OR 175(1), 247–286 (2010)
11. Guo, S., Sanner, S.: Multiattribute Bayesian preference elicitation with pairwise
comparison queries. In: NIPS, pp. 396–403 (2010)
12. Lahdelma, R., Hokkanen, J., Salminen, P.: SMAA - stochastic multiobjective
acceptability analysis. Eur. J. Oper. Res. 106(1), 137–143 (1998)
13. Li, D.: Convexification of a noninferior frontier. J. Optim. Theory Appl. 88(1),
177–196 (1996)
14. Nowak, R.: Noisy generalized binary search. In: Bengio, Y., Schuurmans, D.,
Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information
Processing Systems, vol. 22, pp. 1366–1374. Curran Associates, Inc. (2009)
15. Sauré, D., Vielma, J.P.: Ellipsoidal methods for adaptive choice-based conjoint
analysis. Oper. Res. 67, 315–338 (2019)
16. Wang, T., Boutilier, C.: Incremental utility elicitation with the minimax regret
decision criterion. IJCAI 3, 309–316 (2003)
17. Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria
decision making. IEEE Trans. Syst. Man Cybern. 18(1), 183–190 (1988)
Selecting Relevant Association Rules
From Imperfect Data
1 Introduction
Association rule mining (ARM) is a well-known data mining technique designed
to extract interesting patterns in databases. It has been introduced in the context
of market basket analysis [1], and has received a lot of attention since then [15].
An association rule is usually formally defined as an implication between an
antecedent and a consequent, both being conjunctions of attributes in a database, e.g.
“People aged between 20 and 30 with a monthly income greater
than $2k are likely to buy product X”. Such rules are interesting for extracting
simple, intelligible knowledge from a database; they can further be used
in several applications, e.g. recommendation, or customer and patient analysis. A
large literature is dedicated to the study of ARM, and numerous algorithms
have been defined for efficiently extracting rules handling a large range of data
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 107–121, 2019.
https://doi.org/10.1007/978-3-030-35514-2_9
108 C. L’Héritier et al.
The main challenge in ARM is to extract interesting rules from a large search
space, e.g., when n and m are large. In this context, defining the interestingness
of a rule is central.
the initial belief function and used in a revision process [9]. Thus, for A, B ⊆ Θ,
such that Bel(A) > 0, we will further consider:
Bel(B|A) = Bel(A ∩ B) / (Bel(A ∩ B) + Pl(A ∩ B̄)),   Pl(B|A) = Pl(A ∩ B) / (Pl(A ∩ B) + Bel(A ∩ B̄)),

where B̄ denotes the complement of B in Θ.
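These conditional definitions can be checked on a toy example. The sketch below is illustrative only: the frame and the mass values are assumptions, not data from the paper, and the mass function is represented as a plain dictionary from focal sets to masses.

```python
# Hedged sketch: conditional belief and plausibility (Fagin-Halpern-style
# conditioning) computed from a mass function over a small, toy frame.

def bel(m, s):
    """Belief of set s: total mass of focal sets included in s."""
    return sum(v for f, v in m.items() if f <= s)

def pl(m, s):
    """Plausibility of set s: total mass of focal sets intersecting s."""
    return sum(v for f, v in m.items() if f & s)

def bel_cond(m, b, a, theta):
    """Bel(B|A) = Bel(A ∩ B) / (Bel(A ∩ B) + Pl(A ∩ B̄))."""
    comp_b = theta - b
    denom = bel(m, a & b) + pl(m, a & comp_b)
    return bel(m, a & b) / denom if denom else None

def pl_cond(m, b, a, theta):
    """Pl(B|A) = Pl(A ∩ B) / (Pl(A ∩ B) + Bel(A ∩ B̄))."""
    comp_b = theta - b
    denom = pl(m, a & b) + bel(m, a & comp_b)
    return pl(m, a & b) / denom if denom else None

theta = frozenset({"x", "y", "z"})
m = {frozenset({"x"}): 0.5, frozenset({"x", "y"}): 0.3, theta: 0.2}
A, B = frozenset({"x", "y"}), frozenset({"x"})
print(bel_cond(m, B, A, theta), pl_cond(m, B, A, theta))  # → 0.5 1.0
```

As expected, Bel(B|A) ≤ Pl(B|A): the conditional interval brackets the unknown conditional probability.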
We denote by R the set of rules defined by Formula (1). The problem addressed
here is to reduce R by selecting only the relevant rules.
Rule Pruning. Most approaches use thresholds to select rules: in traditional
ARM, using only support and confidence most often allows drastically reducing
the number of rules [1]. A post-mining step is generally performed to rank
the remaining rules according to one specific interestingness measure, generally
selected according to the application domain and to context-specific measure
properties [23,27]. Nevertheless, processing this way does not
1 Note that the simplification of the mining process here refers to a reduction of
complexity in terms of the number of rules analysed, i.e. search space size. Algorithmic
contributions, and therefore complexity analyses regarding efficient implementations
of the proposed approach, are left for future work.
enable selecting rules when conflicting interestingness measures are used, e.g.
when maximizing both the support and the specificity of rules. This is the purpose
of MCDA methods. Some works propose to take advantage of MCDA methods [3–6,17]
in the context of ARM. Those works can be divided into two categories: (1)
those incorporating the end-user’s preferences, using the Analytic Hierarchy Process
(AHP) and Electre II [6], or Electre Tri [3]; and (2) those that do not incorporate
such information and use Data Envelopment Analysis (DEA) [5,26] or the
Choquet integral [17]. Our approach is hybrid and falls within both categories.
First, selection is made based only on database information, as in Bouker et al.
[4]. Second, if the set of selected rules is large, a trade-off based on the end-user’s
preferences is applied within an appropriate MCDA method. As our aim is to select
a subset of interesting rules, Electre I [18] seems the most appropriate.
ARM and Imperfect Data. Several frameworks have been studied to deal with
imperfect data in ARM. The assumptions entailed in the approaches based on
probabilistic models do not preserve imprecision and might lead to unreliable
inferences [13]. Uncertainty theories have also been investigated for imperfect
data in ARM, using fuzzy logic [14] or possibility theory [8]. In the case of
missing and incomplete data, evidence theory seems the appropriate setting to
handle the ARM problem [13,19,24,25]. Our approach adopts this setting. In
addition to studying a richer modelling that enables incorporating more information,
we propose to combine it with a selection process taking advantage of
an MCDA method, namely Electre I, to assess rule interestingness from
different viewpoints. Although some of the works previously mentioned tackle rule
selection using MCDA, and a few approaches have addressed the ARM problem
using evidence theory, none of them addresses both issues simultaneously.
We also show how to benefit from a priori knowledge about attribute values
(organised into taxonomies) to improve the rule selection process and to
limit the increase of complexity induced by the proposed extension of the
modellings used so far in existing ARM approaches suited for imperfect data.
3 Proposed Approach
This section presents our ARM approach for imperfect data. We first introduce
how rule interestingness is evaluated by presenting the selected measures and
their formalization in the evidence theory framework. Then, the main steps of
the proposed approach for selecting rules based on these measures are detailed.
Since in our context rules are imprecise, and since very imprecise rules are most
often considered useless, (iv) the degree of imprecision embedded in the mined
rules is also evaluated. These four notions of interest are defined below. For
convenience, we consider that we are computing measures to evaluate a rule
r : A → B, where A = ×_{i∈I1} A_i with A_i ⊆ Θ_i, and B = ×_{j∈I2} B_j with
B_j ⊆ Θ_j, and I1 ∪ I2 = N. In our context, since we consider n = |N| attributes,
the set functions mass m, belief Bel and plausibility Pl are defined on subsets of
Θ = ×_{i∈N} Θ_i.
Note that since the belief function is monotone, the rules composed of the
most imprecise attribute values will necessarily be the most supported.
Let us recall the starting set R (see Formula (1)) of rules from which a small
subset R∗ of interesting rules should be selected:

R = {r : A → B | A = ×_{i∈I1} A_i, A_i ⊆ Θ_i, B = ×_{j∈I2} B_j, B_j ⊆ Θ_j}
We assume that I1 and I2 are fixed before starting the ARM process.
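For fixed I1 and I2, the rule space R can be enumerated explicitly when the attribute domains are small. The sketch below is purely illustrative: the domains and index sets are toy assumptions, not the paper's Θ_i, and each antecedent/consequent is built as a Cartesian product of non-empty subsets, as in Formula (1).

```python
# Illustrative sketch: enumerating the candidate rule set R for tiny,
# assumed attribute domains. Each A_i (resp. B_j) is a non-empty subset
# of its attribute domain Θ_i (resp. Θ_j).
from itertools import combinations, product

def nonempty_subsets(domain):
    items = sorted(domain)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

domains = {1: {"t1", "t2"}, 2: {"s1", "s2"}, 3: {"low", "high"}}
I1, I2 = [1, 2], [3]  # antecedent / consequent attribute indices (toy choice)

rules = [(a, b)
         for a in product(*(nonempty_subsets(domains[i]) for i in I1))
         for b in product(*(nonempty_subsets(domains[j]) for j in I2))]

# Each two-valued domain has 3 non-empty subsets, so |R| = 3 * 3 * 3 = 27.
print(len(rules))  # → 27
```

This makes the combinatorial growth concrete: the number of rules is a product of (2^{|Θ_i|} − 1) terms, which motivates the restrictions and pruning steps proposed in the paper.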
To simplify notations in the rest of the paper, we will denote by rA,B the
rule r : A → B where A and B are as in Formula (1). Two restrictions are
proposed below:
Step 2: Pruning Using Electre I. When Rr,t,d remains too large to be manually
analyzed, a subjective pruning procedure based on the Electre I selection
procedure is applied. This MCDA method enables expressing subjectivity through
parameters that can be given by decision makers [18]. We use it to find the
final set of rules R∗ ⊆ Rr,t,d. Electre I builds an outranking relation between
pairs of rules, allowing a subset of the best rules, R∗, to be selected. This subset
is such that (i) any rule excluded from Rr,t,d is outranked by at least one rule
from R∗, and (ii) rules from R∗ do not outrank each other. To do so, the Electre I
procedure (a) constructs outranking relationships through pairwise comparisons
of rules, and then (b) exploits those relationships to build R∗.
The set of model parameters that have to be defined for applying the subjective
reduction based on Electre I are: the weights wk, ∀k ∈ K, and the concordance
and discordance thresholds c̄ and d̄.3 The choice of parameter values will be
further discussed in the illustration, Sect. 4.
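Steps (a) and (b) above can be sketched in a few lines. This is a hedged, minimal reading of Electre I: the concordance/discordance definitions follow one common formulation, the kernel extraction assumes an acyclic outranking graph, and all weights, thresholds, and rule scores below are toy assumptions, not values from the paper.

```python
# Hedged sketch of Electre I pruning for rules scored on K criteria
# (all assumed to be maximized). Parameters and scores are illustrative.

def outranks(ga, gb, w, c_hat, d_hat, spans):
    """a outranks b if the weighted concordance is high enough and the
    worst normalized discordance stays below the veto threshold."""
    conc = sum(wk for wk, xa, xb in zip(w, ga, gb) if xa >= xb) / sum(w)
    disc = max((xb - xa) / s for xa, xb, s in zip(ga, gb, spans))
    return conc >= c_hat and disc <= d_hat

def electre_kernel(scores, w, c_hat, d_hat):
    # Criterion ranges, used to normalize discordances.
    spans = [max(c) - min(c) or 1.0 for c in zip(*scores.values())]
    edges = {(a, b) for a in scores for b in scores if a != b
             and outranks(scores[a], scores[b], w, c_hat, d_hat, spans)}
    remaining, kernel = set(scores), set()
    while remaining:  # assumes an acyclic outranking graph
        # Rules not outranked by any remaining rule enter the kernel.
        top = {a for a in remaining
               if not any((b, a) in edges for b in remaining if b != a)}
        kernel |= top
        remaining -= top | {b for a in top for b in remaining
                            if (a, b) in edges}
    return kernel

scores = {"r1": [0.9, 0.8, 0.7], "r2": [0.4, 0.3, 0.2],
          "r3": [0.2, 0.9, 0.1]}
print(sorted(electre_kernel(scores, w=[1, 1, 1], c_hat=0.6, d_hat=0.4)))
```

Here r1 outranks both r2 and r3 (high concordance, tolerable discordance), so the kernel R∗ reduces to {r1}, satisfying conditions (i) and (ii) above.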
4 Illustration
As an illustration, we consider the context of humanitarian projects carried out
in response to emergency situations. A dataset of observations describes these
emergency situations according to four attributes: (1) the type of disaster faced,
(2) the season, (3) the environment in which it occurred, and (4) an evaluation
of the situation w.r.t. its human cost. We further refer to these attributes using
3 Evaluating the support and confidence of A → B̄ and Ā → B can lead to undefined
values: e.g., when evaluating Ā → B, we have Bel(Ā × B) = 0 when Ā has never been
observed, leading to Bel(B|Ā) being undefined. However, pruning using dominance
and Electre I requires the same measures to be defined for all rules. Undefined values
are thus substituted by an arbitrary value that neither favors nor penalizes the
evaluation of the rule. The median of Bel(Ā × B) (resp. Bel(A × B̄)) has been chosen.
Note that A → B is not concerned, since evaluating A → B implies evidence on A.
their number, considering that they respectively take discrete values in: Θ1 =
{tsunami, earthquake, epidemic, conflict, pop.displacement}, Θ2 = {spring,
summer, autumn, winter}, Θ3 = {urban, rural}, Θ4 = {low, medium, high,
veryHigh}. Besides, for each attribute, prior knowledge is encoded into ontologies
determining the values of interest. In this specific case study, the purpose of
association rules is to highlight the influence of a situation’s contextual features
on its evaluation in terms of human cost, which is useful information for project
planning. Thus the searched rules r : A → B involve the attributes in I1 = {1, 2, 3}
in the antecedent and those in I2 = {4} in the consequent.
Table 5. Final sets of rules (R∗) obtained with Electre I pruning using five parameter
settings (a to e).
interest w.r.t. the initial set of observations, such as r16 and r13. In the initial
dataset (see Table 3), the imprecise information {spring, summer} for the season
or {high, veryHigh} for the human cost is frequently observed. Hence,
selecting the imprecise rule r13 : {epidemic} ∧ {spring, summer} ∧ {urban} →
{high, veryHigh} in R∗ is not surprising. As an interpretation of this rule,
the analysis of the database tends to relate the occurrence of epidemics
in urban areas to a specific season, spring or summer, and to a high human cost.
In particular, the rule seems valid for at least one of the conjunctions “summer and
high human cost”, “summer and veryHigh human cost”, “spring and high”
or “spring and veryHigh”. In this illustration, different sets of parameters and
their results on rule selection have been presented. However, these parameters
have to be set by the end-user.
To further discuss these results, it is interesting to note that all the measures
selected for rule comparison, except the IC, are based on observation frequency.
In order to counterbalance the preponderance of this factor, it might
be relevant to add subjective measures and not only data-driven ones. Subjective
interestingness measures have been studied in the literature. Relying on
these works, we could include here measures based, for example, on user-expected
rules or expected conjunctions of attribute values. Furthermore, investigating the
dependencies among frequency-based measures, and considering them in the
selection process, would be valuable. Nevertheless, considering additional measures
(especially data-driven ones), such as those proposed for classical ARM, is not
necessarily straightforward within the evidence theory framework: it indeed
requires defining their proper expression and meaning in this framework.
Mining association rules from imperfect data is a key challenge for real-world
applications dealing with, e.g., imprecise or missing data. The ARM approach
introduced in this paper makes it possible to deal with imprecise data and to
derive imprecise rules under specific conditions (e.g. fixing both antecedent and
consequent attributes). Relying on evidence theory and Multiple Criteria Decision
Analysis, this new framework enriches the expressivity of existing works while
providing a novel selection procedure for identifying the most interesting rules
according to several viewpoints. To this aim, several interestingness measures
have been proposed and used in a two-step selection procedure based on dominance
relationships and Electre I. A restriction using a priori knowledge has also been
proposed to focus and ease the mining process by incorporating symbolic knowledge
defined in domain ontologies. To further improve the approach, additional
measures of interestingness could be added. Future work related to subjective
measures (e.g., user-oriented ones) would be particularly relevant to enrich the
set of frequency-based measures currently involved in the approach. Studying
the interactions between the measures would also be of interest. Finally, only
an illustration using a simplified case study related to humanitarian project
analysis has been presented in this paper. Thorough algorithmic complexity and
scalability analyses are left for future work.
References
1. Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of
items in large databases. In: ACM SIGMOD Record, vol. 22, pp. 207–216. ACM
(1993)
2. Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In:
Proceedings of 20th International Conference on Very Large Data Bases, VLDB,
vol. 1215, pp. 487–499 (1994)
3. Ait-Mlouk, A., Gharnati, F., Agouti, T.: Multi-agent-based modeling for extract-
ing relevant association rules using a multi-criteria analysis approach. Vietnam J.
Comput. Sci. 3(4), 235–245 (2016)
4. Bouker, S., Saidi, R., Yahia, S.B., Nguifo, E.M.: Ranking and selecting association
rules based on dominance relationship. In: 2012 IEEE 24th International Confer-
ence on Tools with Artificial Intelligence, vol. 1, pp. 658–665. IEEE (2012)
5. Chen, M.C.: Ranking discovered rules from data mining with multiple criteria by
data envelopment analysis. Expert Syst. Appl. 33(4), 1110–1116 (2007)
6. Choi, D.H., Ahn, B.S., Kim, S.H.: Prioritization of association rules in data mining:
multiple criteria decision approach. Expert Syst. Appl. 29(4), 867–878 (2005)
7. Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping.
Ann. Math. Stat. 38, 325–339 (1967)
8. Djouadi, Y., Redaoui, S., Amroun, K.: Mining association rules under imprecision
and vagueness: towards a possibilistic approach. In: 2007 IEEE International Fuzzy
Systems Conference, pp. 1–6. IEEE (2007)
9. Dubois, D., Denoeux, T.: Conditioning in Dempster-Shafer theory: prediction vs.
revision. In: Denoeux, T., Masson, M.H. (eds.) Belief Functions: Theory and Appli-
cations, pp. 385–392. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-
642-29461-7 45
10. Fagin, R., Halpern, J.Y.: A new approach to updating beliefs. In: Proceedings of
the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI 1990,
pp. 347–374. Elsevier Science Inc., New York, NY, USA (1991). http://dl.acm.org/
citation.cfm?id=647233.760137
11. Figueira, J., Roy, B.: Determining the weights of criteria in the Electre type methods
with a revised Simos’ procedure. Eur. J. Oper. Res. 139(2), 317–326 (2002)
12. Geng, L., Hamilton, H.J.: Interestingness measures for data mining: a survey. ACM
Comput. Surv. 38(3), 9-es (2006)
13. Hewawasam, K., Premaratne, K., Subasingha, S., Shyu, M.L.: Rule mining and
classification in imperfect databases. In: 2005 7th International Conference on
Information Fusion, vol. 1, p. 8. IEEE (2005)
14. Hong, T.P., Lin, K.Y., Wang, S.L.: Fuzzy data mining for interesting generalized
association rules. Fuzzy Sets Syst. 138(2), 255–269 (2003)
15. Kotsiantis, S., Kanellopoulos, D.: Association rules mining: a recent overview.
GESTS Int. Trans. Comput. Sci. Eng. 32(1), 71–82 (2006)
16. Liu, B., Hsu, W., Chen, S., Ma, Y.: Analyzing the subjective interestingness of
association rules. IEEE Intell. Syst. 15(5), 47–55 (2000). https://doi.org/10.1109/
5254.889106
17. Nguyen Le, T.T., Huynh, H.X., Guillet, F.: Finding the most interesting association
rules by aggregating objective interestingness measures. In: Richards, D., Kang, B.-
H. (eds.) PKAW 2008. LNCS (LNAI), vol. 5465, pp. 40–49. Springer, Heidelberg
(2009). https://doi.org/10.1007/978-3-642-01715-5 4
18. Roy, B.: Classement et choix en présence de points de vue multiples. Revue
française d’informatique et de recherche opérationnelle 2(8), 57–75 (1968)
19. Samet, A., Lefèvre, E., Yahia, S.B.: Evidential data mining: precise support and
confidence. J. Intell. Inf. Syst. 47(1), 135–163 (2016)
20. Seco, N., Veale, T., Hayes, J.: An intrinsic information content metric for semantic
similarity in WordNet. In: ECAI, vol. 16, p. 1089 (2004)
21. Shafer, G.: A Mathematical Theory of Evidence, vol. 42. Princeton University
Press, Princeton (1976)
22. Silberschatz, A., Tuzhilin, A.: What makes patterns interesting in knowledge dis-
covery systems. IEEE Trans. Knowl. Data Eng. 8(6), 970–974 (1996)
23. Tan, P.N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for
association patterns. In: Proceedings of the Eighth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 32–41. ACM (2002)
24. Tobji, M.B., Yaghlane, B.B., Mellouli, K.: A new algorithm for mining frequent
itemsets from evidential databases. Proc. IPMU 8, 1535–1542 (2008)
25. Bach Tobji, M.A., Ben Yaghlane, B., Mellouli, K.: Frequent itemset mining from
databases including one evidential attribute. In: Greco, S., Lukasiewicz, T. (eds.)
SUM 2008. LNCS (LNAI), vol. 5291, pp. 19–32. Springer, Heidelberg (2008).
https://doi.org/10.1007/978-3-540-87993-0 4
26. Toloo, M., Sohrabi, B., Nalchigar, S.: A new method for ranking discovered rules
from data mining by DEA. Expert Syst. Appl. 36(4), 8503–8508 (2009)
27. Vaillant, B., Lenca, P., Lallich, S.: A clustering of interestingness measures. In:
Suzuki, E., Arikawa, S. (eds.) DS 2004. LNCS (LNAI), vol. 3245, pp. 290–297.
Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30214-8 23
Evidential Classification of Incomplete
Data via Imprecise Relabelling:
Application to Plastic Sorting
1 Introduction
Plastic recycling is a promising alternative to landfills for dealing with the
fastest-growing waste stream in the world [8]. However, for physicochemical
reasons related to the non-miscibility between plastics, most plastics must be recycled
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 122–135, 2019.
https://doi.org/10.1007/978-3-030-35514-2_10
Evidential Classification of Incomplete Data 123
better represent the learning data by mapping overlaps in the feature space. The
resulting imprecise labels can be naturally treated within the belief function
theory framework. Indeed, belief function theory [26] is an interesting framework
for representing imprecise and uncertain data, as it allows the allocation of a
probability mass to imprecise data. Thus, imprecision and ignorance are better
captured in this framework than in the probability framework, where equiprobability
and imprecision are confused. The recent growing interest in this theory has
allowed techniques to be developed for resolving a diverse range of problems
such as estimation [12,17], standard classification [10,32], or even hierarchical
classification [1,23].
Our proposed approach, called Evidential CLAssification of incomplete data
via Imprecise Relabelling (ECLAIR), is based on a relabelling procedure of the
training examples that enables a better representation of the missing information
about some data features. A classifier is then trained on the relabelled data,
producing a posterior mass function. With imprecise relabelling, we try to
quantify, using a mass function, the extent to which a subset of classes is reliable
and relevant as an output for a new data point. In other words, we look for the set
of classes for which any more precise output would inevitably lead to an error.
The resulting classification algorithm can enhance the classification accuracy as
well as cope with difficult examples by allowing less precise but more reliable
classification outputs, which will optimize the recycling process.
The remainder of this paper is organized as follows: Sect. 2 sets out the
main notations and provides a reminder on supervised classification and on elements
of belief function theory; Sect. 3 presents the proposed approach; Sect. 4
briefly describes related works; Sect. 5 presents the results of experiments
on the problem of sorting four plastics.
2 Theoretical Background
Classification is a technique for assigning objects to categories from the
observation of several of their characteristics. A classifier is a function that maps
an object, represented by the values of its characteristics on a finite set of variables,
to a category represented by the value of a categorical variable. More precisely, let
us consider a set of n categories represented by a set Θ = {θ1 , θ2 , . . . , θn }, also
referred to as a set of labels or classes. In the framework of belief functions, Θ is
called a frame of discernment. Each θj , j ∈ {1, ..., n}, denotes a singleton which
represents the lowest level of discernible information in Θ. Let us denote by
X1 , X2 , . . . , Xp the p variables whose values represent the characteristics,
also called attributes or features, of the objects to be classified. In the rest of
the paper we refer to Θ as a set of classes and to (X1 , X2 , . . . , Xp ) as a vector
of features, where ∀i ∈ {1, . . . , p}, Xi refers both to the name of the feature and
to the space of the values taken by the feature, i.e., Xi ⊆ R. For an object x
belonging to X = ×_{i=1}^{p} Xi ⊆ R^p, let θ(x) ∈ Θ denote the unknown label that
should be associated with x.
Pignistic Level. In the transferable belief model [29], the decision is made at
the pignistic level. The evidential information is transferred into a probabilistic
framework by means of the pignistic probability distribution betPm: for θ ∈ Θ,
betPm(θ) = Σ_{A⊆Θ, θ∈A} m(A)/|A|, where |A| denotes the number of elements in A.
Decision Rule. The risk associated with a decision rule can be adapted to the
evidential framework [9,13,27]. In the case of imprecise data, the set of actions
A is 2Θ \ {∅}. In order to decide between the elements of A according to the
chosen loss function L, it is possible to adopt different strategies. Two strategies
are proposed in the literature: the optimistic strategy, which minimizes the lower
risk r̲, and the pessimistic strategy, which minimizes the upper risk r̄, defined as
follows:

r̲(A) = Σ_{B⊆Θ} m(B) min_{θ∈B} L(A, θ),   r̄(A) = Σ_{B⊆Θ} m(B) max_{θ∈B} L(A, θ).  (1)
Let us consider the supervised classification problem where the available training
examples, precisely labelled (case 1), (xi , θi )1≤i≤N with xi ∈ X and θi ∈ Θ, are
such that (i) the labels θi=1,...,N are trusted: they may derive from expertise on
other features x∗i=1,...,N which contain more complete information than xi=1,...,N ;
(ii) this loss of information induces overlapping on some examples. In other
words, ∃i, j ∈ {1, ..., N } such that the characteristics of xi are very close to
those of xj but θi ≠ θj . When a standard classifier is trained on such data, it
will commit inevitable errors. The problem that we handle in this paper is how
to improve the learning step to better take this type of data into account and to
obtain better performances and reliable predictions.
that were trusted. The fact that several (C) classifiers are used to express the
imprecision permits a better objectivity about the real imprecision of the features,
i.e., the example is difficult not only for a single classifier. We denote by A ⊆ 2Θ
the set of the new training labels Ai , i = 1, ..., N .
Note that we limited the new labels Ai to have at most two elements, except
when expressing ignorance (Ai = Θ). This is done to avoid too unbalanced
training sets, but more general relabellings could be considered. Once all the
training examples are relabelled, a classifier δ2Θ can be trained.
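The relabelling rule described above can be sketched as follows. This is one plausible reading, not necessarily the exact ECLAIR aggregation: an example keeps its (trusted) label when the ensemble agrees with it, receives a two-element label when the C classifiers hesitate between two classes, and receives Θ otherwise, matching the at-most-two-elements restriction stated above.

```python
# Hedged sketch of ensemble-based imprecise relabelling. The aggregation
# rule below is an illustrative reading of the text, with toy classes.

def relabel(true_label, predictions, theta):
    """New label from the trusted label and the C ensemble predictions."""
    votes = set(predictions) | {true_label}  # the original label is trusted
    if len(votes) <= 2:
        return frozenset(votes)  # precise label kept, or imprecise pair
    return frozenset(theta)      # too much disagreement: full ignorance

theta = {"A", "B", "C", "D"}
print(relabel("A", ["A", "A", "A"], theta))  # stays {'A'}
print(relabel("A", ["B", "A", "B"], theta))  # becomes {'A', 'B'}
print(relabel("A", ["B", "C", "B"], theta))  # becomes Θ
```

Each new label Ai is a set, so the relabelled training set (xi, Ai) can feed a standard classifier over the extended label space A, as described next.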
Learning δ2Θ . As indicated throughout this paper, δ2Θ is learnt using the new
labels, ignoring the relations that might exist between the elements of A.
Reinforcing the idea of independence of treatment between the classes, LDA is applied
to the relabelled training set (xi , Ai )i=1,...,N . This results in a reduction of the
space dimension from p to |A| − 1, which better expresses the repartition of the
relabelled training examples. For the training example i ∈ {1, ..., N }, let x′i ∈ R|A|−1
be the projection of xi onto this (|A| − 1)-dimensional space. The classifier δ2Θ is
finally trained on (x′i , Ai )i=1,...,N .
Decision Problem. As recalled in Subsects. 2.2 and 2.3, the decision to assign a
new object x to a single class or to a set of classes usually relies on the minimisation
of a risk function associated with a loss function L : 2Θ \ {∅} × Θ → R.
As mentioned in the introduction to this paper, the application of our work
concerns situations where errors may have serious consequences. It is then
legitimate to consider the pessimistic strategy, minimizing the upper risk r̄.
Furthermore, in the definition of r̄, Eq. (1), the quantity max_{θ∈B} L(A, θ) is the loss
incurred by choosing A ⊆ Θ when the true state of nature lies in B ⊆ Θ. On
this basis, we propose a new definition of the loss function, L(A, B),
A, B ⊆ Θ, which directly takes into account the relations between A and B.
This is actually a generalisation of the definition proposed in [7], based
on the F-measure, recall, and precision for imprecise classification. Let us consider
A, B ∈ 2Θ \ {∅}, where A is the prediction for the object x and B is its
state of nature. Recall is defined as the proportion of relevant classes included
in the prediction. We define the recall of A and B as:
R(A, B) = |A ∩ B| / |B|.  (2)
Precision is defined as the proportion of classes in the prediction that are relevant.
We define the precision of A and B as:

P(A, B) = |A ∩ B| / |A|.  (3)
The Fβ-measure then combines precision and recall:

Fβ(A, B) = (1 + β²)PR / (β²P + R) = (1 + β²)|A ∩ B| / (β²|B| + |A|).  (4)
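Equations (2)-(4) transcribe directly to code on sets. The prediction and state of nature below are toy assumptions used only to exercise the formulas.

```python
# Direct transcription of Eqs. (2)-(4): recall, precision and Fβ between a
# set-valued prediction A and the state of nature B (both non-empty sets).

def recall(a, b):
    return len(a & b) / len(b)      # Eq. (2)

def precision(a, b):
    return len(a & b) / len(a)      # Eq. (3)

def f_beta(a, b, beta):
    return (1 + beta**2) * len(a & b) / (beta**2 * len(b) + len(a))  # Eq. (4)

A, B = {"x", "y"}, {"x"}
print(recall(A, B), precision(A, B), f_beta(A, B, beta=1.0))
```

Note that as β → 0 the Fβ-measure reduces to the precision, so small β penalizes imprecise predictions, while large β emphasizes recall and tolerates larger prediction sets; this is the trade-off exploited in the decision step.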
130 L. Jacquin et al.
4 Related Works
Regarding relabelling procedures, much research has been carried out to identify
suspect examples with the intention of suppressing them or relabelling them into a
more appropriate concurrent class [16,20]. This is generally done to enhance
performance. Other approaches consist in relabelling into imprecise classes; this has
been done to test evidential classification approaches on imprecisely labelled data
in [37]. But, as already stated, our relabelling serves a different purpose: better
mapping overlaps in the feature space. Concerning imprecise classification, several
works have been dedicated to this problem. Instead of the term “imprecise
classification” adopted in our article, authors use terms like “nondeterministic
classification” [7], “reliable classification” [24], “indeterminate classification”
[6,36], “set-valued classification” [28,31] or “conformal prediction” [3] (see [24] for
a short state of the art). In [36], the Naive Credal Classifier (NCC) is proposed as
an extension of the Naive Bayes Classifier (NBC) to sets of probability distributions.
In [24], the authors propose an approach that starts from the outputs of binary
classifiers [25] trained to distinguish aleatoric and epistemic uncertainty. The
outputs are an epistemic uncertainty, an aleatoric uncertainty, and two preference
degrees in favor of the two concurrent classes. [24] generalizes this approach to
the multi-class setting, providing sets of classes as output. Closer to ours are the
approaches of [5] and [7]. The approach in [7] is based on a posterior probability
distribution provided by a probabilistic classifier. The advantage of such an
approach, and of ours, is that any standard probabilistic classifier may be used to
perform an imprecise classification. Our approach distinguishes itself by the
relabelling step and by the way probabilities are allowed on sets of classes. To the
best of our knowledge, existing algorithms do not train a probabilistic classifier
on partially labelled data to quantify the body of evidence. Although we insisted
on the use of a standard probabilistic classifier δ2Θ unaware of the relations between
the sets, it is possible to run our procedure with an evidential classifier such as the
evidential k-NN [5].
5 Illustration
5.1 Settings
We performed experiments on the classification problem of four plastic categories,
designated plastics A, B, C and D, on the basis of industrially acquired spectra.
The total of 11540 available data examples is summarized in Table 1. Each plastic
example was identified by experts on the basis of laboratory measurements of
attenuated total reflectance (ATR) spectra, which are considered a reliable source
of information for determining plastic categories. As a consequence, the original
training classes are trusted and were not questioned. However, the data provided by
the industrial devices may be challenged. These data consist of spectra composed
of the reflectance intensity at 256 different wavelengths. Therefore, and for the
reasons enumerated in Sect. 1, the features are subject to ambiguity. Prior to the
experiments, all the feature vectors, i.e., spectra, were corrected by the standard
normal variate technique to avoid light scattering and spectral noise effects. We
implemented our approach and compared it to the approaches in [5] and [7]. The
implementation is made using R packages, using existing functions for the
application of the following 8 classifiers: naive Bayes (nbayes), k-Nearest
Neighbour (k-NN), decision tree (tree), random forest (rf), linear discriminant
analysis (lda), partial least squares discriminant analysis (pls-da), support vector
machine (svm) and neural networks (nnet).1
Table 1. Number of spectra of each original class in learning and testing bases.
5.2 Results
In order to apply our procedure, we must first choose a set of classifiers
to perform the relabelling. These classifiers are not necessarily probabilistic,
but they must produce point predictions. Thus, for every experiment, our algorithm
ECLAIR was run with the ensemble relabelling using 7 classifiers: nbayes,
k-NN, tree, rf, lda, svm, nnet2 . Then, we are able to derive the ECLAIR imprecise
version of a selected probabilistic classifier. Figure 2 shows the recall and
precision scores of the probabilistic classifier nbayes, illustrating the role of β. We
observe the same influence of β as mentioned in [7]. Indeed (cf. Subsect. 3.3), with
small values of β the Fβ-measure reduces to precision, so precise outputs are favored,
while larger values of β favor recall and thus allow more imprecise outputs.
1 Experiments concerning these learning algorithms rely on the following functions
(and R packages): naiveBayes (e1071), knn3 (caret), rpart (rpart), randomForest
(randomForest), lda (MASS), plsda (caret), svm (e1071), nnet (nnet).
2 In order to limit unbalanced classes, we chose to exclude from the learning base
examples whose new labels count less than 20 examples.
Fig. 2. Recall and precision of ECLAIR using nbayes, i.e. δ2Θ is nbayes, against β.

[Figure: precision score against recall score, comparing ND (raw data), ND (LDA extraction), ND (ECLAIRE features) and ECLAIRE.]
6 Conclusion
In this article, a method for the evidential classification of incomplete data via
imprecise relabelling was proposed. For any probabilistic classifier, our approach
provides an adaptation yielding more cautious outputs. The benefit of our approach
was illustrated on the problem of sorting plastics, where it showed competitive
performances. Our algorithm is generic: it can be applied in any other context where
incomplete data on the features are present. In future works, we plan to exploit
our procedure to provide cautious decision-making for the problem of plastic
sorting. This application requires a highly reliable decision in order to preserve
the physicochemical properties of the recycled product. At the same time, the
decision shall ensure reasonable relevance to guarantee the financial interest:
indeed, the more finely a plastic category is sorted, the more benefit the industrial
actor gets. We also plan to strengthen the evaluation of our approach by confronting
it with other state-of-the-art imprecise classifiers and by performing experiments
on several datasets from machine learning repositories.
References
1. Alshamaa, D., Chehade, F.M., Honeine, P.: A hierarchical classification method
using belief functions. Signal Process. 148, 68–77 (2018)
2. Ambroise, C., Denoeux, T., Govaert, G., Smets, P.: Learning from an imprecise
teacher: probabilistic and evidential approaches. Appl. Stoch. Models Data Anal.
1, 100–105 (2001)
3. Balasubramanian, V., Ho, S.S., Vovk, V.: Conformal Prediction for Reliable
Machine Learning: Theory, Adaptations and Applications. Newnes, Oxford (2014)
4. Buckley, J.J., Hayashi, Y.: Fuzzy neural networks: a survey. Fuzzy Sets Syst. 66(1),
1–13 (1994)
5. Côme, E., Oukhellou, L., Denoeux, T., Aknin, P.: Learning from partially super-
vised data using mixture models and belief functions. Pattern Recogn. 42(3), 334–
348 (2009)
6. Corani, G., Zaffalon, M.: Learning reliable classifiers from small or incomplete data
sets: the naive credal classifier 2. J. Mach. Learn. Res. 9(Apr), 581–621 (2008)
7. Coz, J.J.D., Díez, J., Bahamonde, A.: Learning nondeterministic classifiers. J.
Mach. Learn. Res. 10(Oct), 2273–2293 (2009)
8. Cucchiella, F., D’Adamo, I., Koh, S.L., Rosa, P.: Recycling of weees: an economic
assessment of present and future e-waste streams. Renew. Sustain. Energy Rev.
51, 263–272 (2015)
9. Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping.
In: Yager, R.R., Liu, L. (eds.) Classic Works of the Dempster-Shafer Theory of
Belief Functions. Studies in Fuzziness and Soft Computing, vol. 219. Springer,
Heidelberg (2008). https://doi.org/10.1007/978-3-540-44792-4 3
10. Denoeux, T.: A k-nearest neighbor classification rule based on dempster-shafer
theory. IEEE Trans. Syst. Man Cybern. 25(5), 804–813 (1995)
11. Denoeux, T.: A neural network classifier based on dempster-shafer theory. IEEE
Trans. Syst. Man Cybern. Part A Syst. Hum. 30(2), 131–150 (2000)
12. Denoeux, T.: Maximum likelihood estimation from uncertain data in the belief
function framework. IEEE Trans. Knowl. Data Eng. 25(1), 119–130 (2013)
13. Denoeux, T.: Logistic regression, neural networks and dempster-shafer theory: a
new perspective. Knowl.-Based Syst. 176, 54–67 (2019)
14. Dubois, D., Prade, H.: Possibility theory. In: Meyers, R. (ed.) Computational Com-
plexity. Springer, New York (2012). https://doi.org/10.1007/978-1-4614-1800-9
15. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting
compact well-separated clusters. J. Cybern. 3, 32–57 (1973)
16. Kanj, S., Abdallah, F., Denoeux, T., Tout, K.: Editing training data for multi-label
classification with the k-nearest neighbor rule. Pattern Anal. Appl. 19(1), 145–161
(2016)
17. Kanjanatarakul, O., Kaewsompong, N., Sriboonchitta, S., Denoeux, T.: Estimation
and prediction using belief functions: Application to stochastic frontier analysis.
In: Huynh, V.N., Kreinovich, V., Sriboonchitta, S., Suriya, K. (eds.) Econometrics
of Risk. Studies in Computational Intelligence, vol. 583. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-13449-9 12
18. Kassouf, A., Maalouly, J., Rutledge, D.N., Chebib, H., Ducruet, V.: Rapid dis-
crimination of plastic packaging materials using mir spectroscopy coupled with
independent components analysis (ICA). Waste Manage. 34(11), 2131–2138 (2014)
19. Keller, J.M., Gray, M.R., Givens, J.A.: A fuzzy k-nearest neighbor algorithm. IEEE
Trans. Syst. Man Cybern. 4, 580–585 (1985)
Evidential Classification of Incomplete Data 135
An Analogical Interpolation Method
for Enlarging a Training Dataset
1 Introduction
Analogical proportions are statements of the form “a is to b as c is to d”. In the
Nicomachean Ethics, Aristotle makes an explicit parallel between such statements
and geometric proportions of the form “a/b = c/d”, where a, b, c, d are numbers.
They also parallel arithmetic proportions, or difference proportions, which
are of the form “a − b = c − d”. The logical modeling of an analogical proportion
as a quaternary connective between four Boolean items appears to be a logical
counterpart of such numerical proportions [15]. It has been extended to items
described by vectors of Boolean, nominal or numerical values [2].
A particular case of such statements, named continuous analogical propor-
tions, is obtained when the two central components are equal, namely they are
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 136–152, 2019.
https://doi.org/10.1007/978-3-030-35514-2_11
a : b :: c : d = (¬a ∧ b ≡ ¬c ∧ d) ∧ (¬b ∧ a ≡ ¬d ∧ c)
See [13,16] for justifications. This expression is true for only 6 patterns of values
for abcd, namely {0000, 0011, 0101, 1111, 1100, 1010}. This extends to nominal
values where a : b :: c : d holds true if and only if abcd is one of the following
patterns ssss, stst, or sstt, where s and t are two possible distinct values of
items a, b, c and d.
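As a quick sanity check of this truth table, a small Python sketch (ours, not part of the paper) can evaluate the connective and enumerate the Boolean quadruples for which it holds:

```python
from itertools import product

def analogy(a, b, c, d):
    # Logical analogical proportion a : b :: c : d, following the definition
    # (¬a ∧ b ≡ ¬c ∧ d) ∧ (¬b ∧ a ≡ ¬d ∧ c)
    a, b, c, d = map(bool, (a, b, c, d))
    return ((not a and b) == (not c and d)) and ((not b and a) == (not d and c))

# Enumerate the quadruples on which the proportion holds:
# exactly the six patterns 0000, 0011, 0101, 1010, 1100, 1111.
valid = sorted(q for q in product([0, 1], repeat=4) if analogy(*q))
```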
Regarding continuous analogical proportions, it can be easily checked that
the unique solutions of the equations 1 : x :: x : 1 and 0 : x :: x : 0 are respectively
x = 1 and x = 0, while 1 : x :: x : 0 and 0 : x :: x : 1 have no solution in the
Boolean case.
138 M. Bounhas and H. Prade
the vectorial case [10] in this paper. Namely, we shall say x is between a and c
defined as:
Then we can define the set Between(a, c) of vectors between two vectors a and
c. For instance, we have Between(01000, 11010) = {01000, 11000, 01010, 11010}.
Note that in the case of Boolean values, the betweenness condition can also be
written as ∀i = 1, · · · , m, (ai ∧ ci → xi) ∧ (xi → ai ∨ ci) = 1.
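For illustration, the set Between(a, c) can be enumerated directly for integer-coded vectors; the following sketch (ours, not the authors' code) reproduces the Boolean example above:

```python
from itertools import product

def between_set(a, c):
    # Enumerate all vectors x such that each component x_i lies between
    # a_i and c_i (inclusive), attribute by attribute; Boolean vectors
    # are the special case where values range over {0, 1}.
    ranges = [range(min(ai, ci), max(ai, ci) + 1) for ai, ci in zip(a, c)]
    return set(product(*ranges))
```

For instance, `between_set((0, 1, 0, 0, 0), (1, 1, 0, 1, 0))` returns the four vectors of Between(01000, 11010).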
3 Related Work
The idea of generating, or completing, a third example from two examples can be
encountered in different settings. An option, quite different from interpolation, is
the “feature knock out” method [23], where a third example is built by modifying
a randomly chosen feature of the first example with that of the second one. A
somewhat related idea can be found in a recent proposal [3], which introduces
a measure of oddness with respect to a class, computed on the basis of
pairs made of two nearest neighbors in the same class; this amounts to replacing
the two neighbors with a fictitious representative of the class.
Reasoning with a system of fuzzy if-then rules provides an interpolation
mechanism [14], which, from these rules and an input “in-between” their con-
dition parts, yields a new conclusion “in-between” their conclusion parts, by
taking advantage of membership functions that can be seen as defining fuzzy
“neighborhoods”.
Moreover, several approaches based on the use of interpolation and analog-
ical proportions have been developed in the past decade. In [17], the problem
considered is to complete a set of parallel if-then rules, represented by a set of
condition variables associated to a conclusion variable. The values of the vari-
ables are assumed to belong to finite sets of ordered labels. The basic idea is
to apply analogical proportion inference in order to induce missing rules from
an initial set of rules, when an analogical proportion holds between the variable
labels of several parallel rules. Although this approach may seem close to the
analogical interpolation-based approach proposed in this paper, our goal is not to
predict just the conclusion part of an incomplete rule, but rather a whole exam-
ple including its attribute-based description and its class. Moreover, we restrict
our study to the use of pairs of examples for this prediction, while in [17] the
authors use both pairs and triples of rules for completing rules. An extended
version of the above-mentioned work has been presented in [22], where the authors
also propose a more cautious method that makes explicit the basic assumptions
under which rule conclusions are produced from analogical proportions. Along
the same line, see also [21] on interpolation between default rules.
Let us also mention the general approach proposed by Schockaert and Prade
[20] to interpolative and extrapolative reasoning from incomplete generic knowl-
edge represented by sets of symbolic rules, handled in a purely qualitative man-
ner, where labels are represented in conceptual spaces. This work is an extended
The basic idea is to find pairs of examples (a, c) ∈ E 2 with known labels such
that the analogical proportion (3) is solvable attribute by attribute, i.e., there
exists x such that aj : xj :: xj : cj = 1 for each attribute j = 1, ..., m, and the
class equation has cl(x) as a solution, i.e., cl(a) : cl(x) :: cl(x) : cl(c) = 1.
As mentioned before in Sect. 2, the solution of the previous equation aj :
xj :: xj : cj = 1 in the numerical case is just the midpoint xj = (aj + cj)/2 for
each attribute j = 1, ..., m. In this paper, we are interested in the case of ordered
nominal values. Moreover, we assume that the distances between any two
successive values in such an ordered set of values are the same. Let V = {v1, · · · , vk}
be an ordered set of nominal values; then vi will be regarded as the midpoint of
vi−j and vi+j with j ≥ 1, provided that both vi−j and vi+j exist. For instance,
if V = {1, · · · , 5}, the analogical proportions 1 : 3 :: 3 : 5 and 2 : 3 :: 3 : 4 hold,
while 2 : x :: x : 5 = 1 has no solution. So it is clear that some pairs (a, c) will
not lead to any solution, since we restrict the search space to the pairs for which
the midpoint (attribute by attribute) exists.
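On an integer-coded ordered scale (an encoding we assume for illustration; this is not the authors' code), the midpoint condition reads:

```python
def nominal_midpoint(a, c):
    # Solution x of a : x :: x : c on an integer-coded ordered scale with
    # equidistant values; returns None when no exact midpoint exists.
    return (a + c) // 2 if (a + c) % 2 == 0 else None
```

With V = {1, ..., 5} coded as integers, `nominal_midpoint(1, 5)` yields 3 (so 1 : 3 :: 3 : 5 holds), while `nominal_midpoint(2, 5)` yields None, matching the unsolvable equation 2 : x :: x : 5.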
This condition may be too restrictive, especially for datasets with a high number
of attributes, and may thus reduce the set of predicted examples. In case of
success, the predicted example x = {x1, ..., xj, ..., xm} will be assigned the
predicted class label cl(x) and saved in a candidate set.
Since different voting pairs may predict the same example x more than once
(x may be the midpoint of more than one pair (a, c)), a candidate example may
have different class labels. One then has to perform a vote on the class labels of
each candidate example classified differently in the candidate set. This leads to
the final predicted set of examples, where each example is classified uniquely.
This process can be described by the following procedure:
1. Find solvable pairs (a, c) such that Eq. 3 has a unique non-null solution x.
2. In case of ties (an example x is predicted with different class labels), apply
voting on all its predicted class labels and assign to x the winning label.
3. Add x to the set of predicted examples (together with cl(x)).
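The three steps above can be sketched as follows; this is our own minimal illustration, assuming attributes and class labels are integer-coded ordered values (not the authors' implementation):

```python
from collections import Counter
from itertools import combinations

def scalar_midpoint(a, c):
    # Midpoint of two integer-coded ordered values; None if it does not exist.
    return (a + c) // 2 if (a + c) % 2 == 0 else None

def vector_midpoint(a, c):
    # Attribute-wise midpoint of two vectors; None if any attribute fails.
    x = tuple(scalar_midpoint(aj, cj) for aj, cj in zip(a, c))
    return None if None in x else x

def predict_examples(examples):
    # examples: list of (attribute_tuple, class_label) pairs.
    # Step 1: for every solvable pair, record the midpoint example and the
    # midpoint class label; step 2: vote when an example gets several labels;
    # step 3: return the final predicted set.
    candidates = {}
    for (a, ca), (c, cc) in combinations(examples, 2):
        x = vector_midpoint(a, c)
        cx = scalar_midpoint(ca, cc)  # the class equation must be solvable too
        if x is None or cx is None:
            continue
        candidates.setdefault(x, []).append(cx)
    return {x: Counter(labels).most_common(1)[0][0]
            for x, labels in candidates.items()}
```

For instance, from the two examples ((1, 1), 0) and ((3, 3), 2), the pair is solvable and yields the new example (2, 2) with class label 1.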
In the next section, we first present a basic algorithm applying the process
described above, then we propose two options that may help to improve the
search space for the voting pairs.
4.2 Algorithms
The simplest way is to systematically consider all pairs (a, c) ∈ E 2 , for which
Eq. 3 is solvable, as candidate pairs for prediction. Algorithm 1 implements a
basic Analogical Interpolation-based Predictor, denoted AIPstd , without applying
any filter on the voting pairs.
Considering all pairs (a, c) for prediction may seem unreasonable, especially
when the domain of attribute values is large, since this may blur the prediction
results. A first improvement of Algorithm 1 is to restrict the search for pairs to
those that are among the nearest neighbors (in terms of Hamming distance) to
the example to be predicted.
Let us consider two different pairs (a, c) and (d, e) ∈ E 2. We assume that
a : x :: x : c = 1 produces as solution an example b, and that d : x :: x : e = 1
produces another example b′ ≠ b. If b′ is closer to (d, e) than b is to (a, c)
in terms of Hamming distance, it is more reasonable to consider only the pair
(d, e) for prediction. This means that example b′ will be predicted while b will be
rejected. In the following, we denote this second, improved Algorithm 2 by AIPN N .
Algorithm 3 (that we denote AIPN N,SC ) is exactly the same as Algorithm 2 in
all respects, except that we only look for pairs (a, c) belonging to the same class.
Note that the two algorithms only differ for non-binary classification
problems, since s : x :: x : t = 1 has no solution in {0, 1} for s ≠ t.
As can be seen in the next section, searching for the best pairs (described in
Algorithms 2 and 3) limits the number of accepted voting pairs. Moreover, there
is a second constraint to be satisfied, namely limiting the solutions of Eq. 3 to
the values of x that are the midpoint of a and c, which is hard to satisfy
in the ordered nominal setting. To relax this last constraint, we may think of
using the “betweenness” definition given in Eq. 4. In this definition, the equation
between(a, x, c) = 1 has, as a solution, any x such that xj is between aj and cj
for each attribute j ∈ {1, ..., m}. This last option is implemented by the algorithm
denoted AIPBtw , which is exactly the same as Algorithm 3 except that we use
definition (4) to solve the analogical interpolation.
In this section, we aim to evaluate the efficiency of the proposed algorithms for
predicting new examples. For this purpose, we first run different standard ML
classifiers on the original dataset, then we apply each AI-Predictor to generate a
new set of predicted examples that is used to enlarge the original dataset. This
leads us to four different enlarged datasets, one for each proposed algorithm.
Finally, we re-evaluate the ML classifiers on each of these completed datasets.
For both original and enlarged datasets, we apply the testing protocol presented
in the next sub-section.
In this experimentation, we tested the following standard ML classifiers:
• IBk: a k-NN classifier; we use the Manhattan distance and tune the
classifier over different values of the parameter k = 1, 2, ..., 11.
• C4.5: generates a pruned or unpruned C4.5 decision tree. We tune the
classifier with different confidence factors used for pruning, C = 0.1, 0.2, ..., 0.5.
• JRip: a propositional rule learner (Repeated Incremental Pruning to Produce
Error Reduction, RIPPER). We tune the classifier for different numbers of
optimization runs, O = 2, 4, ..., 10, and we apply pruning.
(computed as the ratio of the number of correct predictions to the total number
of test examples) for each fold. However, each ML classifier requires a parameter
to be tuned before performing this cross-validation.
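For concreteness, the accuracy measure just described amounts to the following trivial sketch:

```python
def accuracy(y_true, y_pred):
    # Ratio of the number of correct predictions to the total number
    # of test examples in the fold.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```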
Results of ML-Classifiers. Accuracy results for IBk, C4.5 and JRIP are
obtained by applying the free Weka implementation to the enlarged
datasets obtained from the AI-Predictors. To run IBk, C4.5 and JRIP, we first
optimize the corresponding parameter for each classifier, using the meta
CVParameterSelection class provided by Weka, with a cross-validation applied to the
training set only. This enables us to select the best value of the parameter for
each dataset; we then train and test the classifier using this selected value of the
parameter.
Table 2 provides the classification results of the ML classifiers obtained with a
10-fold cross-validation and for the best/optimized value of the tuned parameter
(denoted p in this table).
Results in the previous table show that:
interpolation, which treats all attributes equally, is not compatible with this
kind of classification.
– The good improvement observed for the Monk2 dataset confirms our previous
intuition since, contrary to Monk1 and Monk3, in Monk2 all attributes are
involved in defining the class label.
– The standard Algorithm 1 outperforms the other algorithms in the case of the
Cancer and Breast Cancer datasets. It is important to note that only these two
datasets include attributes with a large range of values (with a maximum of 10
different values for Cancer and 13 different values for Breast Cancer). Moreover,
the number of attributes is also high compared to the other datasets. We expect
that, when ordered nominal data are represented on a large scale, using only
nearest-neighbor pairs for prediction is too restrictive and leads to a local
search for new examples.
– There is no particular algorithm that provides the best results for all datasets.
– We computed the average accuracy for each proposed algorithm and for each
ML classifier over all datasets. Results are given at the end of Table 2. We can
note that the IBk classifier achieves the best accuracy when using the enlarged
data built from the AIPN N,SC algorithm, while C4.5 and JRIP perform
better when applied to the dataset built from the AIPBtw algorithm.
– Overall, the IBk classifier shows the highest classification accuracy over all
datasets.
In this first study, the improved results of ML classifiers when applied to enlarged
datasets show the ability of the proposed algorithms (especially, AIPN N,SC and
AIPBtw ) to predict examples that are labeled with the suitable class.
– In seven among ten datasets, the proportion of predicted examples that are
successfully classified is 100%. This means that all predicted examples that
match the original set are assigned the correct class label and thus are
fully compatible with the original set (see for example Monk2, Breast Cancer,
Hayes Roth and Nursery).
– Predicting accurate examples in these datasets may explain why ML classi-
fiers show high classification improvement when applied to the new enlarged
dataset.
– Although the AIPN N,SC algorithm succeeds in predicting accurate examples,
the number of predicted examples is very small for some datasets, such as
Breast Cancer, Voting and Cancer. This is due to the fact that, in this
algorithm, we restrict the search to nearest-neighbor pairs belonging to the same
class. It is important to note that these datasets contain a large number
of attributes, which makes the pair-filtering process more constraining.
– As can be seen in Table 3, the size of the predicted sets is considerably
increased for these three datasets when applying the AIPStd algorithm, which is
less constraining than AIPN N,SC (520 examples instead of 46 are predicted
for the Cancer dataset). In Table 2, we also noticed that, only for these three cited
datasets, IBk performs considerably better when applied to the datasets built
from the standard algorithm AIPStd (which produces larger sets). It is clear that
when the predicted set is very small, the enlarged dataset remains similar
to the original set, which is why the improvement of the ML classifiers
cannot be clearly noticed for the datasets predicted from the AIPN N,SC
algorithm.
– Lastly, for some datasets such as Monk1 and Monk3, the proportion of
predicted examples that are compatible with the original set is low compared
to the other datasets. As explained before, in the original sets the classification
function involves only 2 among 6 attributes, which seems incompatible with
continuous analogical interpolation, where all attributes, as well as the
class label, are assumed to be the midpoints of those of the pair
used for prediction.
Table 3. Nbr. of predicted examples, proportion of predicted examples that are com-
patible with the original set
Table 4. Results for ML classifiers obtained with the enlarged datasets and comparison
with AP-Classifier [2]
6 Conclusion
This paper has studied the idea of enlarging a training set using analogical
proportions as in [4], with two main differences: we only consider pairs of examples,
by using continuous analogical proportions, which reduces the complexity
from cubic to quadratic, and we experiment with ordered nominal
datasets instead of Boolean ones.
On the one hand, the results obtained by classical machine learning methods
on the enlarged training sets generally improve on those obtained by applying these
methods to the original training sets. On the other hand, these results, obtained
with a smaller level of complexity, are often not so far from those obtained by
directly applying the analogical proportion-based classification method to the
original training set [2].
References
1. Bayoudh, S., Mouchère, H., Miclet, L., Anquetil, E.: Learning a classifier with very
few examples: analogy based and knowledge based generation of new examples for
character recognition. In: Kok, J.N., Koronacki, J., Mantaras, R.L., Matwin, S.,
Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 527–
534. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74958-5 49
2. Bounhas, M., Prade, H., Richard, G.: Analogy-based classifiers for nominal or
numerical data. Int. J. Approximate Reasoning 91, 36–55 (2017)
3. Bounhas, M., Prade, H., Richard, G.: Oddness-based classification: a new way of
exploiting neighbors. Int. J. Intell. Syst. 33(12), 2379–2401 (2018)
4. Couceiro, M., Hug, N., Prade, H., Richard, G.: Analogy-preserving functions: a way
to extend Boolean samples. In: Proceedings 26th International Joint Conference
on Artificial Intelligence, IJCAI 2017, Melbourne, 19–25 August, pp. 1575–1581
(2017)
5. Derrac, J., Schockaert, S.: Inducing semantic relations from conceptual spaces: a
data-driven approach to plausible reasoning. Artif. Intell. 228, 66–94 (2015)
6. Dubois, D., Prade, H., Richard, G.: Multiple-valued extensions of analogical pro-
portions. Fuzzy Sets Syst. 292, 193–202 (2016)
7. Goodfellow, I., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M.,
Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Informa-
tion Processing Systems 27, pp. 2672–2680. Curran Associates, Inc. (2014)
8. Hsu, C., Chang, C., Lin, C.: A practical guide to support vector classification.
Technical report, Department of Computer Science, National Taiwan University
(2010)
9. Inoue, H.: Data augmentation by pairing samples for images classification. CoRR
abs/1801.02929 (2018). http://arxiv.org/abs/1801.02929
10. Lieber, J., Nauer, E., Prade, H., Richard, G.: Making the best of cases by approxi-
mation, interpolation and extrapolation. In: Cox, M.T., Funk, P., Begum, S. (eds.)
ICCBR 2018. LNCS (LNAI), vol. 11156, pp. 580–596. Springer, Cham (2018).
https://doi.org/10.1007/978-3-030-01081-2 38
11. Mertz, J., Murphy, P.: UCI repository of machine learning databases (2000).
ftp://ftp.ics.uci.edu/pub/machine-learning-databases
12. Miclet, L., Bayoudh, S., Delhay, A.: Analogical dissimilarity: definition, algorithms
and two experiments in machine learning. JAIR 32, 793–824 (2008)
13. Miclet, L., Prade, H.: Handling analogical proportions in classical logic and fuzzy
logics settings. In: Sossai, C., Chemello, G. (eds.) ECSQARU 2009. LNCS (LNAI),
vol. 5590, pp. 638–650. Springer, Heidelberg (2009). https://doi.org/10.1007/978-
3-642-02906-6 55
14. Perfilieva, I., Dubois, D., Prade, H., Esteva, F., Godo, L., Hodáková, P.: Inter-
polation of fuzzy data: analytical approach and overview. Fuzzy Sets Syst. 192,
134–158 (2012)
15. Prade, H., Richard, G.: From analogical proportion to logical proportions. Logica
Universalis 7(4), 441–505 (2013)
16. Prade, H., Richard, G.: Analogical proportions: from equality to inequality. Int. J.
Approximate Reasoning 101, 234–254 (2018)
17. Prade, H., Schockaert, S.: Completing rule bases in symbolic domains by analogy
making. In: Galichet, S., Montero, J., Mauris, G. (eds.) Proceedings 7th Conference
European Society for Fuzzy Logic and Technology (EUSFLAT), Aix-les-Bains, 18–
22 July, pp. 928–934. Atlantis Press (2011)
18. Schockaert, S., Prade, H.: Interpolation and extrapolation in conceptual spaces: a
case study in the music domain. In: Rudolph, S., Gutierrez, C. (eds.) RR 2011.
LNCS, vol. 6902, pp. 217–231. Springer, Heidelberg (2011). https://doi.org/10.
1007/978-3-642-23580-1 16
19. Schockaert, S., Prade, H.: Qualitative reasoning about incomplete categorization
rules based on interpolation and extrapolation in conceptual spaces. In: Benferhat,
S., Grant, J. (eds.) SUM 2011. LNCS (LNAI), vol. 6929, pp. 303–316. Springer,
Heidelberg (2011). https://doi.org/10.1007/978-3-642-23963-2 24
20. Schockaert, S., Prade, H.: Interpolative and extrapolative reasoning in proposi-
tional theories using qualitative knowledge about conceptual spaces. Artif. Intell.
202, 86–131 (2013)
21. Schockaert, S., Prade, H.: Interpolative reasoning with default rules. In: Rossi, F.
(ed.) IJCAI 2013, Proceedings 23rd International Joint Conference on Artificial
Intelligence, Beijing, 3–9 August, pp. 1090–1096 (2013)
22. Schockaert, S., Prade, H.: Completing symbolic rule bases using betweenness and
analogical proportion. In: Prade, H., Richard, G. (eds.) Computational Approaches
to Analogical Reasoning: Current Trends. SCI, vol. 548, pp. 195–215. Springer,
Heidelberg (2014). https://doi.org/10.1007/978-3-642-54516-0 8
23. Wolf, L., Martin, I.: Regularization through feature knock out. MIT Computer
Science and Artificial Intelligence Laboratory (CBCL Memo 242) (2004)
Towards a Reconciliation Between
Reasoning and Learning - A Position
Paper
IRIT - CNRS, 118 route de Narbonne, 31062 Toulouse Cedex 09, France
{dubois,prade}@irit.fr
1 Introduction
What is artificial intelligence (AI) about? What are the research topics that
belong to AI? What are the topics that stand outside? In other words, what
are the contours of AI? Answers to these questions may have evolved with time,
as did the issue of the proper way (if any) of doing AI. Indeed, over time, AI
has been successively dominated by logical approaches (until the mid-1990s),
giving birth to so-called “symbolic AI”, then by (Bayesian) probabilistic
approaches, and more recently by another type of numerical approach, artificial
neural networks. This state of affairs has contributed to developing antagonistic
feelings between different schools of thought, including claims of supremacy of
some methods over others, rather than fostering attempts to understand the
A preliminary version of this paper was presented at the 2018 IJCAI-ECAI workshop
“Learning and Reasoning: Principles & Applications to Everyday Spatial and Temporal
Knowledge”, Stockholm, July 13–14.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 153–168, 2019.
https://doi.org/10.1007/978-3-030-35514-2_12
154 D. Dubois and H. Prade
1 One would notice the word ‘logical’ in the title of this pioneering paper.
had already built a neural network learning machine (he was also a friend of
Rosenblatt [79], the inventor of perceptrons).
The interests of the six other participants can be roughly divided into
reasoning and learning concerns. On the one hand, there were Simon (1916–2001)
and Newell (1927–1992) [63] (authors, together with John Clifford Shaw (1922–1991),
of a program, the Logic Theorist, able to prove theorems in mathematical logic),
and More [60] (a logician interested in natural deduction at that time); on
the other hand, Samuel (1901–1990) [81] (author of programs for checkers,
and later chess games), Selfridge (1926–2008) [84] (one of the fathers of pattern
recognition), and Solomonoff (1926–2009) [90] (already the author of a theory of
probabilistic induction).
Interestingly enough, as can be seen, these ten participants, with different
backgrounds ranging from psychology to electrical engineering, physics and
mathematics, already carried a large variety of research directions
that are still present in modern AI, from machine learning to knowledge
representation and reasoning.
Pinkas [68,69] where the idea of penalty logic (related to belief functions [31]) has
been developed in relation with neural networks, where penalty weights reflect
priorities attached to logical constraints to be satisfied by a neural network
[70]. Penalty logics and Markov logic [77] are also closely related to possibilistic
logic [30].
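To make the penalty-logic idea concrete, here is a schematic sketch (our illustration, with hypothetical toy constraints, not tied to any of the cited systems): the cost of an interpretation is the sum of the penalty weights of the constraints it violates, so lower-cost interpretations better respect the prioritized constraints.

```python
def penalty_cost(interpretation, weighted_constraints):
    # Penalty logic: each constraint is a Boolean function over an
    # interpretation, carrying a penalty weight; the cost of an
    # interpretation is the sum of the weights of the violated constraints.
    return sum(w for constraint, w in weighted_constraints
               if not constraint(interpretation))

# Hypothetical toy constraints over Boolean variables a and b:
# "a implies b" with penalty 2, and "a" with penalty 1.
constraints = [
    (lambda m: (not m["a"]) or m["b"], 2),
    (lambda m: m["a"], 1),
]
```

For instance, the interpretation {a: True, b: False} violates only the first constraint, giving cost 2, while {a: True, b: True} has cost 0.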
Another intriguing question would be to explore possible relations between
spiking neurons [12], which are physiologically more plausible than classical
artificial neurons and fire when conjunctions of thresholds are reached, and
Sugeno integrals (then viewed as a System 1-like black box) and their logical
counterparts [29] (corresponding to a System 2-like representation).
6 Conclusion
Knowledge representation and reasoning on the one hand, and machine learning
on the other hand, have been developed largely as independent research trends
in artificial intelligence over the last three decades. Yet, reasoning and learning
are two basic capabilities of the human mind that do interact. Similarly, the two
corresponding AI research areas may benefit from mutual exchanges. Current
learning methods derive know-how from data in the form of complex functions
involving many tuning parameters, but they should also aim at producing artic-
ulated knowledge, so that repositories, storing interpretable chunks of informa-
tion, could be fed from data. More precisely, a number of logical-like formalisms,
whose explanatory capabilities could be exploited, have been developed in the
last 30 years (non-monotonic logics, modal logics, logic programming, probabilis-
tic and possibilistic logics, many-valued logics, etc.) that could be used as target
languages for learning techniques, without restricting to first-order logic, nor to
Bayes nets.
Interfacing classifiers with human users may require some ability to provide
high-level explanations of recommendations or decisions that are understandable
by an end-user. Reasoning methods should handle knowledge and information
extracted from data. The joint use of (supervised or unsupervised) machine
learning techniques and of inference machineries raises new issues. There is a
number of other points worth mentioning which have not been addressed in the
above discussions:
This paper has especially advocated the value of a cooperation between two basic areas of AI: knowledge representation and reasoning on the one hand, and machine learning on the other hand, reflecting the natural cooperation between two modes of human intelligence, respectively reactive and deliberative. It is also a plea for maintaining a unified view of AI, all facets of which have been present from the very beginning, as recalled in Sect. 2 of this paper. It is time for AI to come of age as a genuine science, which means ending unproductive rivalries between different approaches and fostering a better shared understanding of the basics of AI through open-minded studies bridging sub-areas in a constructive way. In the same spirit, a plea for a unified view of computer science can be found in [6]. Mixing, bridging, and hybridizing advanced ideas in knowledge representation, reasoning, and machine learning or data mining should renew basic research in AI and contribute in the long term to a more unified view of AI methodology.
The interested reader may follow the work in progress of the group "Amel" [2], which aims at a better mutual understanding of research trends in knowledge representation, reasoning, and machine learning, and of how they could cooperate.
References
1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets
of items in large databases. In: Buneman, P., Jajodia, S. (eds.) Proceedings 1993
ACM SIGMOD International Conference on Management of Data, Washington,
DC, 26–28 May 1993, pp. 207–216. ACM Press (1993)
2. Amel, K.R.: From shallow to deep interactions between knowledge representation,
reasoning and machine learning. In: Ben Amor, N., Theobald, M. (eds.) Proceedings
13th International Conference on Scalable Uncertainty Management (SUM 2019),
Compiègne, 16–18 December 2019. LNCS, Springer, Heidelberg (2019)
3. Augustin, T., Coolen, F.P.A., De Cooman, G., Troffaes, M.C.M.: Introduction to
Imprecise Probabilities. Wiley, Hoboken (2014)
4. Baader, F., Horrocks, I., Lutz, C., Sattler, U.: An Introduction to Description
Logic. Cambridge University Press, Cambridge (2017)
5. Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random
fields and probabilistic soft logic. J. Mach. Learn. Res. 18, 109:1–109:67 (2017)
6. Bajcsy, R., Reynolds, C.W.: Computer science: the science of and about informa-
tion and computation. Commun. ACM 45(3), 94–98 (2002)
7. Balkenius, C., Gärdenfors, P.: Nonmonotonic inferences in neural networks. In:
Proceedings 2nd International Conference on Principle of Knowledge Representa-
tion and Reasoning (KR 1991), Cambridge, MA, pp. 32–39 (1991)
164 D. Dubois and H. Prade
8. Benferhat, S., Dubois, D., Prade, H.: Possibilistic and standard probabilistic
semantics of conditional knowledge bases. J. Log. Comput. 9(6), 873–895 (1999)
9. Benferhat, S., Dubois, D., Lagrue, S., Prade, H.: A big-stepped probability
approach for discovering default rules. Int. J. Uncert. Fuzz. Knowl.-Based Syst.
11(Suppl.-1), 1–14 (2003)
10. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new
perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
11. Besold, T.R., Garcez, A.D.A., Stenning, K., van der Torre, L., van Lambalgen,
M.: Reasoning in non-probabilistic uncertainty: logic programming and neural-
symbolic computing as examples. Minds Mach. 27(1), 37–77 (2017)
12. Bichler, O., Querlioz, D., Thorpe, S.J., Bourgoin, J.-P., Gamrat, C.: Extraction
of temporally correlated features from dynamic vision sensors with spike-timing-
dependent plasticity. Neural Netw. 32, 339–348 (2012)
13. Bounhas, M., Pirlot, M., Prade, H., Sobrie, O.: Comparison of analogy-based
methods for predicting preferences. In: Ben Amor, N., Theobald, M. (eds.) Pro-
ceedings 13th International Conference on Scalable Uncertainty Management
(SUM 2019), Compiègne, 16–18 December 2019. LNCS, Springer, Heidelberg (2019)
14. Bounhas, M., Prade, H., Richard, G.: Analogy-based classifiers for nominal or
numerical data. Int. J. Approx. Reasoning 91, 36–55 (2017)
15. Brabant, Q., Couceiro, M., Dubois, D., Prade, H., Rico, A.: Extracting decision
rules from qualitative data via Sugeno utility functionals. In: Medina, J., Ojeda-
Aciego, M., Verdegay, J.L., Pelta, D.A., Cabrera, I.P., Bouchon-Meunier, B., Yager,
R.R. (eds.) IPMU 2018. CCIS, vol. 853, pp. 253–265. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-91473-2 22
16. Cohen, W.W.: TensorLog: a differentiable deductive database. CoRR,
abs/1605.06523 (2016)
17. Cohen, W.W., Yang, F., Mazaitis, K.: TensorLog: deep learning meets probabilistic
DBs. CoRR, abs/1707.05390 (2017)
18. Couceiro, M., Hug, N., Prade, H., Richard, G.: Analogy-preserving functions: a way
to extend Boolean samples. In: Proceedings 26th International Joint Conference
on Artificial Intelligence, (IJCAI 2017), Melbourne, 19–25 August, pp. 1575–1581
(2017)
19. Couso, I., Dubois, D.: A general framework for maximizing likelihood under incom-
plete data. Int. J. Approx. Reasoning 93, 238–260 (2018)
20. d’Alché-Buc, F., Andrés, V., Nadal, J.-P.: Rule extraction with fuzzy neural net-
work. Int. J. Neural Syst. 5(1), 1–11 (1994)
21. Darwiche, A.: Human-level intelligence or animal-like abilities?. CoRR,
abs/1707.04327 (2017)
22. d’Avila Garcez, A.S., Broda, K., Gabbay, D.M.: Symbolic knowledge extraction
from trained neural networks: a sound approach. Artif. Intell. 125(1–2), 155–207
(2001)
23. d’Avila Garcez, A.S., Gabbay, D.M., Lamb, L.C.: Value-based argumentation
frameworks as neural-symbolic learning systems. J. Logic Comput. 15(6), 1041–
1058 (2005)
24. d’Avila Garcez, A.S., Lamb, L.C., Gabbay, D.M.: Connectionist modal logic: repre-
senting modalities in neural networks. Theor. Comput. Sci. 371(1–2), 34–53 (2007)
25. Donadello, I., Serafini, L., Garcez, A.D.A.: Logic tensor networks for semantic
image interpretation. In: Sierra, C. (ed) Proceedings 26th International Joint Con-
ference on Artificial Intelligence (IJCAI 2017), Melbourne, 19–25 August 2017, pp.
1596–1602 (2017)
Towards a Reconciliation Between Reasoning and Learning 165
26. Dubois, D., Godo, L., Prade, H.: Weighted logics for artificial intelligence - an
introductory discussion. Int. J. Approx. Reasoning 55(9), 1819–1829 (2014)
27. Dubois, D., Prade, H.: Soft computing, fuzzy logic, and artificial intelligence. Soft
Comput. 2(1), 7–11 (1998)
28. Dubois, D., Prade, H., Richard, G.: Multiple-valued extensions of analogical pro-
portions. Fuzzy Sets Syst. 292, 193–202 (2016)
29. Dubois, D., Prade, H., Rico, A.: The logical encoding of Sugeno integrals. Fuzzy
Sets Syst. 241, 61–75 (2014)
30. Dubois, D., Prade, H., Schockaert, S.: Generalized possibilistic logic: foundations
and applications to qualitative reasoning about uncertainty. Artif. Intell. 252, 139–
174 (2017)
31. Dupin de Saint-Cyr, F., Lang, J., Schiex, T.: Penalty logic and its link with
Dempster-Shafer theory. In: de Mántaras, R.L., Poole, D. (eds.) Proceedings 10th
Annual Conference on Uncertainty in Artificial Intelligence (UAI 1994), Seattle,
29–31 July, pp. 204–211 (1994)
32. Fahandar, M.A., Hüllermeier, E.: Learning to rank based on analogical reasoning.
In: Proceedings 32th National Conference on Artificial Intelligence (AAAI 2018),
New Orleans, 2–7 February 2018 (2018)
33. Fakhraei, S., Raschid, L., Getoor, L.: Drug-target interaction prediction for drug
repurposing with probabilistic similarity logic. In: SIGKDD 12th International
Workshop on Data Mining in Bioinformatics (BIOKDD). ACM (2013)
34. Farnadi, G., Bach, S.H., Moens, M.F., Getoor, L., De Cock, M.: Extending PSL
with fuzzy quantifiers. In: Papers from the 2014 AAAI Workshop Statistical Rela-
tional Artificial Intelligence, Québec City, 27 July, pp. WS-14-13, 35–37 (2014)
35. Gilboa, I., Schmeidler, D.: Case-based decision theory. Q. J. Econ. 110, 605–639
(1995)
36. Hájek, P., Havránek, T.: Mechanising Hypothesis Formation - Mathematical Foun-
dations for a General Theory. Springer, Heidelberg (1978). https://doi.org/10.
1007/978-3-642-66943-9
37. Hebb, D.O.: The Organization of Behaviour. Wiley, Hoboken (1949)
38. Heitjan, D., Rubin, D.: Ignorability and coarse data. Ann. Statist. 19, 2244–2253
(1991)
39. Hobbes, T.: Elements of philosophy, the first section, concerning body. In:
Molesworth, W. (ed.) The English works of Thomas Hobbes of Malmesbury, vol.
1. John Bohn, London, 1839. English translation of "Elementa Philosophiae I. De
Corpore" (1655)
40. Hohenecker, P., Lukasiewicz, T.: Ontology reasoning with deep neural networks.
CoRR, abs/1808.07980 (2018)
41. Hüllermeier, E.: Inducing fuzzy concepts through extended version space learning.
In: Bilgiç, T., De Baets, B., Kaynak, O. (eds.) IFSA 2003. LNCS, vol. 2715, pp.
677–684. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-44967-1 81
42. Hüllermeier, E.: Learning from imprecise and fuzzy observations: data disambigua-
tion through generalized loss minimization. Int. J. Approx. Reasoning 55(7), 1519–
1534 (2014)
43. Jaeger, M.: Ignorability in statistical and probabilistic inference. JAIR 24, 889–917
(2005)
44. Kahneman, D.: Thinking, Fast and Slow. Farrar, Straus and Giroux, New York
(2011)
45. Kolodner, J.L.: Case-Based Reasoning. Morgan Kaufmann, Burlington (1993)
46. Kotlowski, W., Slowinski, R.: On nonparametric ordinal classification with mono-
tonicity constraints. IEEE Trans. Knowl. Data Eng. 25(11), 2576–2589 (2013)
47. Kraus, S., Lehmann, D., Magidor, M.: Nonmonotonic reasoning, preferential mod-
els and cumulative logics. Artif. Intell. 44, 167–207 (1990)
48. Kuzelka, O., Davis, J., Schockaert, S.: Encoding Markov logic networks in pos-
sibilistic logic. In: Meila, M., Heskes, T. (eds.) Proceedings 31st Conference on
Uncertainty in Artificial Intelligence (UAI 2015), Amsterdam, 12–16 July 2015,
pp. 454–463. AUAI Press (2015)
49. Kuzelka, O., Davis, J., Schockaert, S.: Learning possibilistic logic theories from
default rules. In: Kambhampati, S. (ed.) Proceedings 25th International Joint Con-
ference on Artificial Intelligence (IJCAI 2016), New York, 9–15 July 2016, pp.
1167–1173 (2016)
50. Kuzelka, O., Davis, J., Schockaert, S.: Induction of interpretable possibilistic logic
theories from relational data. In: Sierra, C. (ed.) Proceedings 26th International
Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, 19–25 August
2017, pp. 1153–1159 (2017)
51. LeCun, Y., Bengio, Y., Hinton, G.E.: Deep learning. Nature 521(7553), 436–444
(2015)
52. Mamdani, E.H., Assilian, S.: An experiment in linguistic synthesis with a fuzzy
logic controller. Int. J. Man-Mach. Stu. 7, 1–13 (1975)
53. Marquis, P., Papini, O., Prade, H.: Eléments pour une histoire de l’intelligence
artificielle. In: Panorama de l’Intelligence Artificielle. Ses Bases Méthodologiques,
ses Développements, vol. I, pp. 1–39. Cépaduès (2014)
54. Marquis, P., Papini, O., Prade, H.: Some elements for a prehistory of Artificial
Intelligence in the last four centuries. In: Proceedings 21st European Conference
on Artificial Intelligence (ECAI 2014), Prague, pp. 609–614. IOS Press (2014)
55. McCarthy, J., Minsky, M., Rochester, N., Shannon, C.E.: A proposal for the
Dartmouth summer research project on artificial intelligence, August 31, 1955. AI Mag.
27(4), 12–14 (2006)
56. McCulloch, W.S., Pitts, W.: A logical calculus of ideas immanent in nervous activ-
ity. Bull. Math. Biophys. 5, 115–133 (1943)
57. Miclet, L., Bayoudh, S., Delhay, A.: Analogical dissimilarity: definition, algorithms
and two experiments in machine learning. JAIR 32, 793–824 (2008)
58. Miclet, L., Prade, H.: Handling analogical proportions in classical logic and fuzzy
logics settings. In: Sossai, C., Chemello, G. (eds.) ECSQARU 2009. LNCS (LNAI),
vol. 5590, pp. 638–650. Springer, Heidelberg (2009). https://doi.org/10.1007/978-
3-642-02906-6 55
59. Mitchell, T.: Version spaces: an approach to concept learning. Ph.D. thesis, Stan-
ford (1979)
60. More, T.: On the construction of Venn diagrams. J. Symb. Logic 24(4), 303–304
(1959)
61. Mushthofa, M., Schockaert, S., De Cock, M.: Solving disjunctive fuzzy answer set
programs. In: Calimeri, F., Ianni, G., Truszczynski, M. (eds.) LPNMR 2015. LNCS
(LNAI), vol. 9345, pp. 453–466. Springer, Cham (2015). https://doi.org/10.1007/
978-3-319-23264-5 38
62. Narodytska, N.: Formal analysis of deep binarized neural networks. In: Lang, J.
(ed.) Proceedings 27th International Joint Conference Artificial Intelligence (IJCAI
2018), Stockholm, 13–19 July 2018, pp. 5692–5696 (2018)
63. Newell, A., Simon, H.A.: The logic theory machine: a complex information pro-
cessing system. IRE Trans. Inf. Theory IT-2, 61–79, September 1956. Also: The
Rand Corporation, Santa Monica, CA, Report P-868, 15 June 1956
64. Nilsson, N.J.: The Quest for Artificial Intelligence: A History of Ideas and
Achievements. Cambridge University Press, Cambridge (2010)
65. Nin, J., Laurent, A., Poncelet, P.: Speed up gradual rule mining from stream data!
A B-tree and OWA-based approach. J. Intell. Inf. Syst. 35(3), 447–463 (2010)
66. Pearl, J.: Causality: Models, Reasoning, and Inference, 2nd edn. Cambridge
University Press, Cambridge (2009)
67. Perfilieva, I., Dubois, D., Prade, H., Esteva, F., Godo, L., Hodáková, P.: Inter-
polation of fuzzy data: analytical approach and overview. Fuzzy Sets Syst. 192,
134–158 (2012)
68. Pinkas, G.: Propositional non-monotonic reasoning and inconsistency in symmetric
neural networks. In: Mylopoulos, J., Reiter, R. (eds.) Proceedings 12th Interna-
tional Joint Conference on Artificial Intelligence, Sydney, 24–30 August 1991, pp.
525–531. Morgan Kaufmann (1991)
69. Pinkas, G.: Reasoning, nonmonotonicity and learning in connectionist networks
that capture propositional knowledge. Artif. Intell. 77(2), 203–247 (1995)
70. Pinkas, G., Cohen, S.: High-order networks that learn to satisfy logic constraints.
FLAP J. Appl. Logics IfCoLoG J. Logics Appl. 6(4), 653–694 (2019)
71. Prade, H.: Reasoning with data - a new challenge for AI? In: Schockaert, S., Senel-
lart, P. (eds.) SUM 2016. LNCS (LNAI), vol. 9858, pp. 274–288. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-45856-4 19
72. Prade, H., Richard, G.: From analogical proportion to logical proportions. Logica
Universalis 7(4), 441–505 (2013)
73. Prade, H., Rico, A., Serrurier, M.: Elicitation of Sugeno integrals: a version space
learning perspective. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds.) ISMIS
2009. LNCS (LNAI), vol. 5722, pp. 392–401. Springer, Heidelberg (2009). https://
doi.org/10.1007/978-3-642-04125-9 42
74. Prade, H., Rico, A., Serrurier, M., Raufaste, E.: Elicitating Sugeno integrals:
methodology and a case study. In: Sossai, C., Chemello, G. (eds.) ECSQARU
2009. LNCS (LNAI), vol. 5590, pp. 712–723. Springer, Heidelberg (2009). https://
doi.org/10.1007/978-3-642-02906-6 61
75. Prade, H., Serrurier, M.: Bipolar version space learning. Int. J. Intell. Syst. 23,
1135–1152 (2008)
76. Raufaste, E.: Les Mécanismes Cognitifs du Diagnostic Médical : Optimisation et
Expertise. Presses Universitaires de France (PUF), Paris (2001)
77. Richardson, M., Domingos, P.M.: Markov logic networks. Mach. Learn. 62(1–2),
107–136 (2006)
78. Rocktäschel, T., Riedel, S.: End-to-end differentiable proving. In: Guyon, I., et al.
(eds.) Proceedings 31st Annual Conference on Neural Information Processing Sys-
tems (NIPS 2017), Long Beach, 4–9 December 2017, pp. 3791–3803 (2017)
79. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and
organization in the brain. Psychol. Rev. 65(6), 386–408 (1958)
80. Rückert, U., De Raedt, L.: An experimental evaluation of simplicity in rule learn-
ing. Artif. Intell. 172(1), 19–28 (2008)
81. Samuel, A.: Some studies in machine learning using the game of checkers. IBM J.
3, 210–229 (1959)
82. Schockaert, S., Prade, H.: Interpolation and extrapolation in conceptual spaces: a
case study in the music domain. In: Rudolph, S., Gutierrez, C. (eds.) RR 2011.
LNCS, vol. 6902, pp. 217–231. Springer, Heidelberg (2011). https://doi.org/10.
1007/978-3-642-23580-1 16
83. Schockaert, S., Prade, H.: Interpolative and extrapolative reasoning in proposi-
tional theories using qualitative knowledge about conceptual spaces. Artif. Intell.
202, 86–131 (2013)
84. Selfridge, O.G.: Pandemonium: a paradigm for learning. In: Blake, D.V., Uttley,
A.M. (eds.) Proceedings Symposium on Mechanisation of Thought Processes,
London, 24–27 November 1958, pp. 511–529 (1959)
85. Serafini, L., Garcez, A.S.A.: Logic tensor networks: deep learning and logical rea-
soning from data and knowledge. In: Besold, T.R., Lamb, L.C., Serafini, L., Tabor,
W. (eds.) Proceedings 11th International Workshop on Neural-Symbolic Learning
and Reasoning (NeSy 2016), New York City, 16–17 July 2016, vol. 1768 of CEUR
Workshop Proceedings (2016)
86. Serafini, L., Donadello, I., Garcez, A.S.A.: Learning and reasoning in logic tensor
networks: theory and application to semantic image interpretation. In: Seffah, A.,
Penzenstadler, B., Alves, C., Peng, X. (eds.) Proceedings Symposium on Applied
Computing (SAC 2017), Marrakech, 3–7 April 2017, pp. 125–130. ACM (2017)
87. Serrurier, M., Dubois, D., Prade, H., Sudkamp, T.: Learning fuzzy rules with their
implication operators. Data Knowl. Eng. 60(1), 71–89 (2007)
88. Serrurier, M., Prade, H.: Introducing possibilistic logic in ILP for dealing with
exceptions. Artif. Intell. 171(16–17), 939–950 (2007)
89. Shannon, C.E.: Programming a computer for playing chess. Philos. Mag. (7th
series) XLI (314), 256–275 (1950)
90. Solomonoff, R.J.: An inductive inference machine. Tech. Res. Group, New York
City (1956)
91. Turing, A.M.: Intelligent machinery. Technical report, National Physical Labo-
ratory, London (1948). Also in: Machine Intelligence, vol. 5, pp. 3–23. Edinburgh
University Press (1969)
92. Turing, A.M.: Computing machinery and intelligence. Mind 59, 433–460 (1950)
93. Ughetto, L., Dubois, D., Prade, H.: Implicative and conjunctive fuzzy rules - a
tool for reasoning from knowledge and examples. In: Hendler, J., Subramanian,
D. (eds.) Proceedings 16th National Conference on Artificial Intelligence, Orlando,
18–22 July 1999, pp. 214–219 (1999)
94. Walley, P.: Statistical Reasoning with Imprecise Probabilities. Chapman and Hall,
London (1991)
95. Walley, P.: Measures of uncertainty in expert systems. Artif. Intell. 83(1), 1–58
(1996)
96. Wiener, N.: Cybernetics or Control and Communication in the Animal and the
Machine. Wiley, Hoboken (1949)
97. Zadeh, L.A.: Thinking machines - a new field in electrical engineering. Columbia
Eng. Q. 3, 12–13 (1950)
98. Zadeh, L.A.: Outline of a new approach to the analysis of complex systems and
decision processes. IEEE Trans. Syst. Man Cybern. 3(1), 28–44 (1973)
CP-Nets, π-pref Nets, and Pareto
Dominance
Abstract. Two approaches have been proposed for the graphical han-
dling of qualitative conditional preferences between solutions described
in terms of a finite set of features: Conditional Preference networks (CP-
nets for short) and more recently, Possibilistic Preference networks (π-
pref nets for short). The latter agree with Pareto dominance, in the sense
that if a solution violates a subset of the preferences violated by another
one, the former solution is preferred to the latter. Although such an
agreement might be considered a basic requirement, it was only conjectured
to hold for CP-nets as well. This non-trivial result is established
in the paper. Moreover, it has important consequences for showing that
π-pref nets can at least approximately mimic CP-nets by adding explicit
constraints between symbolic weights encoding the ceteris paribus preferences,
in the case of Boolean features. We further show that dominance
with respect to the extended π-pref nets is polynomial.
1 Introduction
Ceteris Paribus Conditional Preference Networks (CP-nets, for short) [5,6] were
introduced in order to provide a convenient tool for the elicitation of multidi-
mensional preferences and accordingly compare the relative merits of solutions
to a problem. They are based on three assumptions: only ordinal information is
required; the preference statements deal with the values of single decision vari-
ables in the context of fixed values for other variables that influence them; pref-
erences are provided all else being equal (ceteris paribus). CP-nets were inspired
by Bayesian networks (they use a dependency graph, most of the time a directed
acyclic one, whose vertices are variables) but differ from them by being quali-
tative, by their use of the ceteris paribus assumption, and by the fact that the
variables in a CP-net are decision variables rather than random variables. In
the most common form of CP-nets, each preference statement in the prefer-
ence graph translates into a strict preference between two solutions (i.e., value
assignment to all decision variables) differing on a single variable (referred to as
a worsening flip) and the dominance relation between solutions is the transitive
closure of this worsening flip relation.
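The worsening-flip semantics described above can be sketched on a tiny Boolean CP-net. The two-variable network, its preference tables, and the brute-force search below are hypothetical illustrations (the general dominance problem is computationally hard, so exhaustive flip search is only viable for small networks):

```python
# Hypothetical Boolean CP-net: each variable maps to (parents, table), where
# the table gives the preferred value (0 or 1) for each parent assignment.
cpnet = {
    "A": ((), {(): 1}),                 # a > a'  (unconditional)
    "B": (("A",), {(1,): 1, (0,): 0}),  # a: b > b'    a': b' > b
}
VARS = list(cpnet)

def worsening_flips(w):
    """Outcomes reachable from w by a single worsening flip."""
    for i, X in enumerate(VARS):
        parents, table = cpnet[X]
        u = tuple(w[VARS.index(P)] for P in parents)
        if w[i] == table[u]:            # X is at its preferred value: flipping worsens
            yield w[:i] + (1 - w[i],) + w[i + 1:]

def cp_dominates(w, v):
    """True iff some worsening-flip sequence leads from w to v."""
    frontier, seen = {w}, {w}
    while frontier:
        frontier = {f for o in frontier for f in worsening_flips(o)} - seen
        seen |= frontier
    return v != w and v in seen

# (a, b) dominates (a, b') via a single worsening flip on B:
print(cp_dominates((1, 1), (1, 0)))  # True
```

The dominance relation is the transitive closure of the one-flip relation, which the breadth-first search computes explicitly.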
Another kind of conditional preference network, called π-pref nets, has been
more recently introduced [1], and is directly inspired by the counterpart of
Bayesian networks in possibility theory, called possibilistic networks [3]. A π-pref
net shares with CP-nets its directed acyclic graphical structure between decision
variables, and conditional preference statements attached to each variable in the
contexts defined by assignments of its parent variables in the graph. The prefer-
ence for one value against another is captured by assigning degrees of possibility
(here interpreted as utilities) to these values. When the only existing prefer-
ences are those expressed by the conditional statements (there are no preference
statements across contexts or variables), it has been shown that the dominance
relation between solutions is obtained by comparing vectors of symbolic utility
values (one per variable) using Pareto dominance.
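For Boolean variables, the comparison of such vectors amounts to the usual strict Pareto dominance test. A minimal sketch, with hypothetical good/bad vectors (1 if the variable's value agrees with its preference table, 0 otherwise):

```python
def pareto_dominates(w, v):
    """Strict Pareto dominance: w at least as good as v on every variable,
    and strictly better on at least one."""
    return all(a >= b for a, b in zip(w, v)) and any(a > b for a, b in zip(w, v))

# Good/bad vectors, one component per decision variable:
print(pareto_dominates((1, 1, 0), (1, 0, 0)))  # True
print(pareto_dominates((1, 0, 1), (0, 1, 1)))  # False: incomparable
```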
Some results comparing the preference relations between solutions obtained
from CP-nets and π-pref nets with Boolean decision variables are given in [1].
This is made easy by the fact that CP-nets and π-pref nets share the graph struc-
ture and the conditional preference tables. It was shown that the two obtained
dominance relations between solutions cannot conflict with each other (there
is no preference reversal between them), and that ceteris paribus information
can be added to π-pref nets in the form of preference statements between spe-
cific products of symbolic weights. One pending question was to show that the
dominance relation between solutions obtained from a CP-net refines the prefer-
ence relation obtained from the corresponding π-pref net. In the case of Boolean
variables, the π-pref net ordering can be viewed as a form of Pareto ordering:
each assignment of a decision variable is either good (i.e., in agreement with
the preference statement) or bad. The pending question comes down to proving
a monotonicity condition for the preference relation on solutions, stating that
as soon as a solution contains more (in the sense of inclusion) good variable
assignments than another solution, it should be strictly preferred by the CP-net.
Strangely enough, this natural question has hardly been addressed in the
literature so far (see [2] for some discussion). The aim of this paper is to solve
this problem and, more generally, to compare the orderings of solutions induced
by the two preference modeling structures.
We further show that dominance with respect to extended π-pref nets can
be computed in polynomial time, using linear programming; it thus forms a
polynomial upper approximation for the CP-net dominance relation.
The paper is structured as follows. In Sect. 2 we define a condition, which we
call local dominance, that is shown to be sufficient for dominance in a CP-net.
The following two sections, Sects. 3 and 4, use this sufficient condition to show
that a form of Pareto ordering is a lower bound for CP-net dominance.
Section 5 then uses the results of Sect. 4 to show that π-pref net dominance is
a lower bound for CP-net dominance. We also show there that the extended
π-pref net dominance, which is an upper bound for CP-net dominance, can be
computed in polynomial time. Section 6 concludes.
Proof. Let k = |Δ(w, v)|, which is greater than zero because w ≠ v. Let us
label the elements of Δ(w, v) as X1, …, Xk in such a way that if i < j then
Xi is not an ancestor of Xj with respect to the CP-net directed graph; this
is possible because of the acyclicity assumption on Σ. To prove (i), beginning
with outcome w, we flip variables of w to v in the order X1, …, Xk, so that we
first change w(X1) to v(X1), then change w(X2) to v(X2), and so on. The
choice of variable ordering means that when we flip variable Xi, the assignment
to the parents UXi of Xi is just w(UXi). It can be seen that this is a sequence
of worsening flips from w to v, and thus w cp-dominates v w.r.t. Σ.
Part (ii) is very similar, except that we start with v and iteratively change
Xi from v(Xi) to w(Xi) in the order i = 1, …, k. The assumption behind part
(ii) implies that we obtain an improving flipping sequence from v to w. □
Proof. Define outcome u by u(X) = v(X) if X is such that w(X) > v(X) given
w (so X ∈ Δ(w, v)), and u(X) = w(X) otherwise. Then Δ(w, v) is the disjoint
union of Δ(w, u) and Δ(u, v).
For all X ∈ Δ(w, u), w(X) > u(X) given w, because u(X) = v(X) and
w(X) > v(X) given w. Lemma 1 implies that w cp-dominates u w.r.t. Σ.
For all X ∈ Δ(u, v), u(X) > v(X) given v, since u(X) = w(X) and u(X) >
v(X) given w. Lemma 1 implies that u cp-dominates v w.r.t. Σ. Thus, w cp-
dominates v w.r.t. Σ. □
expresses a very strong form of Pareto dominance, since it requires not only
that w ≠ w′ and f_u^X(w(X)) ≥ f_u^X(w′(X)), but also that either
f_u^X(w(X)) = 1 or f_u^X(w′(X)) = 0, ∀X ∈ V.
w1(X) is fully dominating in X given w1, and thus w1(X) > w2(X) given w1;
therefore we have w1 >Σ_LD w2.
Clearly if >Σ_sp and >Σ_cp are equal then the inclusions >Σ_sp ⊆ >Σ_LD ⊆ >Σ_cp
imply that >Σ_sp and >Σ_LD are equal. Conversely, assume that >Σ_sp and >Σ_LD
are equal. We then have that >Σ_LD is transitive (since >Σ_sp is transitive), and
thus it is equal to its transitive closure, which equals >Σ_cp by Proposition 2. □
is a worsening flip from w w.r.t. CP-net Σ, with X being the variable on which
they differ. Then w >Σ_sp w′ if and only if (a) either w(X) is fully dominating in
X given w w.r.t. Σ, or w′(X) is fully dominated in X given w′ w.r.t. Σ; and (b)
for all Y ∈ V \ {X},
(i) if Y is not a child of X then w(Y) is either fully dominated or fully domi-
nating in Y given w w.r.t. Σ; and
(ii) if Y is a child of X then w(Y) is either fully dominating in Y given w
w.r.t. Σ or fully dominated in Y given w′ w.r.t. Σ.
The above considerations lead to the following result.
Lemma 3. Consider any X ∈ V, any assignment u to the parents of X, and
any values x, x′ ∈ Dom(X) such that x >X_u x′. Assume that w >Σ_sp w′ whenever
(w, w′) is an associated worsening flip, i.e., if w(X) = x and w′(X) = x′,
w and w′ agree on all other variables, and w extends u. Let (v, v′) be one such
associated worsening flip.
If variable Z is not a child of X and z is any element of Dom(Z), then z
is either fully dominated or fully dominating in Z given v w.r.t. Σ. We have
|Dom(Z)| ≤ 2.
If variable Y is a child of X and y is any element of Dom(Y), then y is either
fully dominating given v or fully dominated given v′. We have |Dom(Y)| ≤ 2.
176 N. Wilson et al.
Note that the condition |Dom(Z)| ≤ 2 follows since there can be at most one
fully dominated and at most one fully dominating element in Z given v.
Lemma 3 implies that for any variable X, every other variable has at most two
values, which immediately implies that every domain has at most two elements.
If X is not a true parent of Y then >Y_u does not depend on X. For any
CP-net Σ we can generate an equivalent CP-net (i.e., one that generates the same
ordering on outcomes) such that every parent of every variable is a true parent.
Proof. Suppose that y >Y_u y′. Let v be any outcome extending u and let v′ be
any outcome extending u′. Lemma 4 implies that X has at most two values. If X
had only one value then it would trivially not be a true parent of Y, so we can
assume that |Dom(X)| = 2. X is unconditional, so it has no parents. Our definition
of a CP-net implies that the relation >X is non-empty, so we have x1 >X x2 for
some labelling x1 and x2 of the values of X. We first consider the case in which
u(X) = x1. Now, y′ is not fully dominating given u and so, by Lemma 3, y′ is
fully dominated given u′, which implies y >Y_u′ y′.
We now consider the other case, in which u(X) = x2. Then, y is not fully
dominated given u, and so, by Lemma 3, y is fully dominating given u′, and
thus also y >Y_u′ y′. □
Proof. w >Σ_LD v if and only if for each X ∈ Δ(w, v) either w(X) > v(X) given
w, or w(X) > v(X) given v. For X ∈ Δ(w, v), we have w(X) > v(X) given w if
and only if X ∉ F_w; and we have w(X) > v(X) given v if and only if X ∈ F_v.
Thus, w >Σ_LD v if and only if for each X ∈ Δ(w, v) [X ∈ F_w ⇒ X ∈ F_v], which
holds if and only if F_w ∩ Δ(w, v) ⊆ F_v. □
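The characterization of Lemma 6 yields a direct polynomial test for local dominance. The sketch below assumes a hypothetical two-variable Boolean CP-net encoded by its preference tables (A's preferred value is 1 unconditionally; B's preferred value equals the value of A):

```python
# Hypothetical preference tables: variable index -> preferred value given outcome.
PREF = {0: lambda w: 1, 1: lambda w: w[0]}

def bad_set(w):
    """F_w: indices of variables whose value disagrees with the preference table."""
    return {i for i in range(len(w)) if w[i] != PREF[i](w)}

def delta(w, v):
    """Delta(w, v): indices of variables on which outcomes w and v differ."""
    return {i for i, (a, b) in enumerate(zip(w, v)) if a != b}

def locally_dominates(w, v):
    """Lemma-6-style test: w >_LD v iff F_w intersected with Delta(w, v) is
    a subset of F_v."""
    return w != v and (bad_set(w) & delta(w, v)) <= bad_set(v)

print(locally_dominates((1, 1), (1, 0)))  # True: (1, 1) has no bad variables
```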
We define the irreflexive binary relation >Σ_par on outcomes as follows.
Proof. For distinct w and v, w >Σ_sp v if and only if for all X ∈ V either v(X) is
fully dominated in X given v, or w(X) is fully dominating in X given w.
Suppose that w >Σ_sp v and consider any X ∈ V. If X ∈ F_w then w(X) is not
fully dominating in X given w, and so v(X) is fully dominated in X given v,
which implies that X ∈ F_v. We have shown that F_w ⊆ F_v.
Conversely, assume that F_w ⊆ F_v, and consider any X ∈ V such that v(X) is
not fully dominated in X given v. Because Σ is a Boolean locally totally ordered
CP-net, this implies that X is not bad for v. Since F_w ⊆ F_v, this implies that X
is not bad for w, and so w(X) is fully dominating in X given w. This proves
that w >Σ_sp v. □
The CP-net relation contains the Pareto relation, with the local dominance
relation being between the two.
Theorem 1. Let Σ be a Boolean locally totally ordered CP-net. The relation
>Σ_par is transitive, and is contained in >Σ_LD, i.e., w >Σ_par v implies
w >Σ_LD v, and thus >Σ_par ⊆ >Σ_LD ⊆ >Σ_cp. Furthermore, >Σ_par and >Σ_cp
are equal (i.e., are the same relation) if and only if >Σ_par and >Σ_LD are equal,
which happens only if every variable of the CP-net is unconditional.
Proof. Theorem 1 follows immediately from Propositions 3 and 4 and Lemma 7. □
As a consequence, we get that CP-nets are in agreement with the Pareto ordering
in the case of Boolean locally totally ordered variables: for any variable X and
any configuration u of its parents, consider the mapping f_u^X : Dom(X) → {0, 1}
such that f_u^X(x*) = 1 and f_u^X(x_*) = 0, where x* (resp. x_*) denotes the
preferred (resp. dispreferred) value of X in context u. For any two distinct
outcomes w and w′, we have that ∀X ∈ V, f^X_w(UX)(w(X)) ≥ f^X_w′(UX)(w′(X))
if and only if F_w ⊆ F_w′, which is the Pareto ordering >Σ_par.
We emphasise the following part of the theorem:
Corollary 1. Let Σ be a Boolean locally totally ordered CP-net. Then
w >Σ_par w′ implies w >Σ_cp w′.
As shown in the previous section, it does not seem straightforward to extend
this Pareto ordering in a natural way to non-Boolean variables without using
scaling functions that map all partial orders (Dom(X), >X_u), u ∈ U_X, to a
common value scale, unless the variables are all preferentially independent from
one another. In this case, U_X = ∅ for all X, and >X_u = >X for all X ∈ V. We
could then define the Pareto dominance relation >Σ_par on outcomes as follows:
w >Σ_par w′ if and only if w ≠ w′ and, for all X ∈ V, w(X) >X w′(X) or
w(X) = w′(X).
5 π-pref Nets
Possibility theory [8] is a theory of uncertainty devoted to the representation of
incomplete information. It is maxitive (addition is replaced by maximum) in con-
trast with probability theory. It ranges from purely ordinal to purely numerical
representations. Possibility theory can be used for representing preferences [9]. It
relies on the idea of a possibility distribution π, i.e., a mapping from a universe
of discourse Ω to the unit interval [0, 1]. Possibility degrees π(w) estimate to
what extent the solution w is not unsatisfactory. π-pref nets are based on possibilistic networks [3], using conditional possibilities of the form π(x|u) = Π(x ∧ u)/Π(u) for u ∈ Dom(U_X), where Π(φ) = max_{w ⊨ φ} π(w). The use of product-based conditioning rather than min-based conditioning leads to possibilistic nets that are more similar to Bayesian nets.
The ceteris paribus assumption of CP-nets is replaced in possibilistic networks by a chain rule like in Bayesian networks. It enables one to compute, using an aggregation function, the degree of possibility of solutions. However, it is supposed that these numerical values are unknown and represented by symbolic weights. Only orderings between symbolic values, or products thereof, can be assessed.
CP-Nets, π-pref Nets, and Pareto Dominance 179
π-pref nets induce a partial ordering between solutions based on the comparison of their degrees of possibility, in the sense of a joint possibility distribution computed using the product-based chain rule: π(x_1, . . . , x_n) = ∏_{i=1,...,n} π(x_i|u_i). The preferences between solutions are of the form w ≻π w′ if and only if π(w) > π(w′) for all instantiations of the symbolic weights.
It is then known that the π-pref net ordering between solutions induced by the preference tables is refined by comparing the sets F_w of bad variables for w:

w ≻π w′ ⇒ F_w ⊂ F_{w′},

since if two solutions contain variables having bad assignments in the sense of the preference tables, the corresponding symbolic values may differ if the contexts for assigning a value to this variable differ. It has been shown that if the weights α^u_X, reflecting the satisfaction level due to assigning the bad value to X in the context u, do not depend on this context, then we have an equivalence in the above implication:

If ∀X ∈ V, α^u_X = α_X for all u ∈ Dom(U_X), then w ≻π w′ ⇐⇒ w >Σ_par w′.
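Under this context-independence assumption, comparing joint possibilities reduces to a sub-multiset test on the symbolic weights appearing in each product: since every symbolic weight lies in (0, 1), a product over a strict sub-multiset is strictly larger for all instantiations. A small sketch (the encoding is ours):

```python
from collections import Counter

def pref_pi(weights_w, weights_v):
    """w ≻π v for all instantiations in (0, 1) iff the multiset of symbolic
    weights in π(w) is a strict sub-multiset of that in π(v)."""
    cw, cv = Counter(weights_w), Counter(weights_v)
    return cw != cv and all(cw[a] <= cv[a] for a in cw)

# π(w) = α_B while π(v) = α_A · α_B: v carries strictly more weights, so w ≻π v
print(pref_pi(["aB"], ["aA", "aB"]))   # True
print(pref_pi(["aA"], ["aB"]))         # False: symbolically incomparable
```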
In the following, we highlight local constraints between each node and its children that enable ceteris paribus to be simulated. Ceteris paribus constraints are of the form w >Σ_cp w′ where w and w′ differ by one flip. For each such statement (one per variable), we add the constraint on possibility degrees π(w) > π(w′). Using the chain rule, it corresponds to comparing products of symbolic weights. Let Dom(U_X) = ×_{X_i ∈ U_X} Dom(X_i) denote the Cartesian product of the domains of the variables in U_X, α^u_X = π(x⁻|u), where x⁻ is bad for X, and γ^{u′}_Y = π(y⁻|u′). Suppose a CP-net and a π-pref net built from the same preference statements. It has been shown in [2] that the worsening flip constraints are all induced by the conditions: ∀X ∈ V s.t. X has children Ch(X) ≠ ∅:

max_{u ∈ Dom(U_X)} α^u_X < min_{Y ∈ Ch(X), u′ ∈ Dom(U_Y)} γ^{u′}_Y.
Let ≻⁺π be the resulting preference ordering built from the preference tables and applying constraints of the above format between symbolic weights; then, it is clear that w ≻cp w′ ⇒ w ≻⁺π w′: relation ≻⁺π is a bracketing from above of the CP-net ordering.
5.2 Relation ≻⁺π as a Polynomial Upper Bound for CP-Net Dominance
Let Z(Σ) be the set of weight vectors associated with the symbolic-weight comparisons for each ceteris paribus statement, plus, for each i = 1, . . . , m, the element z^(i). Similarly, every solution is associated with a product of symbolic weights, so a comparison w > w′ between solutions w and w′ corresponds to a statement pertaining to a weight vector z′. The definitions lead easily to the following characterisation of this form of dominance.
Theorem 2. Consider any CP-net with associated set of weight vectors Z(Σ), and let w and w′ be two different solutions, where w > w′ has associated vector z′. We have that w ≻⁺π w′ if and only if there exist non-negative real numbers r_z for each z ∈ Z(Σ) such that ∑_{z∈Z(Σ)} r_z z = z′. Hence, whether or not w ≻⁺π w′ holds can be checked in polynomial time.
Proof. As argued above, w ≻⁺π w′ holds if and only if, for vectors λ, the set of inequalities {z · λ > 0 : z ∈ Z(Σ)} implies z′ · λ > 0. We need to show that this holds if and only if there exist non-negative real numbers r_z for each z ∈ Z(Σ) such that ∑_{z∈Z(Σ)} r_z z = z′. Firstly, let us assume that there exist non-negative real numbers r_z for each z ∈ Z(Σ) such that ∑_{z∈Z(Σ)} r_z z = z′. Consider any vector λ such that z · λ > 0 for all z ∈ Z(Σ). Then z′ · λ = ∑_{z∈Z(Σ)} r_z (z · λ), which is greater than zero since each r_z is non-negative and at least some r_z > 0 (else z′ is the zero vector, which would contradict w ≠ w′).
Conversely, let us assume that there do not exist non-negative real numbers r_z for each z ∈ Z(Σ) such that ∑_{z∈Z(Σ)} r_z z = z′. To prove that the set of inequalities {z · λ > 0 : z ∈ Z(Σ)} does not imply z′ · λ > 0, we will show that there exists a vector λ with z · λ > 0 for all z ∈ Z(Σ) but z′ · λ ≤ 0. Let C be the set of vectors of the form ∑_{z∈Z(Σ)} r_z z over all choices of non-negative reals r_z. Now, C is a convex and closed set, which by the hypothesis does not intersect with {z′} (i.e., does not contain z′). Since {z′} is closed and compact, we can use a hyperplane separation theorem to show that there exists a vector λ and real numbers c1 < c2 such that for all x ∈ C, x · λ > c2 and z′ · λ < c1. Because C is closed under strictly positive scalar multiplication (i.e., x ∈ C implies rx ∈ C for all real r > 0), we must have c2 ≤ 0, and x · λ ≥ 0 for all x ∈ C, and in particular z · λ ≥ 0 for all z ∈ Z(Σ). Also, z′ · λ < c1 < c2 ≤ 0, so z′ · λ ≤ 0, as required.
The last part follows since linear programming is polynomial. □
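Theorem 2 reduces dominance checking to the feasibility of ∑_z r_z z = z′ with r_z ≥ 0, which in practice an LP solver decides in polynomial time. The toy check below replaces the LP by a brute-force search over small integer coefficients, purely to illustrate the cone-membership condition; the vectors and the coefficient bound are made up for the example.

```python
from itertools import product

def in_cone(Z, z_target, max_coef=5):
    """Brute-force: does some non-negative integer combination of the vectors
    in Z equal z_target? (An LP handles the general polynomial-time case;
    this exhaustive search is only an illustration for tiny instances.)"""
    dim = len(z_target)
    for coefs in product(range(max_coef + 1), repeat=len(Z)):
        combo = tuple(sum(c * z[k] for c, z in zip(coefs, Z))
                      for k in range(dim))
        if combo == tuple(z_target) and any(coefs):
            return True
    return False

# Each vector encodes a comparison between products of symbolic weights
# (as exponent differences); here z' is the sum of the two constraints.
Z = [(1, -1, 0), (0, 1, -1)]
print(in_cone(Z, (1, 0, -1)))   # True: r = (1, 1)
print(in_cone(Z, (-1, 0, 1)))   # False
```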
References
1. Ben Amor, N., Dubois, D., Gouider, H., Prade, H.: Possibilistic preference networks.
Inf. Sci. 460–461, 401–415 (2018)
2. Ben Amor, N., Dubois, D., Gouider, H., Prade, H.: Expressivity of possibilistic
preference networks with constraints. In: Moral, S., Pivert, O., Sánchez, D., Marı́n,
N. (eds.) SUM 2017. LNCS (LNAI), vol. 10564, pp. 163–177. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67582-4_12
3. Benferhat, S., Dubois, D., Garcia, L., Prade, H.: On the transformation between
possibilistic logic bases and possibilistic causal networks. Int. J. Approx. Reasoning
29(2), 135–173 (2002)
4. Boutilier, C., Bacchus, F., Brafman, R.I.: UCP-networks: a directed graphical rep-
resentation of conditional utilities. In: Proceedings of the 17th Conference on Uncer-
tainty in AI, Seattle, Washington, USA, pp. 56–64 (2001)
5. Boutilier, C., Brafman, R.I., Hoos, H.H., Poole, D.: Reasoning with conditional
ceteris paribus preference statements. In: Proceedings of the 15th Conference on
Uncertainty in AI, Stockholm, Sweden, pp. 71–80 (1999)
6. Boutilier, C., Brafman, R.I., Domshlak, C., Hoos, H.H., Poole, D.: CP-nets: a tool for
representing and reasoning with conditional ceteris paribus preference statements.
J. Artif. Intell. Res. 21, 135–191 (2004)
7. Brafman, R.I., Domshlak, C., Kogan, T.: Compact value-function representations
for qualitative preferences. In: Proceedings of the 20th Conference on Uncertainty
in AI, Banff, Canada, pp. 51–59 (2004)
Yakoub Salhi(B)
1 Introduction
In this work, we are interested in quantifying conflicts for better analyzing the
nature of the inconsistency in a knowledge base. Plenty of proposals for inconsistency measures have been defined in the literature (e.g. see [3,7,9,14,15]),
and it has been shown that they can be applied in different domains, such as
e-commerce protocols [4], integrity constraints [6], databases [13], multi-agent systems [10], and spatio-temporal qualitative reasoning [5].
In the literature, an inconsistency measure is defined as a function that associates a non-negative value with each knowledge base. In particular, the authors
in [9] have proposed different rationality postulates for defining inconsistency
measures that allow capturing important aspects related to inconsistency in the
case of classical propositional logic. Furthermore, objections to some of them
and many new postulates have also been proposed in [1]. The main advantage of
the approach based on rationality postulates for defining inconsistency measures
is its flexibility in the sense that the appropriate measure in a given context can
be chosen through the desired properties from the existing postulates.
In [11,12], the authors have proposed a general framework for reasoning under
inconsistency by forgetting propositional variables to restore consistency. Using
the variable forgetting approach of this framework, an inconsistency measure
has been proposed in [2]. The main idea consists in quantifying the amount of
inconsistency as the minimum number of variable occurrences that have to be
forgotten to restore consistency. We here propose a new approach for defining
inconsistency measures that can be seen as a generalization of the previous approach. Indeed, our main idea consists in measuring the amount of inconsistency
by considering sets of subformula occurrences that we need to forget to restore consistency.

© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 184–191, 2019.
https://doi.org/10.1007/978-3-030-35514-2_14

Measuring Inconsistency Through Subformula Forgetting 185

To the best of our knowledge, we here provide the first approach
that takes into account in a syntactic way the internal structure of the formulas.
In this work, we propose rationality postulates for measuring inconsistency
that are based on reasoning about subformula occurrences. In particular, we consider the postulate stating that forgetting any subformula occurrence does not increase the amount of inconsistency. Finally, we propose several inconsistency measures
that are based on forgetting subformula occurrences. These measures are defined
by considering the number of modified formulas and the size of the forgotten
subformula occurrences to restore consistency. For instance, one of the proposed inconsistency measures quantifies the amount of inconsistency as the minimum size of the subformula occurrences that have to be forgotten to obtain consistency. It is worth mentioning that we show that two of the described inconsistency measures correspond to two measures existing in the literature: that
introduced in [2] based on forgetting variables and that introduced in [7] based
on consistent subsets.
2 Preliminaries
2.1 Classical Propositional Logic
We here consider that every piece of information is represented using classical
propositional logic. We use Prop to denote the set of propositional variables.
The set of propositional formulas is denoted Form. We use the letters p, q, r, s to
denote the propositional variables, and the Greek letters φ, ψ and χ to denote
the propositional formulas. Moreover, given a syntactic object o, we use P(o) to
denote the set of propositional variables occurring in o. Given a set of variables
S such that P(φ) ⊆ S, we use Mod(φ, S) to denote the set of all the models of φ defined over S.
Given a formula φ, its size, denoted s(φ), is inductively defined as follows: s(p) = s(⊥) = s(⊤) = 1; s(¬ψ) = 1 + s(ψ); s(ψ ⊗ χ) = 1 + s(ψ) + s(χ) for ⊗ ∈ {∧, ∨, →}. In other words, the size of a formula is the number of occurrences of propositional variables, constants and logical connectives that appear in it.
Similarly, the set of the subformulas of φ, denoted SF(φ), is inductively defined as follows: SF(p) = {p}; SF(⊥) = {⊥}; SF(⊤) = {⊤}; SF(¬ψ) = {¬ψ} ∪ SF(ψ); SF(ψ ⊗ χ) = {ψ ⊗ χ} ∪ SF(ψ) ∪ SF(χ) for ⊗ ∈ {∧, ∨, →}.
Given a formula φ and ψ ∈ SF (φ), we use O(φ, ψ) to denote the number
of the occurrences of ψ in φ. Moreover, we consider that the occurrences of a
subformula are ordered starting from the left. For example, consider the formula
φ = (p ∧ q) → (¬r ∨ q). Then, SF (φ) = {φ, p ∧ q, ¬r ∨ q, ¬r, p, q, r}. Further,
O(φ, p) = 1 and O(φ, q) = 2. The first occurrence of q is that occurring in the
subformula p ∧ q, while the second is that occurring in the subformula ¬r ∨ q.
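These inductive definitions translate directly into code. The sketch below uses our own nested-tuple encoding of formulas (not from the paper) and checks them on the running example φ = (p ∧ q) → (¬r ∨ q).

```python
# Formulas as nested tuples: ("var", "p"), ("top",), ("bot",), ("not", f),
# and ("and"/"or"/"imp", f, g).  This encoding is ours, chosen to mirror
# the inductive definitions of s(φ), SF(φ) and O(φ, ψ).

def size(f):
    """s(φ): number of variable, constant and connective occurrences."""
    if f[0] in ("var", "top", "bot"):
        return 1
    if f[0] == "not":
        return 1 + size(f[1])
    return 1 + size(f[1]) + size(f[2])

def subformulas(f):
    """SF(φ) as a set of (hashable) nested tuples."""
    if f[0] in ("var", "top", "bot"):
        return {f}
    if f[0] == "not":
        return {f} | subformulas(f[1])
    return {f} | subformulas(f[1]) | subformulas(f[2])

def occurrences(f, g):
    """O(φ, ψ): number of occurrences of g in f."""
    n = 1 if f == g else 0
    if f[0] == "not":
        n += occurrences(f[1], g)
    elif f[0] in ("and", "or", "imp"):
        n += occurrences(f[1], g) + occurrences(f[2], g)
    return n

p, q, r = ("var", "p"), ("var", "q"), ("var", "r")
phi = ("imp", ("and", p, q), ("or", ("not", r), q))
print(size(phi))              # 8
print(len(subformulas(phi)))  # 7
print(occurrences(phi, q))    # 2
```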
The polarity of a subformula occurrence within a formula that has a polarity
(positive or negative) is defined as follows:
– φ is a positive (resp. negative) subformula occurrence of the positive (resp.
negative) formula φ;
Consider, for instance, the formula p → (p ∨ q) with negative polarity. Then, the left-hand occurrence of p is a positive subformula occurrence and the right-hand occurrence is negative.
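The example can be checked with the usual polarity recursion (our reconstruction of the inductive cases elided by the page break: negation and the antecedent of an implication flip polarity; conjunction and disjunction preserve it). Formula encoding as nested tuples is ours.

```python
def polarities(f, var, pol="neg"):
    """Left-to-right polarities of the occurrences of `var` in f, when f
    itself has polarity `pol` (assumed recursion; encoding ours)."""
    flip = {"pos": "neg", "neg": "pos"}
    if f[0] == "var":
        return [pol] if f[1] == var else []
    if f[0] in ("top", "bot"):
        return []
    if f[0] == "not":
        return polarities(f[1], var, flip[pol])
    if f[0] == "imp":
        # the antecedent flips polarity, the consequent keeps it
        return polarities(f[1], var, flip[pol]) + polarities(f[2], var, pol)
    return polarities(f[1], var, pol) + polarities(f[2], var, pol)  # and/or

phi = ("imp", ("var", "p"), ("or", ("var", "p"), ("var", "q")))
print(polarities(phi, "p"))   # ['pos', 'neg']
```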
A knowledge base is a finite set of propositional formulas. A knowledge base K is inconsistent if its associated formula ∧_{φ∈K} φ (⊤ if K = ∅) is inconsistent, written K ⊢ ⊥; otherwise it is consistent, written K ⊬ ⊥. We use K_Form to denote the set of knowledge bases. Moreover, we use SF(K) to denote the set ∪_{φ∈K} SF(φ).
From now on, we consider that the polarity of the formulas occurring in any knowledge base is negative; the same results can be obtained by symmetrically considering the positive polarity.
Given a knowledge base K, a subset K′ ⊆ K is said to be a minimal inconsistent subset (MIS) of K if (i) K′ ⊢ ⊥ and (ii) ∀φ ∈ K′, K′ \ {φ} ⊬ ⊥. Moreover, K′ is said to be a maximal consistent subset (MCS) of K if (i) K′ ⊬ ⊥ and (ii) ∀φ ∈ K \ K′, K′ ∪ {φ} ⊢ ⊥. We use MISes(K) and MCSes(K) to denote respectively the set of all the MISes and the set of all the MCSes of K.
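For tiny knowledge bases, MISes and MCSes can be enumerated directly from their definitions with a truth-table consistency check. The sketch below is a brute-force illustration (exponential, not a practical algorithm), using our own nested-tuple formula encoding.

```python
from itertools import combinations, product

# Formulas as nested tuples, e.g. ("not", ("var", "p")) — encoding ours.
def evaluate(f, m):
    op = f[0]
    if op == "var": return m[f[1]]
    if op == "top": return True
    if op == "bot": return False
    if op == "not": return not evaluate(f[1], m)
    if op == "and": return evaluate(f[1], m) and evaluate(f[2], m)
    if op == "or":  return evaluate(f[1], m) or evaluate(f[2], m)
    return (not evaluate(f[1], m)) or evaluate(f[2], m)   # "imp"

def variables(f):
    if f[0] == "var": return {f[1]}
    if f[0] in ("top", "bot"): return set()
    return set().union(*(variables(g) for g in f[1:]))

def consistent(K):
    """Truth-table satisfiability check, feasible only for tiny bases."""
    vs = sorted(set().union(set(), *(variables(f) for f in K)))
    return any(all(evaluate(f, dict(zip(vs, bits))) for f in K)
               for bits in product([False, True], repeat=len(vs)))

def mises(K):
    """Inconsistent subsets all of whose proper subsets are consistent."""
    return [set(S) for r in range(1, len(K) + 1)
            for S in combinations(K, r)
            if not consistent(S)
            and all(consistent([g for g in S if g != f]) for f in S)]

def mcses(K):
    """Consistent subsets that cannot be extended within K."""
    return [set(S) for r in range(len(K) + 1)
            for S in combinations(K, r)
            if consistent(S)
            and all(not consistent(list(S) + [f]) for f in K if f not in S)]

p, q = ("var", "p"), ("var", "q")
K = [p, ("not", p), q]
print(mises(K))   # the single MIS {p, ¬p}
```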
2.2 Substitution
3 Inconsistency Measure
In the literature, an inconsistency measure is defined as a function that associates a non-negative value with each knowledge base (e.g. [3,7,9,14,15]). It is used to
quantify the amount of inconsistency in a knowledge base. The different works
on inconsistency measures use postulate-based approaches to capture important
aspects related to inconsistency. In particular, in the recent work [3], the authors
have proposed the following formal definition of inconsistency measure that we
consider in this work.
Definition 1 (Inconsistency Measure). An inconsistency measure is a function I : K_Form → R⁺_∞ that satisfies the two following properties: (i) ∀K ∈ K_Form, I(K) = 0 iff K is consistent (Consistency); and (ii) ∀K, K′ ∈ K_Form, if K ⊆ K′ then I(K) ≤ I(K′) (Monotonicity). The set R⁺_∞ corresponds to the set of non-negative real numbers augmented with a greatest element denoted ∞.
The postulate (Consistency) means that an inconsistency measure must
allow distinguishing between consistent and inconsistent knowledge bases, and
(M onotonicity) means that the amount of inconsistency does not decrease by
adding new formulas to a knowledge base. Many other postulates have been
introduced in the literature to characterize particular aspects related to incon-
sistency (e.g. see [1,9,15]).
Let us now describe some simple inconsistency measures from the literature:
– I_M(K) = |MISes(K)| ([8])
– I_dhit(K) = |K| − max{|K′| : K′ ∈ MCSes(K)} ([7])
– I_hs(K) = min{|S| : S ⊆ M and ∀φ ∈ K, ∃B ∈ S s.t. B ⊨ φ} − 1, with M = ∪_{φ∈K} Mod(φ, P(K)) and min{} = ∞ ([14])
– I_forget(K) = min{n : ∧_{φ∈K} φ[(p_1, i_1), . . . , (p_n, i_n)/C_1, . . . , C_n] is consistent, p_1, . . . , p_n ∈ Prop, C_1, . . . , C_n ∈ {⊤, ⊥}} ([2])
The measure I_M quantifies the amount of inconsistency through minimal inconsistent subsets: more MISes bring more conflicts; I_dhit considers the dual of the size of the greatest MCSes; I_hs is defined through an explicit use of the Boolean semantics: the amount of inconsistency is related to the minimum number of models that satisfy all the formulas in the considered knowledge base; and I_forget defines the amount of inconsistency as the minimum number of variables that we have to forget to restore consistency. It is worth mentioning that we consider here the reformulation of I_forget proposed in [15].
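To make I_M and I_dhit concrete, the sketch below instantiates them on the special case of a base of literals, where consistency checking reduces to the absence of a complementary pair; this is a deliberate simplification for illustration, not the general algorithm.

```python
from itertools import combinations

def consistent(S):
    """For a base restricted to literals ('p' / '~p'): consistent iff it
    contains no complementary pair (simplifying assumption, ours)."""
    return not any(("~" + l) in S for l in S if not l.startswith("~"))

def I_M(K):
    """Number of MISes; for literal bases these are the complementary pairs."""
    return sum(1 for a, b in combinations(K, 2) if not consistent({a, b}))

def I_dhit(K):
    """|K| minus the size of a largest consistent subset of K."""
    best = max(len(S) for r in range(len(K) + 1)
               for S in combinations(K, r) if consistent(set(S)))
    return len(K) - best

K = {"p", "~p", "q", "~q", "r"}
print(I_M(K), I_dhit(K))   # 2 2
```

Here the two conflicts {p, ¬p} and {q, ¬q} give I_M(K) = 2, while a largest consistent subset has three literals (one per pair, plus r), so I_dhit(K) = 5 − 3 = 2.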
– (ForgetNegOcc):
1. ∀ψ ∈ SF(φ) and ∀i ∈ 1..O(φ, ψ) such that the ith occurrence of ψ in φ is negative, I(K ∪ {φ[(ψ, i)/⊤]}) ≤ I(K ∪ {φ});
2. ∀ψ ∈ SF(φ) and ∀i ∈ 1..O(φ, ψ) such that the ith occurrence of ψ in φ is negative and φ[(ψ, i)/⊥] ∉ K, I(K ∪ {φ}) ≤ I(K ∪ {φ[(ψ, i)/⊥]}).
– (ForgetPosOcc):
1. ∀ψ ∈ SF(φ) and ∀i ∈ 1..O(φ, ψ) such that the ith occurrence of ψ in φ is positive and φ[(ψ, i)/⊤] ∉ K, I(K ∪ {φ}) ≤ I(K ∪ {φ[(ψ, i)/⊤]});
2. ∀ψ ∈ SF(φ) and ∀i ∈ 1..O(φ, ψ) such that the ith occurrence of ψ in φ is positive, I(K ∪ {φ[(ψ, i)/⊥]}) ≤ I(K ∪ {φ}).
The first property of (ForgetNegOcc) expresses the fact that a negative subformula occurrence becomes useless to produce inconsistency if it is replaced with ⊤. Regarding the second property, it is worth mentioning that the condition φ[(ψ, i)/⊥] ∉ K is only used to prevent formula deletion. The postulate (ForgetPosOcc) is simply the counterpart of (ForgetNegOcc) in the case of positive subformula occurrences.
In a sense, the next proposition shows that the previous postulates can be seen as restrictions, to the case of consistent formulas, of the postulate (Dominance) introduced in [9]. Let us recall that (Dominance) is defined as follows: for every consistent formula φ and every formula ψ, if φ ⊢ ψ then I(K ∪ {φ}) ≥ I(K ∪ {ψ}).
Proposition 1. The following two properties are satisfied for all φ ∈ Form with negative polarity, all ψ ∈ SF(φ) and all i ∈ 1..O(φ, ψ): (i) if the ith occurrence of ψ in φ is negative, then φ[(ψ, i)/⊥] ⊢ φ and φ ⊢ φ[(ψ, i)/⊤]; (ii) if the ith occurrence of ψ in φ is positive, then φ[(ψ, i)/⊤] ⊢ φ and φ ⊢ φ[(ψ, i)/⊥].
Proof. We here consider only the case of φ[(ψ, i)/⊥] ⊢ φ when the considered occurrence is negative and the case of φ ⊢ φ[(ψ, i)/⊥] when the considered occurrence is positive, the other cases being similar. The proof is by mutual induction on the value of s(φ). If s(φ) = 1, then φ is a propositional variable or a constant, and as a consequence, φ[(ψ, i)/⊥] = ⊥ holds in the case where the ith occurrence of ψ in φ is negative. Thus, we obtain φ[(ψ, i)/⊥] = ⊥ ⊢ φ. Moreover, there is no positive subformula occurrence in this case. Assume now that s(φ) > 1. Then, φ has one of the following forms: ¬φ′, φ1 ∧ φ2, φ1 ∨ φ2 and φ1 → φ2. Consider first the case φ = ¬φ′; the proof is trivial in the case ψ = φ. If the ith occurrence of ψ in φ is negative, then it is positive in φ′, and using the induction hypothesis, φ′ ⊢ φ′[(ψ, i)/⊥] holds. Thus, we obtain ¬φ′[(ψ, i)/⊥] = φ[(ψ, i)/⊥] ⊢ ¬φ′ = φ. The case where the ith occurrence of ψ in φ is positive is similar. The proof in the remaining cases can be obtained by simple application of the induction hypothesis, except the case φ1 → φ2, which is similar to that of ¬φ′.
For instance, a direct consequence of Proposition 1 is the fact that Ihs satis-
fies (F orgetN egOcc) and (F orgetP osOcc). However, IM does not satisfy these
References
1. Besnard, P.: Revisiting postulates for inconsistency measures. In: Fermé, E., Leite,
J. (eds.) JELIA 2014. LNCS (LNAI), vol. 8761, pp. 383–396. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11558-0_27
2. Besnard, P.: Forgetting-based inconsistency measure. In: Schockaert, S., Senellart,
P. (eds.) SUM 2016. LNCS (LNAI), vol. 9858, pp. 331–337. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45856-4_23
3. Bona, G.D., Grant, J., Hunter, A., Konieczny, S.: Towards a unified framework
for syntactic inconsistency measures. In: Proceedings of the Thirty-Second AAAI
Conference on Artificial Intelligence, New Orleans, Louisiana, USA (2018)
4. Chen, Q., Zhang, C., Zhang, S.: A verification model for electronic transaction
protocols. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds.) APWeb 2004. LNCS,
vol. 3007, pp. 824–833. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24655-8_90
5. Condotta, J., Raddaoui, B., Salhi, Y.: Quantifying conflicts for spatial and tempo-
ral information. In: Principles of Knowledge Representation and Reasoning: Pro-
ceedings of the Fifteenth International Conference, KR 2016, Cape Town, South
Africa, 25–29 April 2016, pp. 443–452 (2016)
6. Grant, J., Hunter, A.: Measuring inconsistency in knowledgebases. J. Intell. Inf.
Syst. 27(2), 159–184 (2006)
7. Grant, J., Hunter, A.: Distance-based measures of inconsistency. In: van der Gaag,
L.C. (ed.) ECSQARU 2013. LNCS (LNAI), vol. 7958, pp. 230–241. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39091-3_20
8. Hunter, A., Konieczny, S.: Measuring inconsistency through minimal inconsistent
sets. In: Principles of Knowledge Representation and Reasoning: Proceedings of the
Eleventh International Conference, KR 2008, Sydney, Australia, 16–19 September
2008, pp. 358–366. AAAI Press (2008)
9. Hunter, A., Konieczny, S.: On the measure of conflicts: Shapley inconsistency values. Artif. Intell. 174(14), 1007–1026 (2010)
10. Hunter, A., Parsons, S., Wooldridge, M.: Measuring inconsistency in multi-agent systems. Künstliche Intelligenz 28, 169–178 (2014)
11. Lang, J., Marquis, P.: Resolving inconsistencies by variable forgetting. In: Proceedings of the Eighth International Conference on Principles of Knowledge Representation and Reasoning (KR-02), Toulouse, France, 22–25 April 2002, pp. 239–250 (2002)
12. Lang, J., Marquis, P.: Reasoning under inconsistency: a forgetting-based approach.
Artif. Intell. 174(12–13), 799–823 (2010)
13. Martinez, M.V., Pugliese, A., Simari, G.I., Subrahmanian, V.S., Prade, H.: How
dirty is your relational database? An axiomatic approach. In: Mellouli, K. (ed.)
ECSQARU 2007. LNCS (LNAI), vol. 4724, pp. 103–114. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75256-1_12
14. Thimm, M.: On the expressivity of inconsistency measures. Artif. Intell. 234, 120–
151 (2016)
15. Thimm, M.: On the evaluation of inconsistency measures. In: Grant, J., Martinez,
M.V. (eds.) Measuring Inconsistency in Information, Volume 73 of Studies in Logic.
College Publications, February 2018
Explaining Hierarchical Multi-linear Models
Christophe Labreuche(B)
1 Introduction
One of the major challenges of Artificial Intelligence (AI) methods is to explain their
predictions and make them transparent for the user. The explanations can take very
different forms depending on the area. For instance, in Computer Vision, one is inter-
ested in identifying the salient factors explaining the classification of an image [12]. In
Machine Learning, one might look for the smallest modification to make on an instance
to change its class (counter-factual example) [16]. In Constraint Programming, the aim
is to find the simplest way to repair a set of inconsistent constraints [8]. And so on. There
is thus a variety of explanation methods applicable to a wide range of AI methods.
Many decision problems involve multiple attributes to be taken into account. Multi-
Criteria Decision Aiding (MCDA) aims at representing the preferences of a decision
maker regarding options on the basis of multiple and conflicting criteria. In real appli-
cations, one shall use elaborate decision models able to capture complex expertise. A
few models have been shown to have this ability, such as the Choquet integral [3], the
multi-linear model [11] or the Generalized Additive Independence (GAI) model [1, 6].
The main asset of these models is their ability to represent interacting criteria. The multi-linear model is especially important as it is the most natural multi-dimensional interpolation model. It is very smooth and does not exhibit the gradient discontinuities that the Choquet integral has. The following example illustrates applications in which
such models are important.
[Fig. 1. Hierarchy of the criteria: leaves 1–6 and aggregation nodes 7–10, with node 10 at the root.]
The higher the PL, the more suspicious a ship and the more urgent it is to intercept it. The PL is used to raise
the attention of the DM on some specific ships. The computation of the PL depends
on several criteria: 1. Incoherence between Automatic Identification System (AIS) data
and radar detection; 2. Suspicion of drug smuggling on the ship; 3. Suspicion of human
smuggling on the ship; 4. Current speed (since fast boats are often used to avoid being
easily intercepted); 5. Maximum speed since the first detection of the ship (it represents
the urgency for the potential interception); 6. Proximity of the ship to the shore (since
smuggling ships often aim at reaching the shore as fast as possible).
In the previous example, as in most real-applications, the criteria are not considered
in a flat way but are organized as a tree. The criteria are indeed organized hierarchically
with several nested aggregation functions. The hierarchical structure shall represent the
natural decomposition of the decision reasoning into points of view and sub-points of
view. In the previous example, the six criteria are organized as in Fig. 1. The tree of
the DM contains four aggregation nodes: 7. Suspicion of illegal activity; 8. Kinematics;
9. Capability to escape interception; 10. Overall PL.
The ability to explain the evaluation is very important in Example 2. If the PL of
a ship suddenly increases over time, the tactical operator needs to understand where
this comes from. The latter is under stress and time pressure. He is thus looking for
an explanation highlighting the most influencing attributes in the evolution of the PL.
This type of explanation has been recently widely studied under the name of feature
attribution. The aim is to attribute to each feature its level of contribution. Among the
many concepts that have been proposed, the Shapley value has been widely used in
Machine Learning [4, 10].
The Shapley value has also recently been used as an explanation means in MCDA [9].
In this reference, a new explanation approach for hierarchical MCDA models has been
introduced. The idea is to highlight the criteria that contribute most to the decision.
In Example 2, consider two ships represented by two alternatives x and y taking the
following values on the six attributes x = (x1 , x2 , x3 , x4 , x5 , x6 ) = (+, −, −, −, −, −)
and y = (+, +, +, +, +, +) (where values ‘+’ and ‘−’ indicate a high and low value
respectively). The type of explanation that is sought can typically be that the nodes
contributing the most to the preference of y over x are nodes 8 (Kinematics) and 9
(Capability to escape interception) and not 2 (Suspicion of drug smuggling on the ship)
or 3 (Suspicion of human smuggling on the ship). This helps the user to further analyze
the values of criteria 8 and 9 (and not criteria 2 or 3). To this end, an indicator measuring
194 C. Labreuche
the degree to which a node contributes to the preference between two alternatives has
been defined in Ref. [9]. It is a generalization of the Shapley value on trees.
The contribution of this paper is to further develop this approach in two directions.
We are interested in the practical computation of the influence indicator. The main
drawback of the Shapley value is that it has an exponential complexity in the num-
ber of nodes. It has been shown in Ref. [9] that the influence index for a node can be
equivalently be computed on a subtree. The first contribution of this paper is to rewrite
the influence index so as to improve the computational complexity. It cannot be fur-
ther reduced without making assumptions on the utility model. An illustration of the
influence indicator to the Choquet integral has been proposed in Ref. [9]. We consider
in this paper another important class of aggregation model, based on the multi-linear
extension. One of the main results of this paper shows that for the multi-linear model,
the computations can be performed independently on each aggregation node, making
the computation of the influence index much more tractable (see Sect. 5.2).
In practice, the values of the alternatives on the attributes are imprecise (second
direction of this work). In Example 2, one needs to assess the PL of faraway ships for
which the values of some attributes are not precisely known. In particular, the attributes
related to the intent of the ship cannot readily be determined. Other attributes such as
the heading of a ship cannot be assigned to a precise value as it is a fluctuating vari-
able. The imprecision of the values of the attributes can also come from some disagree-
ment among experts opinions (for attributes corresponding to a subjective judgment).
For numerical attributes, the imprecise value can take the form of an interval. So far,
there is no explanation approach able to capture imprecise values of the alternatives. In
Example 2, the values of a ship on numerical attributes such as the maximum speed or
the proximity of the shore might be given as an interval of confidence. The imprecisions
on the value of the alternatives on the attributes propagate to the influence degrees in
a very complex manner. We show that when the aggregation models are multi-linear
models, the computation of the bounds on the influence degree can be easily obtained
(see Sect. 4).
where δ^{x,y,T,U}_π(i) := U(y_{S_π(i)}, x_{N\S_π(i)}) − U(y_{S_π(i)\{i}}, x_{(N\S_π(i))∪{i}}). In (3), the set of admissible orderings Π(T) is defined as the set of orderings of the elements of N for which all the elements of a subtree of T are consecutive. More precisely, π ∈ Π(T) iff, for every l ∈ M_T \ N, the indices π^{−1}(Leaf_T(l)) are consecutive.
The influence index can equivalently be computed on the restricted tree T_[J].
Our aim is to implement the influence index in practice. The influence index contains
an exponential number of terms. It is thus very challenging to perform its exact compu-
tation. A complexity analysis is performed in Sect. 3.1. An alternative expression of the
influence index, reducing its computational complexity is proposed in Sect. 3.2.
By Sect. 2.3, the expression of the influence index is given by (3). Hence the complexity
of Ii (x, y; U, T ) depends on the number of permutations Π(T ). For j ∈ MT \ N ,
we denote by T|j the subtree of T starting at node j, defined by MT|j := DescT (j),
NT|j := Leaf T (j), sT|j := j, and ChT|j (l) = ChT (l) for all l ∈ MT|j \ NT|j . Then the
cardinality of Π(T ) can be recursively computed thanks to the next result.
Lemma 1. |Π(T)| = |Ch_T(s_T)|! × ∏_{j ∈ Ch_T(s_T)} |Π(T_{|j})|.
The proofs of this result and the others are omitted for space limitation.
Lemma 1 provides a recursive formula to compute the number of compatible permutations in a tree T, that is, the complexity of I_i(x, y, T, U).
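The recursion of Lemma 1 is easily implemented; the sketch below uses our own encoding of a tree as nested lists of subtrees (a leaf being anything that is not a list).

```python
from math import factorial

def num_orderings(tree):
    """|Π(T)| via the Lemma 1 recursion: |Ch(s_T)|! times the product of
    the counts for the children's subtrees; a leaf contributes 1.
    A tree is a list of subtrees; a leaf is any non-list (encoding ours)."""
    if not isinstance(tree, list):
        return 1
    count = factorial(len(tree))
    for child in tree:
        count *= num_orderings(child)
    return count

# Uniform tree T(Un)_{2,2}: a root with two children, each with two leaves.
T22 = [[1, 2], [3, 4]]
print(num_orderings(T22))   # 8  ==  (2!)^(1 + 2)
```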
By (4), the extended Owen value of node i for tree T can equivalently be computed on tree T_[J]. The implementation of these formulae then only requires enumerating the permutations Π(T_[J]), which helps to drastically reduce the complexity.
Example 3. For T of Fig. 2(left), i = 1, we obtain J = {2, 10, 14}. Figure 2(right)
shows T[J] .
[Fig. 2. Left: the uniform tree T^Un_{3,2}, with leaves 1–8, intermediate nodes 9–14 and root 15. Right: the restricted tree T_[J], with leaves 1 and 2, nodes 9, 10, 13, 14 and root 15.]
In order to demonstrate the gain obtained by using T_[J] instead of T, let us take the example of uniform trees, denoted by T^Un_{d,p} (with d, p ∈ N*), where each aggregation node has exactly p children and each leaf is exactly at depth d from the root. Figure 2 (left) illustrates T^Un_{3,2}. The next lemma gives the expression of the number of permutations associated with the uniform tree T^Un_{d,p}.
Lemma 2. n = |N_{T^Un_{d,p}}| = p^d, |Π(T^Un_{d,p})| = (p!)^{∑_{k=0}^{d−1} p^k}, and |Π((T^Un_{d,p})_[J])| = (p!)^d.
Table 1 below shows a clear benefit of using T[J] instead of T in the computation of
the influence index: the ratio amounts to orders of magnitude when n increases.
Expression (3) takes the form of an average over permutations. The number of terms in the sum in (3) is equal to C(N) := 2^{|N|−1}. We give in this section an equivalent new expression taking advantage of relation (4).
Consider I_i for some fixed i ∈ N. We set V_l := Ch_T(r_{l−1}) for all l ∈ {1, . . . , t}, and V′_l := V_l \ {r_l} – see Fig. 3.
Expression (3) can be turned into a sum over coalitions, which somewhat reduces the computational complexity:
[Fig. 3. The path r_0, r_1, . . . , r_t from the root to node i = r_t, where V_l = Ch_T(r_{l−1}) denotes the children of r_{l−1}.]
Theorem 1. We have
(5)
where S_{l..j} = S_l ∪ · · · ∪ S_j, V_{l..j} = V_l ∪ · · · ∪ V_j, x_k = v^U_k(x), y_k = v^U_k(y), and [y_S x]_T (for S ⊆ T) denotes an alternative taking the value of y in S and the value of x in T \ S.
The computational complexity of (5) is given by the next result.
Lemma 3. The number of terms in (5) is of order C(T) := ∏_{l=1}^{t} 2^{|V_l|}.
The last two columns in Table 1 present the logarithm of the number of operations in the expression of the influence index written over coalitions rather than over permutations. The complexity of computing the influence index when reducing (resp. not reducing) to the restricted tree is denoted by C(T^Un_{d,p}) (resp. C(N_{T^Un_{d,p}})).
Table 1. Logarithm of the number of permutations and subsets for uniform trees T^Un_{d,p}.
In many practical situations, the values of the alternatives are imprecise. We justified this in the introduction, in particular for Example 2. For the sake of simplicity, the imprecision of the two alternatives on which the explanation is computed is given as intervals: x̃ = [x̲, x̄] and ỹ = [y̲, ȳ], with x̲, x̄, y̲, ȳ ∈ X. The problem is to define the influence index between x̃ and ỹ.
The idea is to propagate the imprecision on the values of x and y to the computation of the influence index. The influence of node i in the comparison between x̃ and ỹ is a closed interval defined by

I_i(x̃, ỹ, T, U) = [ I̲_i(x̃, ỹ, T, U), Ī_i(x̃, ỹ, T, U) ],

where

I̲_i(x̃, ỹ, T, U) = min_{x∈x̃} min_{y∈ỹ} I_i(x, y, T, U),
Ī_i(x̃, ỹ, T, U) = max_{x∈x̃} max_{y∈ỹ} I_i(x, y, T, U).
We have

I̲_i(x̃, ỹ, T, U) = min_{x∈x̃} min_{y∈ỹ} I_i(x, y, T, U)
= min_{x∈x̃} min_{y∈ỹ} (1/|Π(T)|) Σ_{π∈Π(T)} [ U(y_i, y_{S_π(i)\{i}}, x_{−S_π(i)}) − U(x_i, y_{S_π(i)\{i}}, x_{−S_π(i)}) ]
= min_{x_{−i}∈x̃_{−i}} min_{y_{−i}∈ỹ_{−i}} (1/|Π(T)|) Σ_{π∈Π(T)} [ U(y̲_i, y_{S_π(i)\{i}}, x_{−S_π(i)}) − U(x̄_i, y_{S_π(i)\{i}}, x_{−S_π(i)}) ],

and

Ī_i(x̃, ỹ, T, U) = max_{x∈x̃} max_{y∈ỹ} I_i(x, y, T, U)
= max_{x∈x̃} max_{y∈ỹ} (1/|Π(T)|) Σ_{π∈Π(T)} [ U(y_i, y_{S_π(i)\{i}}, x_{−S_π(i)}) − U(x_i, y_{S_π(i)\{i}}, x_{−S_π(i)}) ]
= max_{x_{−i}∈x̃_{−i}} max_{y_{−i}∈ỹ_{−i}} (1/|Π(T)|) Σ_{π∈Π(T)} [ U(ȳ_i, y_{S_π(i)\{i}}, x_{−S_π(i)}) − U(x̲_i, y_{S_π(i)\{i}}, x_{−S_π(i)}) ].
where w_l(k) is the weight of node k at aggregation node l; one can then easily show that

I_i(x, y; T; U) = (u_i(y_i) − u_i(x_i)) ∏_{l=0}^{t−1} w_{r_l}(r_{l+1}).    (7)
Even though the complexity of computing Ii (x, y; T ; U ) is linear in the depth of the
tree, the underlying model is very simple and far from being able to capture real-life
preferences.
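Equation (7) can be sketched directly in code; the utilities and weights below are illustrative placeholders, not values from the paper:

```python
def influence_weighted_sum(u_xi, u_yi, path_weights):
    """Equation (7): the influence of leaf i is the utility difference
    (u_i(y_i) - u_i(x_i)) multiplied by the chain of weights
    w_{r_l}(r_{l+1}) along the path r_0, ..., r_t from the root to i."""
    prod = 1.0
    for w in path_weights:
        prod *= w
    return (u_yi - u_xi) * prod

# A leaf whose utility goes from 0 to 1, under weight 0.5 at each of 3 levels:
print(influence_weighted_sum(0.0, 1.0, [0.5, 0.5, 0.5]))  # 0.125
```

The cost is one multiplication per level, hence linear in the depth of the tree, as stated above.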
We are thus looking for a decision model realizing a good compromise between a high representation power (in particular, the ability to capture interactions among attributes) and a low computation time for the influence indices. In this paper we explore the multi-linear model and argue that it achieves such a compromise. Section 5.1 describes the multi-linear model. Section 5.2 shows that the expression of the influence index for the multi-linear model can be drastically simplified in terms of computational complexity. Section 5.3 shows that when the values of the alternatives are uncertain, the computation of the influence remains tractable for the multi-linear model.
where w_l(i) is the weight assigned to node i. This model assumes independence among the criteria.
Without loss of generality, we can assume that the scores lie in the interval [0, 1], where 0 (resp. 1) means that the criterion is not satisfied at all (resp. completely satisfied). In order to represent interactions among criteria, the idea is to assign weights not only to single criteria but also to subsets of criteria. A capacity (also called a fuzzy measure [15]) is a set function v_l : 2^{1,...,n_l} → [0, 1] such that v_l(∅) = 0, v_l({1, ..., n_l}) = 1 and v_l(S) ≤ v_l(T) whenever S ⊆ T [3]. The term v_l(S) represents the aggregated score of an option that is fully satisfied on the criteria in S (with score 1) and not at all satisfied on the other criteria (with score 0).
The Möbius transform of v_l, denoted by m_l : 2^{1,...,n_l} → ℝ, is given by [13]

m_l(A) = Σ_{B⊆A} (−1)^{|A\B|} v_l(B).
A capacity is said to be 2-additive if the Möbius coefficients are zero for all subsets of three or more elements. Two classical aggregation functions can be defined from the Möbius coefficients m_l. The first one is the Choquet integral [3]

C_{m_l}(a) = Σ_{T∈S_l} m_l(T) × min_{m∈T} a_m,

where S_l is the collection of subsets of {1, ..., n_l} on which the Möbius coefficients are non-null. The second one is the multi-linear extension [11]

ML_{m_l}(a) = Σ_{T∈S_l} m_l(T) × ∏_{m∈T} a_m.
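The Möbius transform and the two aggregation functions above can be sketched as follows; the capacity v used here is an illustrative 2-additive example (comparable to the coefficients used at node 7 of Example 4 below, with criteria relabeled):

```python
from itertools import combinations

def mobius(v, criteria):
    """Mobius transform m(A) = sum_{B subseteq A} (-1)^{|A\\B|} v(B)."""
    subsets = [frozenset(c) for r in range(len(criteria) + 1)
               for c in combinations(criteria, r)]
    return {A: sum((-1) ** (len(A) - len(B)) * v[B]
                   for B in subsets if B <= A)
            for A in subsets}

def choquet(m, a):
    """Choquet integral: sum over non-null Mobius terms of m(T) * min_{i in T} a_i."""
    return sum(mT * min(a[i] for i in T) for T, mT in m.items() if T and mT != 0)

def multilinear(m, a):
    """Multi-linear extension (Owen): sum of m(T) * prod_{i in T} a_i."""
    out = 0.0
    for T, mT in m.items():
        if T and mT != 0:
            prod = 1.0
            for i in T:
                prod *= a[i]
            out += mT * prod
    return out

# Illustrative 2-additive capacity on criteria {1, 2}:
v = {frozenset(): 0.0, frozenset({1}): 0.8, frozenset({2}): 1.0,
     frozenset({1, 2}): 1.0}
m = mobius(v, [1, 2])
a = {1: 0.5, 2: 0.5}
print(choquet(m, a), multilinear(m, a))
```

Note that the two aggregators differ on interacting criteria: on this capacity, m({1, 2}) = −0.8, so the Choquet integral and the multi-linear extension give different values at a = (0.5, 0.5).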
The next example illustrates the multi-linear model w.r.t. a two-additive capacity.
Example 4 (Example 2 cont.). After eliciting the tactical operator's preferences, the aggregation functions are given by:
Node 7: There is suspicion of illegal activity whenever either drug or human smuggling is detected. Hence there is redundancy between criteria 2 and 3. As human smuggling (crit. 3) is slightly more important than criterion 2, we obtain v_7^U(x) = 0.8 v_2^U(x) + v_3^U(x) − 0.8 v_2^U(x) × v_3^U(x);
Node 8: v_8^U(x) = (v_4^U(x) + v_5^U(x))/2;
Node 9: Nodes 6 and 8 are redundant, since there is a high risk that the ship escapes interception when it is either close to the shore (crit. 6) or very fast (node 8). Hence v_9^U(x) = 0.8 v_6^U(x) + 0.8 v_8^U(x) − 0.6 v_6^U(x) × v_8^U(x);
Node 10: Nodes 1 and 7 are redundant, since there is suspicion on the ship when the score is high on either node 1 or node 7. Nodes 7 and 9 are complementary, as the risk is not so high for a suspicious ship (high value at node 7) that is easy to intercept (low value at node 9), or for a ship that is difficult to intercept but not suspicious. We have the same behavior between nodes 1 and 9. Hence

v_10^U(x) = [ v_1^U(x) + v_7^U(x) − v_1^U(x) × v_7^U(x) + v_1^U(x) × v_9^U(x) + v_7^U(x) × v_9^U(x) ] / 3.

For x = (+, −, −, +, +, +), we obtain u_2(x) = u_3(x) = 0, u_i(x) = 1 for i ∈ {1, 4, 5, 6}, v_7^U(x) = 0, v_8^U(x) = v_9^U(x) = 1 and U(x) = v_10^U(x) = 2/3.
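A direct transcription of Example 4's aggregation chain (as read here) lets one check the final value U(x) = 2/3:

```python
def U(x):
    """Hierarchical aggregation of Example 4; x maps leaf criteria 1..6 to
    utilities in [0, 1]. Nodes 7-10 follow the multi-linear formulas above."""
    v7 = 0.8 * x[2] + x[3] - 0.8 * x[2] * x[3]
    v8 = (x[4] + x[5]) / 2
    v9 = 0.8 * x[6] + 0.8 * v8 - 0.6 * x[6] * v8
    v10 = (x[1] + v7 - x[1] * v7 + x[1] * v9 + v7 * v9) / 3
    return v10

# x = (+, -, -, +, +, +): u_2 = u_3 = 0 and u_i = 1 elsewhere.
x = {1: 1, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
print(U(x))  # approximately 0.6667 (= 2/3)
```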
We consider the case where all aggregation functions are multi-linear models. We now give the main result of this paper.
Theorem 2. Assume that the aggregation function at node r_l (for l ∈ {0, ..., t−1}) is a multi-linear extension w.r.t. Möbius coefficients m_{r_l}. Then

I_i(x, y; U, T_[J]) = (u_i(y_i) − u_i(x_i)) × ∏_{l=0}^{t−1} Φ_l,    (9)

where

Φ_l = Σ_{T⊆V_{l+1}, T∪{r_{l+1}}∈S_{l+1}} m_{r_l}(T ∪ {r_{l+1}}) × Σ_{S'⊆T} ∏_{m∈T∩S'} y_m × ∏_{m∈T\S'} x_m
× Σ_{s'=0}^{|V_{l+1}|−|T|−1} [ (|V_{l+1}| − |T| − 1)! / (s'! (|V_{l+1}| − |T| − 1 − s')!) ] × [ (|S'| + s')! (|V_{l+1}| − |S'| − s' − 1)! / |V_{l+1}|! ].
In the generic expression of the influence index (see (5)), the complexity of the
computation of Ii grows exponentially with the number t of layers (see Lemma 3).
Thanks to the previous result, one readily sees that the computation of the influence only
grows linearly with the depth of the tree for the multi-linear model. In (9), the influence
index takes the form of a product of an influence computed for each layer, where Φl
is the local influence at aggregation node rl . Hence the computation of (9) becomes
very fast, whatever the depth of the tree and the number of aggregation functions, as the
number of children at each aggregation node is small in practice (in general between
2 and 6). We note that there are strong similarities with the case of a weighted sum – see (7). The weighted sum is a particular case of a multi-linear model in which all Möbius coefficients of subsets of two or more elements are zero. In this case, Φ_l reduces to m_{r_l}({r_{l+1}}), which is equal to the weight w_{r_l}(r_{l+1}) of node r_{l+1} at aggregation node r_l in a weighted sum. Hence (9) reduces to (7) for a weighted sum.
C_MultiLin(T) := 1 + Σ_{l=1}^{t} Σ_{T⊆V_l, T∪{r_l}∈S_l} |V_l| × 2^{|T|} ≤ 1 + Σ_{l=1}^{t} |V_l| 3^{|V_l|}.    (10)

For a 2-additive capacity, this reduces to

C_MultiLin(T) = 1 + Σ_{l=1}^{t} |V_l| [1 + 2 |V_l|].
Hence Φ_0 = 1/2, Φ_1 = 1/2, Φ_2 = 1/2, and I_4(x, y; U, T_[J]) = 0.125.
5.3 Computation of the Influence Index with Imprecise Values for the Multi-linear Model
As in Sect. 5.2, we now assume that all aggregation functions are multi-linear models. From Theorem 2,

I_i(x, y; U, T_[J]) = (y_i − x_i) × ∏_{l=0}^{t−1} Φ_l(x, y),

where

Φ_l(x, y) = Σ_{S⊆V_l} [ |S|! × (|V_l| − |S| − 1)! / |V_l|! ] × Σ_{T⊆V_l} m_l(T ∪ {r_l}) × ∏_{j∈T∩S} y_j × ∏_{j∈T\S} x_j.
Let us start with the computation of the lower bound of the influence of criterion i:

I̲_i(x̃, ỹ, T, U) = min_{x_{−i}∈x̃_{−i}} min_{y_{−i}∈ỹ_{−i}} I_i( (x̄_i, x_{−i}), (y̲_i, y_{−i}); T, U ) = (y̲_i − x̄_i) × ∏_{l=1}^{t} Φ̲_l,

where Φ̲_l = min_{x_{−i}∈x̃_{−i}} min_{y_{−i}∈ỹ_{−i}} Φ_l(x, y). Let k ∈ V_l. Let us analyse the monotonicity of Φ_l(x, y) with respect to the variables x_k and y_k:
Φ_l(x, y) = Σ_{S⊆V_l\{k}} Σ_{T⊆V_l\{k}} [ (|S|! × (|V_l| − |S| − 1)! / |V_l|!) m_l(T ∪ {r_l})    (11)
+ ((|S|+1)! × (|V_l| − |S| − 2)! / |V_l|!) m_l(T ∪ {r_l})
+ (|S|! × (|V_l| − |S| − 1)! / |V_l|!) m_l(T ∪ {r_l, k}) x_k
+ ((|S|+1)! × (|V_l| − |S| − 2)! / |V_l|!) m_l(T ∪ {r_l, k}) y_k ] × ∏_{j∈T∩S} y_j × ∏_{j∈T\S} x_j.
The first two terms in the bracket are constant w.r.t. x_k and y_k. Hence Φ_l is linear in x_k and in y_k. This implies that the minimum of Φ_l(x, y) is attained at an extreme point of the intervals. As this holds for every k, the optimal value can be obtained by enumerating the extreme values. This is not too time consuming, as the number of elements in V_l is not large. A similar approach can be performed to compute Ī_i(x̃, ỹ, T, U).
A more efficient approach can be derived to compute I̲_i(x̃, ỹ, T, U) and Ī_i(x̃, ỹ, T, U) under assumptions on m_l. By (11), if m_l(T ∪ {r_l, k}) ≥ 0 (resp. ≤ 0) for all T ⊆ V_l \ {k}, then Φ_l is monotonically increasing (resp. decreasing) w.r.t. x_k and y_k. Hence the minimum Φ̲_l is attained at x_k = x̲_k (resp. at x_k = x̄_k). This is in particular the case when the Möbius coefficients are 2-additive. Indeed, for a 2-additive capacity, m(T ∪ {r_l, k}) can be non-zero only for T = ∅.
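The vertex-enumeration argument can be sketched generically: for a function that is linear in each variable separately (as Φ_l is in each x_k and y_k), the minimum and maximum over a box of intervals are attained at vertices of the box. The function f below is an illustrative multi-affine example, not one of the paper's Φ_l:

```python
from itertools import product

def bounds_over_box(f, intervals):
    """Min and max of a multi-affine function f over a box given as a list of
    (lo, hi) intervals. Multi-affinity ensures the optimum is at a vertex,
    so enumerating the 2^n vertices is exact."""
    values = [f(v) for v in product(*[(lo, hi) for lo, hi in intervals])]
    return min(values), max(values)

# f(x1, x2) = x1 + x2 - x1*x2 is multi-affine (linear in each variable):
f = lambda v: v[0] + v[1] - v[0] * v[1]
print(bounds_over_box(f, [(0.2, 0.6), (0.1, 0.9)]))
```

With |V_l| children per node, the enumeration costs 2^{|V_l|} evaluations per layer, which stays small for the 2-to-6 children typical in practice.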
References
1. Bacchus, F., Grove, A.: Graphical models for preference and utility. In: Conference on
Uncertainty in Artificial Intelligence (UAI), Montreal, Canada, pp. 3–10, July 1995
2. Beliakov, G., Pradera, A., Calvo, T.: Aggregation Functions: A Guide for Practitioners. Stud-
ies in Fuzziness and Soft Computing, vol. 221. Springer, Heidelberg (2007). https://doi.org/
10.1007/978-3-540-73721-6
3. Choquet, G.: Theory of capacities. Annales de l’Institut Fourier 5, 131–295 (1953)
4. Datta, A., Sen, S., Zick, Y.: Algorithmic transparency via quantitative input influence. In:
IEEE Symposium on Security and Privacy, San Jose, CA, USA, May 2016
5. Diestel, R.: Graph Theory. Springer, New York (2005)
6. Fishburn, P.: Interdependence and additivity in multivariate, unidimensional expected utility
theory. Int. Econ. Rev. 8, 335–342 (1967)
7. Grabisch, M., Marichal, J., Mesiar, R., Pap, E.: Aggregation Functions. Cambridge Univer-
sity Press, Cambridge (2009)
8. Junker, U.: QUICKXPLAIN: preferred explanations and relaxations for over-constrained
problems. In: Proceedings of the 19th National Conference on Artificial Intelligence (AAAI
2004), San Jose, California, pp. 167–172, July 2004
9. Labreuche, C., Fossier, S.: Explaining multi-criteria decision aiding models with an extended
Shapley value. In: Proceedings of the Twenty-Seventh International Joint Conference on
Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, pp. 331–339, July 2018
10. Lundberg, S., Lee, S.: A unified approach to interpreting model predictions. In: Guyon, I.,
et al. (eds.) 31st Conference on Neural Information Processing Systems (NIPS 2017), Long
Beach, CA, USA, pp. 4768–4777 (2017)
11. Owen, G.: Multilinear extensions of games. Management Sci. 18, 64–79 (1972)
12. Ribeiro, M., Singh, S., Guestrin, C.: “Why Should I Trust You?”: explaining the predic-
tions of any classifier. In: KDD 2016 Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, pp.
1135–1144 (2016)
13. Rota, G.: On the foundations of combinatorial theory I. Theory of Möbius functions.
Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 2, 340–368 (1964)
14. Shapley, L.S.: A value for n-person games. In: Kuhn, H.W., Tucker, A.W. (eds.) Contribu-
tions to the Theory of Games, Vol. II. Annals of Mathematics Studies, no. 28, pp. 307–317.
Princeton University Press, Princeton (1953)
15. Sugeno, M.: Fuzzy measures and fuzzy integrals. Trans. S.I.C.E. 8(2), 218–226 (1972)
16. Wachter, S., Mittelstadt, B., Russell, C.: Counterfactual explanations without opening the
black box: automated decisions and the GDPR. Harvard J. Law Technol. 31(2), 841–887
(2018)
Assertional Removed Sets Merging of DL-Lite
Knowledge Bases
1 Introduction
In recent years, there has been an increasing use of ontologies in many application areas, including query answering, the Semantic Web, and information retrieval. Description
Logics (DLs) have been recognized as powerful formalisms for both representing and
reasoning about ontologies. A DL knowledge base is built upon two distinct compo-
nents: a terminological base (called TBox), representing generic knowledge about an
application domain, and an assertional base (called ABox), containing assertional facts
that instantiate terminological knowledge. Among Description Logics, a lot of attention
was given to DL-Lite [12], a lightweight family of DLs specifically tailored for appli-
cations that use huge volumes of data for which query answering is the most important
reasoning task. DL-Lite guarantees a low computational complexity of the reasoning
process.
In many practical situations, data are provided by several and potentially conflicting
sources, where getting meaningful answers to queries is challenging. While the avail-
able sources are individually consistent, gathering them together may lead to inconsis-
tency. Dealing with inconsistency in query answering has received a lot of attention
in recent years. For example, a general framework for inconsistency-tolerant semantics
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 207–220, 2019.
https://doi.org/10.1007/978-3-030-35514-2_16
was proposed in [4, 5]. This framework considers two key notions: modifiers and inference strategies. Inconsistency-tolerant query answering is seen as composed of a modifier, which transforms the original ABox into a set of repairs, i.e. subsets of the original ABox that are consistent w.r.t. the TBox, and an inference strategy, which evaluates queries over these repairs. Interestingly enough, such a setting covers the main existing works on inconsistency-tolerant query answering (see e.g. [2, 9, 22]). Pulling together the data provided by the available sources and then applying an inconsistency-tolerant query answering semantics provides a solution to deal with inconsistency. However, in this case valuable information about the sources is lost. This information is indeed important when trying to find better strategies to deal with inconsistency during the merging process.
This paper addresses query answering by merging data sources. Merging consists
in achieving a synthesis between pieces of information provided by different sources.
The aim of merging is to provide a consistent set of information, making maximum use
of the information provided by the sources while not favoring any of them. Merging
is an important issue in many fields of Artificial Intelligence [10]. Within the classical logic setting, belief merging has been studied from different standpoints. One can distinguish model-based approaches, which perform a selection among the interpretations that are closest to the original belief bases. Postulates characterizing the rational behaviour of such merging operators, known as IC postulates, have been proposed by Revesz [25] and improved by Konieczny and Pérez [21], in the same spirit as the seminal AGM postulates for revision [1]. Several concrete merging operators have
been proposed [11, 20, 21, 23, 26]. In contrast to model-based approaches, the formula-
based approaches perform selection on the set of formulas that are explicitly encoded
in the initial belief bases. Some of these approaches have been adapted in the con-
text of DL-Lite [13]. Falappa et al. [14] proposed a set of postulates to characterize
the behaviour of belief bases merging operators and concrete merging operators have
been proposed [6, 8, 14, 17, 19, 24]. Among these formula-based merging approaches, the Removed Sets Fusion approach was proposed in [17, 18] for merging propositional belief bases. This approach consists in removing a minimal subset of formulae, called a removed set, in order to restore consistency. The notion of minimality in Removed Sets Fusion stems from the operator used to perform merging, which can be the sum (Σ), the cardinality (Card), the maximum (Max), or the lexicographic ordering (GMax). This approach has shown interesting properties: it is not too cautious, and it satisfies most rational IC postulates when extended to belief set revision.
This paper studies DL-Lite Assertional Removed Sets Fusion (ARSF). The main
motivation in considering ARSF is to take advantage of the tractability of DL-Lite for
the merging process and the rational properties satisfied by ARSF operators. We consider in particular DL-LiteR, a member of the DL-Lite family which offers a good compromise between expressive power and computational complexity and underlies the OWL2-QL profile. We propose several merging strategies based on different definitions of the minimality criterion, and we give a characterization of these merging strategies. The last section contains algorithms, based on the notion of hitting sets, for computing the merging outcome.
2 Background
In this paper, we only consider DL-LiteR, denoted by L, which underlies OWL2-QL. However, the results of this work can easily be generalized to several members of the DL-Lite family (see [3] for more details about the DL-Lite family).
Syntax. A DL-Lite knowledge base K = ⟨T, A⟩ is built upon a set of atomic concepts (i.e. unary predicates), a set of atomic roles (i.e. binary predicates) and a set of individuals (i.e. constants). Complex concepts and roles are formed as follows:

B ::= A | ∃R    C ::= B | ¬B    R ::= P | P⁻    E ::= R | ¬R,

where A (resp. P) is an atomic concept (resp. role). B (resp. C) are called basic (resp. complex) concepts, and R (resp. E) are called basic (resp. complex) roles. The TBox T consists of a finite set of inclusion axioms between concepts, of the form B ⊑ C, and inclusion axioms between roles, of the form R ⊑ E. The ABox A consists of a finite set of membership assertions on atomic concepts and atomic roles, of the form A(a_i) and P(a_i, a_j), where a_i and a_j are individuals. For the sake of simplicity, in the rest of this paper, when there is no ambiguity we simply write DL-Lite instead of DL-LiteR.
Semantics. The DL-Lite semantics is given by an interpretation I = (Δ^I, ·^I), which consists of a nonempty domain Δ^I and an interpretation function ·^I. The function ·^I assigns to each individual a an element a^I ∈ Δ^I, to each concept C a subset C^I ⊆ Δ^I, and to each role R a binary relation R^I ⊆ Δ^I × Δ^I. Moreover, the interpretation function ·^I is extended to all the constructs of DL-LiteR. For instance: (¬B)^I = Δ^I \ B^I, (∃R)^I = {x ∈ Δ^I | ∃y ∈ Δ^I such that (x, y) ∈ R^I}, and (P⁻)^I = {(y, x) ∈ Δ^I × Δ^I | (x, y) ∈ P^I}. Concerning the TBox, we say that I satisfies a concept (resp. role) inclusion axiom, denoted by I ⊨ B ⊑ C (resp. I ⊨ R ⊑ E), iff B^I ⊆ C^I (resp. R^I ⊆ E^I). Concerning the ABox, we say that I satisfies a concept (resp. role) membership assertion, denoted by I ⊨ A(a_i) (resp. I ⊨ P(a_i, a_j)), iff a_i^I ∈ A^I (resp. (a_i^I, a_j^I) ∈ P^I). Finally, an interpretation I is said to satisfy K = ⟨T, A⟩ iff I satisfies every axiom in T and every assertion in A. Such an interpretation is said to be a model of K.
Incoherence and Inconsistency. Two kinds of inconsistency can be distinguished in the DL setting: incoherence and inconsistency [7]. A knowledge base is said to be inconsistent iff it does not admit any model, and it is said to be incoherent iff there exists at least one non-satisfiable concept C, namely such that C^I = ∅ for every interpretation I that is a model of T. In the DL-Lite setting, a TBox T = {PIs, NIs} can be viewed as composed of positive inclusion axioms (PIs) and negative inclusion axioms (NIs). PIs are of the form B_1 ⊑ B_2 or R_1 ⊑ R_2, and NIs are of the form B_1 ⊑ ¬B_2 or R_1 ⊑ ¬R_2. The negative closure of T, denoted by cln(T), represents the propagation of the NIs using both PIs and NIs in the TBox (see [12] for more details). Important properties have been established in [12] for consistency checking in DL-Lite: K is consistent if and only if ⟨cln(T), A⟩ is consistent. Moreover, every DL-Lite knowledge base with only PIs in its TBox is always satisfiable. However, when T contains NI axioms, the DL-Lite knowledge base may be inconsistent, and in an assertional-based approach only elements of the ABoxes are removed to restore consistency [13].
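To make conflict detection concrete, here is a toy sketch restricted to NIs between atomic concepts, of the form A ⊑ ¬B (all names are illustrative): an ABox is inconsistent w.r.t. such a TBox iff it contains a conflicting pair of assertions on the same individual.

```python
def conflicts(neg_inclusions, abox):
    """Pairs of assertions violating a NI  A ⊑ ¬B.
    neg_inclusions: set of (A, B) concept-name pairs;
    abox: set of (ConceptName, individual) membership assertions."""
    out = set()
    for (A, B) in neg_inclusions:
        for (C, ind) in abox:
            if C == A and (B, ind) in abox:
                out.add(frozenset({(A, ind), (B, ind)}))
    return out

T_neg = {("A", "B")}                           # the NI  A ⊑ ¬B
abox = {("A", "a"), ("B", "a"), ("A", "b")}
print(conflicts(T_neg, abox))                  # the single conflict {A(a), B(a)}
```

In full DL-LiteR, one would first compute cln(T) and also handle role assertions, but the principle is the same: consistency reduces to the absence of such conflicts.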
We denote by R_P(M_K) the set of assertional removed sets according to the strategy P of M_K. If M_K is consistent then R_P(M_K) = {∅}. The usual merging strategies, sum-based (Σ), cardinality-based (Card), maximum-based (Max) and lexicographic ordering (GMax), are captured by the following total pre-orders. We denote by s(M_A) the ABox obtained from M_K where every assertion expressed more than once is reduced to a singleton.
(Σ): X ≤_Σ Y if Σ_{1≤i≤n} |X ∩ A_i| ≤ Σ_{1≤i≤n} |Y ∩ A_i|.
(Card): X ≤_Card Y if |X ∩ s(M_A)| ≤ |Y ∩ s(M_A)|.
(Max): X ≤_Max Y if max_{1≤i≤n} |X ∩ A_i| ≤ max_{1≤i≤n} |Y ∩ A_i|.
(GMax): For every potential assertional removed set X and every ABox A_i, we define p_X^{A_i} = |X ∩ A_i|. Let L_X^{M_A} be the sequence (p_X^{A_1}, ..., p_X^{A_n}) sorted in decreasing order; then X ≤_GMax Y if L_X^{M_A} ≤_lex L_Y^{M_A}.
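The four strategies can be sketched as cost functions on candidate removed sets (a sketch only: assertions are encoded as plain strings and the ABoxes are illustrative):

```python
def costs(X, aboxes):
    """Per-strategy cost of a candidate removed set X against the ABoxes
    A_1, ..., A_n of the MBox. GMax costs are compared lexicographically."""
    per_abox = [len(X & A) for A in aboxes]
    union = set().union(*aboxes)
    return {
        "Sigma": sum(per_abox),
        "Card": len(X & union),                         # duplicates count once
        "Max": max(per_abox),
        "GMax": tuple(sorted(per_abox, reverse=True)),  # decreasing sequence
    }

aboxes = [{"A(a)"}, {"A(b)", "B(a)"}, {"B(a)", "B(b)"}]
print(costs({"A(a)", "A(b)"}, aboxes))
print(costs({"B(a)", "A(b)"}, aboxes))
```

Representing the GMax cost as a Python tuple makes the lexicographic comparison come for free, since tuples compare element by element.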
4 Logical Properties
Within the context of propositional logic, postulates have been proposed in order to classify reasonable belief base merging operators [14–16]³. In order to give logical properties of ARSF operators, we first rephrase these postulates within the DL-Lite framework, and then analyse to which extent the proposed operators satisfy these postulates, for any selection function.
Let M_K = ⟨T, M_A⟩ and M_K' = ⟨T, M_A'⟩ be two MBox DL-Lite knowledge bases, let Δ be an assertional-based merging operator, and let ⟨T, Δ(M_A)⟩ be the DL-Lite knowledge base resulting from merging, where Δ(M_A) is a set of assertions. Let σ be a permutation over {1, ..., n}; for M_A = {A_1, ..., A_n} a multiset of ABoxes, σ(M_A) denotes the multiset {A_σ(1), ..., A_σ(n)}. We rephrase the postulates as follows:
Inclusion Δ(M_A) ⊆ A_1 ∪ ... ∪ A_n.
Symmetry For any permutation σ over {1, ..., n}, Δ(σ(M_A)) = Δ(M_A).
Consistency ⟨T, Δ(M_A)⟩ is consistent.
Congruence If A_1 ∪ ... ∪ A_n = A_1' ∪ ... ∪ A_n' then Δ(M_A) = Δ(M_A').
Vacuity If ⟨T, M_A⟩ is consistent then Δ(M_A) = A_1 ∪ ... ∪ A_n.
Reversion If ⟨T, M_A⟩ and ⟨T, M_A'⟩ have the same minimal inconsistent subsets then (A_1 ∪ ... ∪ A_n) \ Δ(M_A) = (A_1' ∪ ... ∪ A_n') \ Δ(M_A').
² In each column, the assertional removed sets are in bold.
³ We do not consider the IC postulates [21], since they apply to belief sets and not to belief bases.
Inclusion states that the union of the initial ABoxes is the upper bound of any merging operation. Symmetry establishes that all ABoxes are considered of equal importance. Consistency requires the consistency of the result of merging. Congruence requires that the result of merging should not depend on syntactic properties of the ABoxes. Vacuity says that if the union of the ABoxes is consistent w.r.t. T, then the result of merging equals this union. Reversion says that if the ABoxes have the same minimal inconsistent subsets w.r.t. T, then the assertions erased from the respective ABoxes are the same. Core-retainment and Relevance express the intuition that nothing is removed from the original ABoxes unless its removal in some way contributes to making the result consistent.
Proposition 1. Let M_K = ⟨T, M_A⟩ be an MBox DL-Lite knowledge base. For any selection function, ∀P ∈ {Σ, Card, Max, GMax}, Δ^arsf_P satisfies Inclusion, Symmetry, Consistency, Vacuity, Core-retainment and Relevance. Δ^arsf_Card satisfies Congruence and Reversion, but ∀P ∈ {Σ, Max, GMax}, Δ^arsf_P satisfies neither Congruence nor Reversion.
(Sketch of the proof.) For any selection function, by Definitions 4 and 5, ∀P ∈ {Σ, Card, Max, GMax}, Δ^arsf_P satisfies Inclusion, Symmetry, Consistency, Vacuity and Core-retainment.
Relevance: By Definition 5, for any selection function f, ∀P ∈ {Σ, Card, Max, GMax}, if α ∈ A_1 ∪ ... ∪ A_n and α ∉ Δ^arsf_P(M_A), then α ∈ f(R_P(M_K)). Let A' = Δ^arsf_P(M_A); A' is consistent and A' ∪ {α} is inconsistent, since α ∈ f(R_P(M_K)) and f(R_P(M_K)) is an assertional removed set. By Definition 5, Δ^arsf_Card satisfies Congruence and Reversion, since every assertion expressed more than once is reduced to a singleton.
We provide a counter-example for Δ^arsf_P, for P ∈ {Σ, Max, GMax}. Let M_K = ⟨T, M_A⟩ be an inconsistent MBox DL-Lite knowledge base such that T = {A ⊑ ¬B} and A_1 = {A(a)}, A_2 = {A(b), B(a)}, A_3 = {B(a), B(b)}. The potential assertional removed sets are PR(M_K) = {X_1, X_2, X_3, X_4} with X_1 = {A(a), A(b)}, X_2 = {A(a), B(b)}, X_3 = {B(a), A(b)}, X_4 = {B(a), B(b)}, and the sets of assertional removed sets are R_Σ(M_K) = {X_1, X_2}, R_Max(M_K) = {X_1, X_2} and R_GMax(M_K) = {X_1, X_2}.
Besides, let M_K' = ⟨T, M_A'⟩ be an inconsistent MBox DL-Lite knowledge base such that T = {A ⊑ ¬B} and A_1' = {A(a), B(b)}, A_2' = {B(a)}, A_3' = {A(a), A(b)}. We have (A_1 ∪ A_2 ∪ A_3) = (A_1' ∪ A_2' ∪ A_3') and PR(M_K) = PR(M_K'), and the sets of assertional removed sets are R_Σ(M_K') = {X_3, X_4}, R_Max(M_K') = {X_3, X_4} and R_GMax(M_K') = {X_3, X_4}.
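A counter-example of this shape can be checked by brute force; the sketch below uses ABoxes chosen so that both unions coincide (assertions encoded as strings; only the Σ strategy is shown, Max and GMax behaving analogously here):

```python
from itertools import combinations

# Conflicts induced by T = {A ⊑ ¬B}: {A(a), B(a)} and {A(b), B(b)}.
confs = [{"A(a)", "B(a)"}, {"A(b)", "B(b)"}]
universe = set().union(*confs)

def minimal_hitting_sets(confs, universe):
    """All subset-minimal hitting sets, by brute force over subsets."""
    hits = [set(s) for r in range(len(universe) + 1)
            for s in combinations(sorted(universe), r)
            if all(set(s) & c for c in confs)]
    return [h for h in hits if not any(g < h for g in hits)]

def sigma(X, aboxes):
    """Sum-based cost of a candidate removed set X."""
    return sum(len(X & A) for A in aboxes)

aboxes1 = [{"A(a)"}, {"A(b)", "B(a)"}, {"B(a)", "B(b)"}]   # M_A
aboxes2 = [{"A(a)", "B(b)"}, {"B(a)"}, {"A(a)", "A(b)"}]   # M_A'
pr = minimal_hitting_sets(confs, universe)
best1 = min(sigma(X, aboxes1) for X in pr)
best2 = min(sigma(X, aboxes2) for X in pr)
print([sorted(X) for X in pr if sigma(X, aboxes1) == best1])  # removed sets of M_A
print([sorted(X) for X in pr if sigma(X, aboxes2) == best2])  # removed sets of M_A'
```

The two MBoxes share the same union and the same conflicts, yet the Σ-minimal removed sets differ, which is exactly why Congruence and Reversion fail for Σ.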
The proof is straightforward following Definition 2. Notice that the computation of the set of conflicts C(M_K) can be done in polynomial time w.r.t. the size of M_K; such an algorithm can be found e.g. in [7]. In the following, we provide a single algorithm to compute the potential assertional removed sets and the assertional removed sets according to the strategies Card, Σ, Max and GMax. We give explanations on the different use cases of this algorithm hereafter. For a given assertional base M_K, the outcome of Algorithm 1 depends on the value of the parameter P: if P ∈ {Card, Σ, Max, GMax}, then the result is R_P(M_K); otherwise the result is PR(M_K).
Let us first focus on the computation of PR(MK ). The algorithm is an adaptation
of the algorithm for the computation of the minimal hitting sets w.r.t. set inclusion of
a collection of sets described in [28]. It relies on the breadth-first construction of a
directed acyclic graph called an HS-dag. An HS-dag T is a dag with labeled nodes and
edges such that: (i) The root is labeled with ∅ if C(MK ) is empty, otherwise it is labeled
with an arbitrary element of C(MK ); (ii) for each node n of T , we denote by H(n) the
set of edge labels on the path from n to the root of T ; (iii) The label of a node n is any
set C ∈ C(MK ) such that C ∩ H(n) = ∅ if such a set exists. Otherwise n is labeled
with ∅. Nodes labeled with ∅ are called terminal nodes; (iv) If n is labeled by a set C,
then for each α ∈ C, n has a successor nα , joined to n by an edge labeled by α.
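The HS-dag construction just described can be approximated in a few lines when, as here, every conflict is small; nodes are represented only by their H(n) sets, and the reuse and closing rules are the set checks in the loop (a simplified sketch, not Algorithm 1 itself):

```python
def reiter_min_hitting_sets(conflicts):
    """Breadth-first HS-dag sketch. A node is represented by H(n), the set of
    edge labels from the root. A node labelled with a conflict C (C ∩ H = ∅)
    branches on the elements of C; a node with no such C is terminal (∅)."""
    results, level = [], [frozenset()]
    while level:
        next_level, seen = [], set()
        for H in level:
            # closing rule: prune H if it contains an already-found hitting set
            if any(r <= H for r in results):
                continue
            C = next((c for c in conflicts if not (c & H)), None)
            if C is None:
                results.append(H)          # terminal node: H hits every conflict
            else:
                for a in C:                # one child branch per element of C
                    child = H | {a}
                    if child not in seen:  # reuse rule: merge equal H(n)
                        seen.add(child)
                        next_level.append(child)
        level = next_level
    return results

# Conflicts of Example 3 below: {A(a), B(a)} and {C(a), B(a)}.
print(reiter_min_hitting_sets([{"A(a)", "B(a)"}, {"C(a)", "B(a)"}]))
```

Because the construction is breadth-first, hitting sets are found level by level in order of increasing cardinality, which is what the Card strategy exploits below.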
In our case, the elements C ∈ C(M_K) are such that |C| = 2 (see [12]), so the HS-dag is binary. Algorithm 1 computes the potential assertional removed sets by computing the minimal hitting sets, w.r.t. set inclusion, of C(M_K). It builds a pruned HS-dag in breadth-first order, using pruning rules to avoid a complete development of the branches. We move the processing of the left and right child nodes into a separate function (described in Algorithm 2): first, this keeps the algorithm short and simple; second, it facilitates the extension of the algorithm to the computation of the assertional removed sets according to the different strategies.
PrevQ and CurQ are sets containing the nodes of the previous and the current level, respectively. label(n) denotes the label of a node n; similarly, if b is a branch, label(b) denotes the label of b. left_branch(n) (resp. right_branch(n)) denotes the left (resp. right) branch under the node n, and left_child(n) (resp. right_child(n)) denotes the left (resp. right) child of n. The algorithm iterates over the nodes of a level and tries to develop the branches under each of these nodes. The central property is that the conflict C labeling a node n is such that C ∩ H(n) = ∅.
Pruning rules are applied when trying to develop the left and right branches of some parent node pa (lines 4–22 in function PROCESSCHILD, Algorithm 2). Let us briefly describe them: (i) if there exists a node n' on the same level as the currently developed child branch such that H(n') = H(pa) ∪ {b_label} (b_label being the label of the currently developed child branch), we connect the child branch to n', and no node is created (line 4); (ii) if there exists a node n' in the HS-dag such that H(n') ⊂ H(pa) ∪ {b_label} and n' is a terminal node, then the node connected to the child branch is a closed node, marked with × (line 6); (iii) otherwise, the node connected to the child branch is labelled by a conflict C such that (H(pa) ∪ {b_label}) ∩ C = ∅. This new node is added to the current level queue.
Now we explain the aspects of the computation of the assertional removed sets specific to each strategy P.
Card strategy. The Card strategy is the simplest one to implement. First, observe that the level of a node n in the HS-dag is equal to the cardinality of H(n). This means that if n is an end node (a node labeled with ∅), the cardinality of the corresponding minimal hitting set is |H(n)|. Thus, there is no need to continue the construction of the HS-dag, as we are only interested in hitting sets that are minimal w.r.t. cardinality. In the light of this observation, the only modification of the algorithm is the use of a boolean flag mincard, which halts the computation at the end of the level where the first potential assertional removed set has been detected.
Σ, Max and GMax strategies. For these strategies, we have no guarantee that the assertional removed sets reside on the same level of the tree, as illustrated by the following example for the Σ strategy.
Example 3. Let MK = T , MA be an inconsistent MBox DL-Lite knowledge base
such that T = {A ¬B, C ¬B}, and A1 = {A(a)}, A2 = {C(a)}, A3 = {B(a)},
A4 = {B(a)}, A5 = {B(a)}. We have PR(MK ) = {{A(a), C(a)}, {B(a)}} and
RΣ (MK ) = {{A(a), C(a)}}. Thus the only assertional removed set is found at level
2, while the first potential assertional removed set is found at level 1.
Similar examples can be exhibited for the Max and GMax strategies. The search strategy and the associated pruning techniques for Σ, Max and GMax are located in lines 9 and 16 of Algorithm 2. They rely on a cost function which takes as parameters a strategy and a set S of ABox assertions. The cost functions are defined according to the strategies, that is, given an MBox M_A = {A_1, ..., A_n}: for the Σ strategy, COST(Σ, S) computes |S ∩ A_1| + ... + |S ∩ A_n|; for the Max strategy, COST(Max, S) computes max(|S ∩ A_1|, ..., |S ∩ A_n|); for the GMax strategy, using p_S^{A_i} = |S ∩ A_i|, COST(GMax, S) computes L_S^{M_A}, which is the sequence (p_S^{A_1}, ..., p_S^{A_n}) sorted in decreasing order, costs being compared lexicographically.
The variable MinCost maintains the current minimal cost. In line 9 of Algorithm 2, if the cost of the current node is greater than MinCost, then the node is closed, as it cannot be optimal. Otherwise we create a new node, labelled with a conflict which does not intersect H(pa) ∪ {b_label}. If no such label can be found (line 16), i.e. the current node is a terminal node, then: (i) we are assured that COST(P, H(pa) ∪ {b_label}) ≤ MinCost, so we add the new node to the set of currently optimal nodes (line 22); (ii) if the cost of the current node is strictly less than MinCost, then we close all nodes currently believed to be optimal, empty the set containing them, and update MinCost (lines 18–21).
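The bookkeeping at a terminal node can be sketched as follows (a hypothetical helper, not the paper's pseudocode; it works both for numeric costs, as in Σ and Max, and for tuple-valued GMax costs):

```python
def record_terminal(node, node_cost, min_cost, min_nodes):
    # Mirrors points (i)-(ii) above: if the new cost is strictly better,
    # discard the previously optimal nodes and lower MinCost; then record
    # the node, whose cost is guaranteed to be <= MinCost at this point.
    if node_cost < min_cost:
        min_nodes.clear()
        min_cost = node_cost
    min_nodes.add(node)
    return min_cost
```

For instance, a terminal node whose cost equals MinCost simply joins the set of optimal nodes, while a strictly cheaper one replaces the whole set.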
Example 4. We illustrate the operation of the algorithm with the computation of the assertional removed sets of Example 2. Figure 1 depicts the HS-dag built by Algorithm 1. Circled numbers show the ordering of the nodes (apart from the root, which is obviously the first node).
[Fig. 1. HS-dag: the root is labelled with the conflict {A(a), B(a)}; branch A(a) leads to node 1 and branch B(a) leads to node 2, both labelled with the conflict {C(a), D(a)}.]
the new node is labelled with ∅. Whatever the strategy is, its cost is necessarily less than MinCost, which has been initialized to ∞. Thus MinCost is updated to the cost of node 3 depending on the strategy, and node 3 is added to the MinNodes set.
State: MinNodes = {3}, MinCostΣ = 3, MinCostMax = 2, MinCostGMax = (2, 1, 0).
– ProcessChild(β, no, CurQ, MK, MinCost, MinNodes, P) (right branch of node 1) is called. None of the pruning conditions in lines 4, 6 and 9 apply, so node 4 is created. As there is no conflict C such that C ∩ H(4) = ∅, the new node is labelled with ∅. For strategy Σ, the cost of node 4 is equal to MinCost, thus node 4 is added to the MinNodes set. For strategies Max and GMax, the cost of node 4 is less than MinCost: node 3 is closed (line 18), set MinNodes is emptied, and MinCost is updated.
State: MinNodesΣ = {3, 4}, MinNodesMax = {4}, MinNodesGMax = {4}, MinCostΣ = 3, MinCostMax = 1, MinCostGMax = (1, 1, 1).
– Left and right branches of node 2 are labelled respectively with C(a) and D(a), the members of the label (lines 16–17 of Algorithm 1).
– ProcessChild(α, no, CurQ, MK, MinCost, MinNodes, P) (left branch of node 2) is called. None of the pruning conditions in lines 4, 6 and 9 apply, so node 5 is created. As there is no conflict C such that C ∩ H(5) = ∅, the new node is labelled with ∅. For strategy Σ, the cost of node 5 (2) is less than MinCost. The same applies for GMax.
State: MinNodesΣ = {5}, MinNodesMax = {4, 5}, MinNodesGMax = {5}, MinCostΣ = 2, MinCostMax = 1, MinCostGMax = (1, 1, 0).
– ProcessChild(β, no, CurQ, MK, MinCost, MinNodes, P) (right branch of node 2) is called. None of the pruning conditions apply, so node 6 is created. As there is no conflict C such that C ∩ H(6) = ∅, the new node is labelled with ∅. For strategy Σ, the cost of node 6 (2) is equal to MinCost.
State: MinNodesΣ = {5, 6}, MinNodesMax = {4, 5}, MinNodesGMax = {5}, MinCostΣ = 2, MinCostMax = 1, MinCostGMax = (1, 1, 0).
6 Conclusion
belief merging to DL-Lite. This approach differs from the one we propose, since we extend formula-based merging to DL-Lite.
In future work, we plan to conduct a complexity analysis of the proposed algorithm for the different merging strategies. Moreover, we also want to focus on the implementation of the ARSF operators and on an experimental study on real-world applications, in particular 3D surveys within the context of underwater archaeology and handling conflicts in dance videos. Furthermore, since the ARSF operators stem from a selection function that selects one assertional removed set, we also plan to investigate operators stemming from other selection functions, as well as other strategies and other approaches than ARSF for performing assertional-based merging.
An Interactive Polyhedral Approach for Multi-objective Combinatorial Optimization with Incomplete Preference Information
1 Introduction
The increasing complexity of applications encountered in Computer Science sig-
nificantly complicates the task of decision makers who need to find the best
solution among a very large number of options. Multi-objective optimization
is concerned with optimization problems involving several (conflicting) objec-
tives/criteria to be optimized simultaneously (e.g., minimizing costs while max-
imizing profits). Without preference information, we only know that the best
solution for the decision maker (DM) is among the Pareto-optimal solutions (a
solution is called Pareto-optimal if there exists no other solution that is better
on all objectives while being strictly better on at least one of them). The main
problem with this kind of approach is that the number of Pareto-optimal solutions can be intractable, that is, exponential in the size of the problem (e.g., [13] for the multicriteria spanning tree problem). One way to address this issue is to restrict the size of the Pareto set in order to obtain a "well-represented" Pareto set; this approach is often based on a division of the objective space into different regions (e.g., [15]) or on ε-dominance (e.g., [18]). However, whenever the DM needs to identify the best solution, it seems more appropriate to refine the Pareto dominance relation with preferences to determine a single solution satisfying the subjective preferences of the DM. Of course, this implies the participation of the DM, who has to give us some insights and share her preferences.
In this work, we assume that the DM’s preferences can be represented by a
parameterized scalarizing function (e.g., a weighted sum), allowing some trade-
off between the objectives, but the corresponding preference parameters (e.g.,
the weights) are initially not known; hence, we have to consider the set of all
parameters compatible with the collected preference information. An interesting
approach to deal with preference imprecision has been recently developed [19,
21,30] and consists in determining the possibly optimal solutions, that is the
solutions that are optimal for at least one instance of the preference parameters.
The main drawback of this approach, though, is that the number of possibly
optimal solutions may still be very large compared to the number of Pareto-
optimal solutions; therefore there is a need for elicitation methods aiming to
specify the preference model by asking preference queries to the DM.
In this paper, we study the potential of incremental preference elicitation
(e.g., [23,27]) in the framework of multi-objective combinatorial optimization.
Preference elicitation on combinatorial domains is an active topic that has been
recently studied in various contexts, e.g. in multi-agents systems [1,3,6], in stable
matching problems [9], in constraint satisfaction problems [7], in Markov Deci-
sion Processes [11,24,28] and in multi-objective optimization problems [4,14,16].
Our aim here is to propose a general interactive approach for multi-objective
optimization with imprecise preference parameters. Our approach identifies
informative preference queries by exploiting the extreme points of the polyhe-
dron representing the admissible preference parameters. Moreover, these extreme
points are also used to provide a stopping criterion which guarantees the determination of the (near-)optimal solution. Our approach is general in the sense that it can be applied to any multi-objective optimization problem, provided that the scalarizing function is linear in its preference parameters (e.g., weighted
sums, Choquet integrals [8,12]) and that there exists an efficient algorithm to
solve the problem when preferences are precisely known (e.g., [17,22] for the
minimum spanning tree problem with a weighted sum).
The paper is organized as follows: We first give general notations and recall
the basic principles of regret-based incremental elicitation. We then propose
a new interactive method based on the minimax regret decision criterion and
In this definition, X is the feasible set in the decision space, typically defined by
some constraint functions (e.g., for the multicriteria spanning tree problem, X is
the set of all spanning trees of the graph). In this problem, any solution x ∈ X is
associated with a cost vector y(x) = (y1 (x), . . . , yn (x)) ∈ Rn where yi (x) is the
evaluation of x on the i-th criterion/objective. Thus the image of the feasible
set in the objective space is defined by {y(x) : x ∈ X } ⊂ Rn .
Solutions are usually compared through their images in the objective space
(also called points) using the Pareto dominance relation: we say that point u =
(u1 , . . . , un ) ∈ Rn Pareto dominates point v = (v1 , . . . , vn ) ∈ Rn (denoted by
u ≺P v) if and only if ui ≤ vi for all i ∈ {1, . . . , n}, with at least one strict
inequality. Solution x∗ ∈ X is called efficient if there does not exist any other
feasible solution x ∈ X such that y(x) ≺P y(x∗ ); its image in objective space is
then called a non-dominated point.
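For minimized objectives, this dominance test is a one-liner; the sketch below is an illustration, not code from the paper:

```python
def pareto_dominates(u, v):
    # u Pareto dominates v (minimization): u is no worse on every
    # objective and strictly better on at least one of them
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))
```

A solution is efficient exactly when no feasible cost vector dominates its image under this test.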
We assume here that the DM's preferences over solutions can be represented by a parameterized scalarizing function fω that is linear in its parameters ω. Solution x ∈ X is preferred to solution x′ ∈ X if and only if fω(y(x)) ≤ fω(y(x′)). To give a few examples, function fω can be a weighted sum (i.e. fω(y(x)) = Σ_{i=1}^{n} ωi yi(x)) or a Choquet integral with capacity ω [8,12]. We also assume that
parameters ω are not known initially. Instead, we consider a (possibly empty)
set Θ of pairs (u, v) ∈ Rn × Rn such that u is known to be preferred to v; this set
can be obtained by asking preference queries to the DM. Let ΩΘ be the set of all
parameters ω that are compatible with Θ, i.e. all parameters ω that satisfy the
constraints fω (u) ≤ fω (v) for all (u, v) ∈ Θ. Thus, since fω is linear in ω, we can
assume that ΩΘ is a convex polyhedron throughout the paper. The problem is
now to determine the most promising solution under the preference imprecision
(defined by ΩΘ ). To do so, we use the minimax regret approach (e.g., [7]) which
is based on the following definitions:
MR(x, X, ΩΘ) = max_{x′ ∈ X} PMR(x, x′, ΩΘ)
1. First, the set of all extreme points of polyhedron ΩΘ is generated. This set is denoted by EPΘ and its kth element is denoted by ω^k.
2. Then, for every point ω^k ∈ EPΘ, P is solved considering the precise scalarizing function fω^k (the corresponding optimal solution is denoted by x^k).
3. Finally, MMR(XΘ, ΩΘ) is computed, where XΘ = {x^k : k ∈ {1, . . . , |EPΘ|}}. If this value is strictly larger than δ, then the DM is asked to compare two solutions x, x′ ∈ XΘ and ΩΘ is updated by imposing the linear constraint fω(x) ≤ fω(x′) (or fω(x) ≥ fω(x′), depending on her answer); the algorithm stops otherwise.
Proof. Let x∗ be the returned solution and let K be the number of extreme points of ΩΘ at the end of the execution. For all k ∈ {1, . . . , K}, let ω^k be the kth extreme point of ΩΘ and let x^k be a solution minimizing function fω^k. Let XΘ = {x^k : k ∈ {1, . . . , K}}. We know that MR(x∗, XΘ, ΩΘ) ≤ δ holds at the end of the while loop (see the loop condition); hence we have fω(x∗) − fω(x^k) ≤ δ for all solutions x^k ∈ XΘ and all parameters ω ∈ ΩΘ (see Definition 2).
We want to prove that MR(x∗, X, ΩΘ) ≤ δ holds at the end of the execution. To do so, it is sufficient to prove that fω(x∗) − fω(x) ≤ δ holds for all x ∈ X and all ω ∈ ΩΘ. Since ΩΘ is a convex polyhedron, for any ω ∈ ΩΘ, there exists a vector λ = (λ1, . . . , λK) ∈ [0, 1]^K such that Σ_{k=1}^{K} λk = 1 and ω = Σ_{k=1}^{K} λk ω^k. Therefore, for all solutions x ∈ X and for all parameters ω ∈ ΩΘ, we have:

fω(x∗) − fω(x) = Σ_{k=1}^{K} λk (fω^k(x∗) − fω^k(x))    (by linearity)
              ≤ Σ_{k=1}^{K} λk (fω^k(x∗) − fω^k(x^k))   (since x^k is fω^k-optimal)
              ≤ Σ_{k=1}^{K} λk × δ                      (since fω^k(x∗) − fω^k(x^k) ≤ δ)
              = δ × Σ_{k=1}^{K} λk
              = δ.
Algorithm 1. IEEP
IN ↓ P : a MOCO problem; δ: a threshold; fω : a scalarizing function with unknown
parameters ω; Θ: a set of preference statements.
OUT ↑: a solution x∗ with a max regret smaller than δ.
- -| Initialization of the convex polyhedron:
ΩΘ ← {ω : ∀(u, v) ∈ Θ, fω (u) ≤ fω (v)}
- -| Generation of the extreme points of the polyhedron:
EPΘ ←ExtremePoints(ΩΘ )
- -| Generation of the optimal solutions attached to EPΘ :
XΘ ← Optimizing(P ,EPΘ )
while MMR(XΘ, ΩΘ) > δ do
- -| Selection of two solutions to compare:
(x, x′) ← Select(XΘ)
- -| Question:
query(x, x′)
- -| Update preference information:
if x is preferred to x′ then
Θ ← Θ ∪ {(y(x), y(x′))}
else
Θ ← Θ ∪ {(y(x′), y(x))}
end
ΩΘ ← {ω : ∀(u, v) ∈ Θ, fω(u) ≤ fω(v)}
- -| Generation of the extreme points of the polyhedron:
EPΘ ← ExtremePoints(ΩΘ)
- -| Generation of the optimal solutions attached to EPΘ:
XΘ ← Optimizing(P, EPΘ)
end
return a solution x∗ ∈ XΘ minimizing MR(x, XΘ, ΩΘ)
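Since fω is linear in ω, every pairwise max regret is attained at an extreme point of ΩΘ, so once EPΘ is enumerated the regret computations reduce to simple scans. The sketch below illustrates this under the assumption that fω is a weighted sum to be minimized and that solutions are given by their cost vectors (an illustration, not the authors' implementation):

```python
def f(omega, y):
    # weighted-sum scalarization of cost vector y
    return sum(w * yi for w, yi in zip(omega, y))

def pmr(x, x2, extreme_points):
    # worst-case regret of keeping x instead of x2 over all admissible omega;
    # for linear f the maximum is reached at an extreme point of the polyhedron
    return max(f(w, x) - f(w, x2) for w in extreme_points)

def mr(x, X, extreme_points):
    return max(pmr(x, x2, extreme_points) for x2 in X)

def mmr(X, extreme_points):
    # the IEEP loop runs while this value exceeds the threshold delta
    return min(mr(x, X, extreme_points) for x in X)
```

With EPΘ = {(1, 0), (0, 1)} and the points (1, 3), (3, 1), (2, 2), the balanced point (2, 2) achieves the minimax regret of 1, while the two extreme points each have max regret 2.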
Example 1. Consider the multicriteria spanning tree problem with 5 nodes and
7 edges given in Fig. 1. Each edge is evaluated with respect to 3 criteria. Assume
that the DM’s preferences can be represented by a weighted sum fω with
unknown parameters ω. Our goal is to determine an optimal spanning tree for the
DM (δ = 0), i.e. a connected acyclic sub-graph with 5 nodes that is fω -optimal.
We now apply algorithm IEEP on this instance, starting with an empty set of
preference statements (i.e. Θ = ∅).
Initialization: As Θ = ∅, ΩΘ is initialized to the set of all weighting vectors
ω = (ω1 , ω2 , ω3 ) ∈ [0, 1]3 such that ω1 + ω2 + ω3 = 1. In Fig. 2, ΩΘ is represented
by the triangle ABC in the space (ω1 , ω2 ); value ω3 is implicitly defined by
ω3 = 1 − ω1 − ω2 . Hence the initial extreme points are the vectors of the natural
basis of the Euclidean space, corresponding to Pareto dominance [29]; in other
words, we have EPΘ = {ω 1 , ω 2 , ω 3 } with ω 1 = (1, 0, 0), ω 2 = (0, 1, 0) and
ω 3 = (0, 0, 1). We then optimize according to all weighting vectors in EPΘ using
Prim algorithm [22], and we obtain the following three solutions: for ω 1 , we
have a spanning tree x1 evaluated by y(x1 ) = (15, 17, 14); for ω 2 , we obtain a
spanning tree x2 with y(x2 ) = (23, 8, 16); for ω 3 , we find a spanning tree x3 such
that y(x3 ) = (17, 16, 11). Hence we have XΘ = {x1 , x2 , x3 }.
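The per-extreme-point optimization is plain single-objective MST: scalarize each edge's cost vector with ω, then run any MST algorithm (the paper uses Prim [22]; the sketch below uses Kruskal with union-find for brevity, and is an illustration rather than the authors' code):

```python
def weighted_mst_cost(n, edges, omega):
    # edges: list of (u, v, cost_vector); returns the total scalarized
    # cost of a minimum spanning tree under the weighted sum f_omega
    scal = lambda cv: sum(w * c for w, c in zip(omega, cv))
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    total = 0.0
    for u, v, cv in sorted(edges, key=lambda e: scal(e[2])):
        ru, rv = find(u), find(v)
        if ru != rv:            # edge joins two components: keep it
            parent[ru] = rv
            total += scal(cv)
    return total
```

Running this once per extreme point of ΩΘ yields the set XΘ of candidate trees used in the example.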
[Fig. 1. A graph with 5 nodes and 7 edges; each edge is labelled with a 3-criteria cost vector: (8,1,1), (3,4,7), (7,7,7), (2,2,2), (4,9,1), (7,3,9), (6,2,4).]
update Θ by inserting the preference statement ((19, 9, 14), (17, 16, 11)) and
we update ΩΘ by imposing the following additional constraint: fω (19, 9, 14) ≤
fω (17, 16, 11) (see Fig. 5); the corresponding extreme points are given by EPΘ =
{(0.18, 0.28, 0.54), (0, 0.3, 0.7), (0, 0.38, 0.62), (0.43, 0.42, 0.15)}. Now the set XΘ
only includes one spanning tree x1 and y(x1 ) = (19, 9, 14). Finally, the algorithm
stops (since we have M M R(XΘ , ΩΘ ) = 0 ≤ δ = 0) and it returns solution x1
(which is guaranteed to be the optimal solution for the DM).
[Figs. 2–5. Evolution of the weight polytope ΩΘ in the (ω1, ω2) space as preference statements are added. Fig. 2. Initial set. Fig. 3. After step 1. Fig. 4. After step 2. Fig. 5. After step 3.]
5 Experimental Results
2 https://polymake.org.
230 N. Benabbou and T. Lust
Table 1. MST: comparison of the different query strategies (best values in bold).
Running Time and Number of Evaluations. We observe that the Random and Max-Dist strategies are much faster than the CSS strategy; for instance, for n = 6 and |V| = 100, the Random and Max-Dist strategies finish in under one minute, whereas CSS needs almost a minute and a half. Note that time is mostly consumed by the generation of extreme points, given that the evaluations are performed by Prim algorithm, which is very efficient. Since the number of evaluations with CSS drastically increases with the size of the problem, we may expect the performance gap between CSS and the two other strategies to be much larger for MOCO problems with a less efficient solving method.
Number of Generated Preference Queries. We can see that Max-Dist is the best
strategy for minimizing the number of generated preference queries. More pre-
cisely, for all instances, the preferred solution is detected with less than 40 queries
and the optimality is established after at most 50 queries. In fact, we can reduce the number of preference queries even further by considering a strictly positive tolerance threshold; to give an example, if we set δ = 0.1 (i.e. 10% of the "maximum" error computed using the ideal point and the worst objective vector), then our algorithm combined with the Max-Dist strategy generates at most 20 queries in all considered instances. In Table 1, we also observe that the CSS strategy generates many more queries than Random, which is quite surprising since the CSS strategy is intensively used in incremental elicitation (e.g., [4,7]). To better understand this result, we have plotted the evolution of the minimax regret with respect to the number of queries for the biggest instance in our set (|V| = 100, n = 6). We have divided the figure into two parts: the first part is when the number of queries is between 1 and 20 and the other part is when the number of queries is between 20 and 50 (see Fig. 6). In the first figure, we observe that there is almost no
difference between the three strategies, and the minimax regret is already close
to 0 after only 20 questions (showing that we are very close to the optimum
relatively quickly). However, there is a significant difference between the three
strategies in the second figure: the minimax regret with CSS starts to reduce
Fig. 6. MST problem with n = 6 and |V | = 100: evolution of the minimax regret
between 1 and 20 queries (left) and between 21 and 50 queries (right).
less quickly after 30 queries, remaining strictly positive after 50 queries, whereas the optimal solution is found after about 40 queries with the other strategies. Thus, queries generated with CSS gradually become less informative than those generated by the two other strategies. This can be explained as follows: CSS always selects the minimax-regret-optimal solution and one of its worst adversaries. Therefore, when the minimax-regret-optimal solution does not change after asking a query, the same solution is used for the next preference query. This can be less informative than asking the DM to compare two solutions for which we have no preference information at all; the Random and Max-Dist strategies select the two solutions to compare in a more diverse way.
Table 2. MST: comparison between IEEP and IE-Prim (best values in bold).
construction of the solutions is less informative than asking queries using the
extreme points of the polyhedron representing the preference uncertainty.
Now we want to estimate the performance of our algorithm seen as an anytime algorithm (see Fig. 7). For each iteration step i, we compute the error obtained when deciding to return the solution that is optimal for the minimax regret criterion at step i (i.e., after i queries); this error is expressed here as a percentage from the optimal solution. For the sake of comparison, we also include the results obtained with IE-Prim. However, IE-Prim cannot be seen as an anytime algorithm since it is constructive. Therefore, to vary the number of queries, we used different tolerance thresholds: δ = 0.3, 0.2, 0.1, 0.05 and 0.01.
Fig. 7. MST problem with |V | = 100: Comparison of the errors with respect to the
number of queries for n = 3 (left) and for n = 6 (right).
In Fig. 7, we observe that the error drops relatively quickly for both proce-
dures. Note however that the error obtained with IE-Prim is smaller than with
IEEP when the number of queries is very low. This may suggest to favor IE-Prim
over IEEP whenever the interactions are very limited and time is not an issue.
We now provide numerical results for the multicriteria traveling salesman problem (MTSP). In our tests, we consider existing Euclidean instances of the MTSP with 50 and 100 cities, and n = 2 to 6 objectives (see footnote 4). Moreover, we use the exact solver Concorde (see footnote 5) to perform the optimization part of the IEEP algorithm (see procedure Optimizing). Contrary to the MST, there exists no interactive constructive algorithm to solve the MTSP. Therefore, we only provide the results obtained by our algorithm IEEP with the three proposed query generation strategies (namely Random, Max-Dist and CSS). The results, obtained by averaging over 30 runs, are given in Table 3 for δ = 0.
In this table, we see that Max-Dist remains the best strategy for minimizing
the number of generated preference queries. Note that the running times are
much higher for the MTSP than for the MST (see Table 1), as the traveling
salesman problem is much more difficult to solve exactly with known preferences.
Table 3. MTSP: comparison of the different query strategies (best values in bold)
4 https://eden.dei.uc.pt/~paquete/tsp/.
5 http://www.math.uwaterloo.ca/tsp/concorde.
References
1. Benabbou, N., Di Sabatino Di Diodoro, S., Perny, P., Viappiani, P.: Incremental
preference elicitation in multi-attribute domains for choice and ranking with the
Borda count. In: Schockaert, S., Senellart, P. (eds.) SUM 2016. LNCS (LNAI),
vol. 9858, pp. 81–95. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45856-4_6
2. Benabbou, N., Perny, P.: On possibly optimal tradeoffs in multicriteria spanning
tree problems. In: Walsh, T. (ed.) ADT 2015. LNCS (LNAI), vol. 9346, pp. 322–
337. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23114-3_20
3. Benabbou, N., Perny, P.: Solving multi-agent knapsack problems using incremental
approval voting. In: Proceedings of ECAI 2016, pp. 1318–1326 (2016)
4. Benabbou, N., Perny, P.: Interactive resolution of multiobjective combinatorial
optimization problems by incremental elicitation of criteria weights. EURO J.
Decis. Process. 6(3–4), 283–319 (2018)
5. Benabbou, N., Perny, P., Viappiani, P.: Incremental elicitation of Choquet capaci-
ties for multicriteria choice, ranking and sorting problems. Artif. Intell. 246, 152–
180 (2017)
6. Bourdache, N., Perny, P.: Active preference elicitation based on generalized Gini
functions: application to the multiagent knapsack problem. In: Proceedings of
AAAI 2019 (2019)
7. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Constraint-based opti-
mization and utility elicitation using the minimax decision criterion. Artif. Intell.
170(8–9), 686–713 (2006)
8. Choquet, G.: Theory of capacities. Annales de l’Institut Fourier 5, 31–295 (1953)
9. Drummond, J., Boutilier, C.: Preference elicitation and interview minimization in
stable matchings. In: Proceedings of AAAI 2014, pp. 645–653 (2014)
10. Dyer, M., Proll, L.: An algorithm for determining all extreme points of a convex polytope. Math. Program. 12, 81–96 (1977)
11. Gilbert, H., Spanjaard, O., Viappiani, P., Weng, P.: Reducing the number of queries
in interactive value iteration. In: Walsh, T. (ed.) ADT 2015. LNCS (LNAI), vol.
9346, pp. 139–152. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23114-3_9
12. Grabisch, M., Labreuche, C.: A decade of application of the Choquet and Sugeno
integrals in multi-criteria decision aid. Ann. Oper. Res. 175(1), 247–286 (2010)
13. Hamacher, H., Ruhe, G.: On spanning tree problems with multiple objectives. Ann.
Oper. Res. 52, 209–230 (1994)
14. Kaddani, S., Vanderpooten, D., Vanpeperstraete, J.M., Aissi, H.: Weighted sum
model with partial preference information: application to multi-objective optimiza-
tion. Eur. J. Oper. Res. 260, 665–679 (2017)
15. Karasakal, E., Köksalan, M.: Generating a representative subset of the nondomi-
nated frontier in multiple criteria. Oper. Res. 57(1), 187–199 (2009)
16. Korhonen, P.: Interactive methods. In: Figueira, J., Greco, S., Ehrogott, M. (eds.)
Multiple Criteria Decision Analysis: State of the Art Surveys. ISOR, vol. 78, pp.
641–661. Springer, New York (2005). https://doi.org/10.1007/0-387-23081-5_16
17. Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling sales-
man problem. Proc. Am. Math. Soc. 7, 48–50 (1956)
18. Laumanns, M., Thiele, L., Deb, K., Zitzler, E.: Combining convergence and diver-
sity in evolutionary multiobjective optimization. Evol. Comput. 10(3), 263–282
(2002)
19. Lust, T., Rolland, A.: Choquet optimal set in biobjective combinatorial optimiza-
tion. Comput. OR 40(10), 2260–2269 (2013)
20. Marinescu, R., Razak, A., Wilson, N.: Multi-objective constraint optimization with
tradeoffs. In: Schulte, C. (ed.) CP 2013. LNCS, vol. 8124, pp. 497–512. Springer,
Heidelberg (2013). https://doi.org/10.1007/978-3-642-40627-0_38
21. Marinescu, R., Razak, A., Wilson, N.: Multi-objective influence diagrams with
possibly optimal policies. In: Proceedings of AAAI 2017, pp. 3783–3789 (2017)
22. Prim, R.C.: Shortest connection networks and some generalizations. Bell Syst.
Tech. J. 36, 1389–1401 (1957)
23. White III, C.C., Sage, A.P., Dozono, S.: A model of multiattribute decisionmaking
and trade-off weight determination under uncertainty. IEEE Trans. Syst. Man
Cybern. 14(2), 223–229 (1984)
24. Regan, K., Boutilier, C.: Eliciting additive reward functions for Markov decision
processes. In: Proceedings of IJCAI 2011, pp. 2159–2164 (2011)
25. Rubinstein, R.: Generating random vectors uniformly distributed inside and on
the surface of different regions. Eur. J. Oper. Res. 10(2), 205–209 (1982)
26. Schrijver, A.: Combinatorial Optimization - Polyhedra and Efficiency. Springer,
Heidelberg (2003)
27. Wang, T., Boutilier, C.: Incremental utility elicitation with the minimax regret decision criterion. In: Proceedings of IJCAI 2003, pp. 309–316 (2003)
28. Weng, P., Zanuttini, B.: Interactive value iteration for Markov decision processes
with unknown rewards. In: Proceedings of IJCAI 2013, pp. 2415–2421 (2013)
29. Wiecek, M.M.: Advances in cone-based preference modeling for decision making
with multiple criteria. Decis. Making Manuf. Serv. 1(1–2), 153–173 (2007)
30. Wilson, N., Razak, A., Marinescu, R.: Computing possibly optimal solutions for
multi-objective constraint optimisation with tradeoffs. In: Proceedings of IJCAI
2015, pp. 815–822 (2015)
Open-Mindedness of Gradual
Argumentation Semantics
Nico Potyka
1 Introduction
The basic idea of abstract argumentation is to study the acceptability of argu-
ments abstracted from their content, just based on their relationships [13]. While
arguments can only be accepted or rejected under classical semantics, gradual
argumentation semantics consider a more fine-grained scale between these two
extremes [3,6–8,10,16,20,22]. Arguments may have a base score that reflects a
degree of belief that the argument is accepted when considered independent of
all the other arguments. Semantics then assign strength values to all arguments
based on their relationships and the base score if provided.
Of course, strength values should not be assigned in an arbitrary manner, but
should satisfy some common-sense properties. Baroni, Rago and Toni recently
showed that 29 properties from the literature can be reduced to basically two
fundamental properties called Balance and Monotonicity [8] that we will discuss
later. Balance and Monotonicity already capture a great deal of what we should
expect from strength values of arguments, but they do not (and do not attempt
to) capture everything. One desideratum that may be missing in many applications
is Open-Mindedness. To illustrate the idea, suppose that we evaluate arguments
by strength values between 0 and 1, where 0 means that we fully reject and 1
means that we fully accept an argument. Then, as we increase the number of
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 236–249, 2019.
https://doi.org/10.1007/978-3-030-35514-2_18
3 Open-Mindedness
Intuitively, it seems that Balance and Monotonicity could already imply Open-
Mindedness. After all, they demand that adding attacks (supports) increases
(decreases) the strength in a sense. However, this is not sufficient to guarantee
that the strength can be moved arbitrarily close to the boundary values. To
illustrate this, let us consider the Euler-based semantics that has been introduced
for the whole class of QBAFs in [4]. Strength values are defined by
σ(a) = 1 − (1 − β(a)²) / (1 + β(a) · exp(Σ_{(b,a)∈Sup} σ(b) − Σ_{(b,a)∈Att} σ(b)))
Note that if there are no attackers or supporters, the strength becomes just
1 − (1 − β(a)²)/(1 + β(a)) = β(a). If the strength of a's attackers accumulates to a
larger (smaller) value than the strength of a’s supporters, the strength will be
smaller (larger) than the base score. The Euler-based semantics satisfies the
basic Balance and Monotonicity properties in most cases, see [4] for more details.
However, it does not satisfy Open-Mindedness as has been noted in [21] already.
There are two reasons for this. The first reason is somewhat weak and regards
the boundary case β(a) = 0. In this case, the strength becomes 1 − (1 − 0²)/(1 + 0) = 0
independent of the supporters. In this boundary case, the Euler-based semantics
does not satisfy Balance and Monotonicity either. The second reason is more
profound and corresponds to the fact that the exponential function always yields
positive values. Therefore, 1 + β(a) · exp(x) ≥ 1 and σ(a) ≥ 1 − (1 − β(a)²)/1 = β(a)²
independent of the attackers. Hence, the strength value can never be smaller
than the base score squared. The reason that the Euler-based semantics can still
satisfy Balance and Monotonicity is that the limit β(a)2 can never actually be
taken, but is only approximated as the number of attackers goes to infinity.
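This bound is easy to verify numerically. The following Python sketch (illustrative, not part of the paper) implements the Euler-based update for a single argument whose attackers are unattacked arguments with maximal base score, i.e., strength 1:

```python
import math

def euler_strength(beta, sup_strengths=(), att_strengths=()):
    """Euler-based strength of one argument, given its base score beta and
    the strengths of its supporters and attackers (formula reproduced above)."""
    e = math.exp(sum(sup_strengths) - sum(att_strengths))
    return 1 - (1 - beta**2) / (1 + beta * e)

beta = 0.5
assert abs(euler_strength(beta) - beta) < 1e-12      # no attackers/supporters
for n in (1, 10, 100):
    s = euler_strength(beta, att_strengths=[1.0] * n)
    assert s >= beta**2                              # never below beta squared
```

No matter how many maximally strong attackers are added, the strength never falls below β(a)² = 0.25 in this example, which is exactly the failure of Open-Mindedness discussed above.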
Hence, Open-Mindedness is indeed a property that is currently not captured
by Balance and Monotonicity. To begin with, we give a formal definition for a
restricted case. We assume that larger values in D are stronger to avoid tedious
case differentiations. This assumption is satisfied by the first eight semantics
that we consider. We will give a more general definition later that also makes
sense when this assumption is not satisfied. Open-Mindedness includes two dual
conditions, one for attack- and one for support-relations. Intuitively, we want
that in every QBAF, the strength of every argument with arbitrary base score
can be moved arbitrarily close to min(D) (max(D)) if we only add a sufficient
number of strong attackers (supporters). In the following definition, ε captures
the closeness and N the sufficiently large number.
Definition 3 (Open-Mindedness). Consider a semantics that defines an
interpretation σ : A → D for every QBAF from a particular class F of QBAFs
over a compact interval D. We call the semantics open-minded if for every QBAF
(A, Att, Sup, β) in F, for every argument a ∈ A and for every ε > 0, the fol-
lowing condition is satisfied: there is an N ∈ ℕ such that when adding N new
arguments AN = {a1, . . . , aN}, A ∩ AN = ∅, with maximum base score, then
1. if F allows attacks, then for (A ∪ AN, Att ∪ {(ai, a) | 1 ≤ i ≤ N}, Sup, β′),
we have |σ(a) − min(D)| < ε, and
2. if F allows supports, then for (A ∪ AN, Att, Sup ∪ {(ai, a) | 1 ≤ i ≤ N}, β′),
we have |σ(a) − max(D)| < ε,
where β′(b) = β(b) for all b ∈ A and β′(ai) = max(D) for i = 1, . . . , N.
Some explanations are in order. Note that we do not make any assumptions
about the base score of a in Definition 3. Hence, we demand that the strength
of a must become arbitrarily small (large) within the domain D, no matter what
its base score is. One may consider a weaker notion of Open-Mindedness that
excludes the boundary base scores for a. However, this distinction does not make
a difference for our investigation and so we will not consider it here. Note also
that we do not demand that the strength of a ever takes the extreme value
max(D) (min(D)), but only that it can become arbitrarily close. Finally note
that item 1 in Definition 3 is trivially satisfied for support-only QBAFs, and item
2 for attack-only QBAFs.
The weighted max-based semantics from [6] can be seen as a variant of the
h-categorizer semantics that aggregates the strength of attackers by means of
the maximum instead of the sum. The strength of arguments is defined by
σ(a) = β(a) / (1 + max_{(b,a)∈Att} σ(b)).    (2)
When reordering terms in the denominator, we can see that the only difference to
the h-categorizer semantics is that every attacker b with non-zero strength adds
1 + σ(b) instead of just σ(b) to the sum in the denominator (attackers with strength
0 do not add anything anyway). This enforces a property called Cardinality
Precedence, which basically means that when arguments a1 and a2 have the
same base score and a1 has a larger number of non-rejected attackers (σ(b) > 0)
than a2 , then the strength of a1 must be smaller than the strength of a2 . The
strength values under the weighted card-based semantics can again be computed
iteratively [6]. Analogously to the weighted h-categorizer semantics, it can be
checked that the weighted card-based semantics satisfies Open-Mindedness.
Proposition 3. The weighted card-based semantics is open-minded.
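To see the contrast concretely, consider this Python sketch (not from the paper; the card-based denominator is reconstructed from the prose above, so its exact form is an assumption): under max-aggregation the strength of an argument with base score β is stuck at β/2 once a single maximal attacker is present, whereas the card-based aggregation drives it towards 0 as attackers accumulate.

```python
def max_based(beta, att):
    """Weighted max-based strength, Eq. (2)."""
    return beta / (1 + max(att, default=0))

def card_based(beta, att):
    """Weighted card-based strength as described in the text: every attacker b
    with non-zero strength contributes 1 + sigma(b) to the denominator sum
    (reconstructed from the prose, hence an assumption)."""
    return beta / (1 + sum(1 + s for s in att if s > 0))

beta = 0.8
for n in (1, 10, 1000):
    assert max_based(beta, [1.0] * n) == beta / 2             # bounded away from 0
    assert card_based(beta, [1.0] * n) <= beta / (1 + 2 * n)  # vanishes as n grows
```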
where S(a) is an aggregate of the strength of a’s supporters. Therefore, the ques-
tion whether Open-Mindedness is satisfied boils down to the question whether
S(a) converges to 1 as we keep adding supporters.
The top-based semantics from [3] defines the strength of arguments by
The strength values can again be computed iteratively [6]. It is easy to check
that the aggregation-based semantics is open-minded. Just note that the fraction
in (5) has the form N/(1 + N) and therefore approaches 1 as N → ∞. Therefore, the
strength of an argument will go to 1 as we keep adding supporters under the
aggregation-based semantics.
Proposition 5. The aggregation-based semantics is open-minded.
The reward-based semantics from [3] is based on the idea of founded argu-
ments. An argument a is called founded if there exists a sequence of argu-
ments (a0 , . . . , an ) such that an = a, (ai−1 , ai ) ∈ Sup for i = 1, . . . , n and
β(a0 ) > 0. That is, a has non-zero base score or is supported by a sequence
of supporters such that the first argument in the sequence has a non-zero
base score. Intuitively, this implies that a must have non-zero strength. We let
Sup⁺ = {(a, b) ∈ Sup | a is founded} denote the founded supports. For every
a ∈ A, we let N(a) = |{b | (b, a) ∈ Sup⁺}| denote the number of founded supporters
of a and M(a) = (Σ_{(b,a)∈Sup⁺} σ(b)) / N(a) the mean strength of the founded
supporters. Then the strength of a is defined as

σ(a) = β(a) + (1 − β(a)) · (Σ_{i=1}^{N(a)−1} 1/2^i + M(a)/2^{N(a)}).    (6)
The strength values can again be computed iteratively [6]. As we show next, the
reward-based semantics also satisfies Open-Mindedness.
Note that this term already goes to 1 as the number of founded supporters
N(a) increases. We additionally add the non-negative term
M(a)/2^{N(a)} = (Σ_{(b,a)∈Sup⁺} σ(b)) / (N(a) · 2^{N(a)}),
which is bounded from above by 1/2^{N(a)}. Therefore, the factor
Σ_{i=1}^{N(a)−1} 1/2^i + M(a)/2^{N(a)} is always between 0 and 1 and
approaches 1 as N(a) → ∞.
To complete the proof, consider any support-only QBAF (A, ∅, Sup, β), any
argument a ∈ A, any ε > 0 and let (A ∪ {a1, . . . , aN}, Att, Sup ∪ {(ai, a) | 1 ≤ i ≤
N}, β′) be the QBAF defined in Definition 3 for some N ∈ ℕ. Note that every
argument in {a1, . . . , aN} is a founded supporter of a. Therefore, N(a) ≥ N and
σ(a) → β(a) + (1 − β(a)) = 1 as N → ∞. This then implies that there exists an
N0 ∈ ℕ such that |σ(a) − 1| < ε.
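The convergence argument can be replayed numerically. A small Python sketch (illustrative, not part of the paper) of Eq. (6) for an argument with N maximally strong founded supporters:

```python
def reward_based(beta, founded_sup):
    """Reward-based strength, Eq. (6); founded_sup lists the strengths of the
    founded supporters of the argument."""
    n = len(founded_sup)
    if n == 0:
        return beta
    m = sum(founded_sup) / n                        # mean strength M(a)
    factor = sum(2.0 ** -i for i in range(1, n)) + m / 2.0 ** n
    return beta + (1 - beta) * factor

for n in (1, 5, 50):
    s = reward_based(0.1, [1.0] * n)
    assert 0 <= s <= 1
assert reward_based(0.1, [1.0] * 50) > 0.999        # approaches 1 with many supporters
```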
supporters, so that σ(b′′) = 0, σ(b′) = (0 − (−1))/2 = 1/2 and
σ(a) ≥ (1/2 − max_{(b,a)∈Att} σ(b)) / 2.
Since the maximum of the attackers can never become larger than 1, we have
σ(a) ≥ (1/2 − 1)/2 = −1/4, no matter how many attackers we add. Thus, the first
condition of Open-Mindedness is violated. Using a symmetrical example, we can show
that the second condition can be violated as well.
Proposition 7. The loc-max semantics is not open-minded. It is open-minded
when restricting to attack-only QBAFs without base score or to support-only
QBAFs without base score.
Following [8], we call the second semantics from [7], the loc-sum semantics.
It defines strength values by the formula
σ(a) = 1 / (1 + Σ_{(b,a)∈Att} (σ(b)+1)/2) − 1 / (1 + Σ_{(b,a)∈Sup} (σ(b)+1)/2)    (8)
Note that if there are neither attackers nor supporters, then both fractions are
1 such that their difference is just 0. As we keep adding attackers (supporters),
the first (second) fraction goes to 0. It follows again that the loc-sum semantics
is open-minded for attack-only QBAFs without base score and for support-only
QBAFs without base score. However, it is again not open-minded for bipolar
QBAFs without base score. For example, if a has a single supporter b′ that has
neither attackers nor supporters, then σ(b′) = 0 and the second fraction evaluates
to 1/(1 + 1/2) = 2/3. As we keep adding attackers, the first fraction will go to 0 so
that the strength of a will converge to −2/3 rather than to −1 as the first condition of
Open-Mindedness demands. It is again easy to construct a symmetrical example
to show that the second condition of Open-Mindedness can be violated as well.
Proposition 8. The loc-sum semantics is not open-minded. It is open-minded
when restricting to attack-only QBAFs without base score or to support-only
QBAFs without base score.
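The counterexample above can be checked with a few lines of Python (illustrative, not part of the paper), instantiating Eq. (8) for an argument with one supporter of strength 0 and a growing number of maximally strong attackers:

```python
def loc_sum(att, sup):
    """loc-sum strength, Eq. (8), from attacker and supporter strength lists."""
    f_att = 1 / (1 + sum((s + 1) / 2 for s in att))
    f_sup = 1 / (1 + sum((s + 1) / 2 for s in sup))
    return f_att - f_sup

assert abs(loc_sum([], [0.0]) - 1 / 3) < 1e-12      # 1 - 2/3 with a single supporter
strengths = [loc_sum([1.0] * n, [0.0]) for n in (10, 100, 1000)]
assert all(s > -2 / 3 for s in strengths)           # converges to -2/3, never to -1
```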
We now consider the general form of QBAFs as introduced in [8]. The domain
D = (S, ) is now an arbitrary set along with a preorder , that is, a reflexive
and transitive relation over S. We further assume that there is an infimum inf(S)
and a supremum sup(S) that may or may not be contained in S. For example,
the open interval (0, ∞) contains neither its infimum 0 nor its supremum ∞,
whereas the half-open interval [0, ∞) contains its infimum, but not its supremum.
α is called the burden-parameter and can be used to modify the semantics, see [5]
for more details about the influence of α. For α ∈ [1, ∞) ∪ {∞}, (9) is equivalent
to arranging the reciprocals of the strength values of all attackers in a vector v
and taking the p-norm ‖v‖_p = (Σ_i v_i^p)^{1/p} of this vector with respect to p = α
and adding 1. Popular examples of p-norms are the Manhattan-, Euclidean- and
Maximum-norm that are obtained for p = 1, p = 2 and the limit-case p = ∞,
respectively. An unattacked argument has just strength 1 under the α-burden-
semantics. Hence, when adding N new attackers to a, we have σ(a) ≥ 1 + N^{1/α} for
α ∈ [1, ∞). Thus, the α-burden-semantics is clearly open-minded in this case,
even though it becomes more conservative as α increases. For the limit case
α = ∞, however, it is not open-minded. This can be seen from the observation
that the second term in (9) now corresponds to the maximum norm. Since the
strength of each attacker is in [1, ∞), their reciprocals are in (0, 1]. Therefore,
σ(a) ≤ 2 independent of the number of attackers of a.
Proposition 9. The α-burden-semantics is open-minded for α ∈ [1, ∞), but is
not open-minded for α = ∞.
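A Python sketch (illustrative; the closed form for N unattacked attackers follows from the discussion above) contrasting finite α with the limit case:

```python
def burden(att, alpha):
    """alpha-burden strength: 1 plus the p-norm (p = alpha) of the reciprocals
    of the attackers' strengths; alpha = inf gives the maximum norm."""
    recips = [1 / s for s in att]
    if alpha == float("inf"):
        return 1 + (max(recips) if recips else 0)
    return 1 + sum(r ** alpha for r in recips) ** (1 / alpha)

# N unattacked attackers, each of strength 1
for n in (1, 100, 10000):
    assert burden([1.0] * n, 2) == 1 + n ** 0.5        # unbounded: open-minded
    assert burden([1.0] * n, float("inf")) == 2.0      # capped at 2: not open-minded
```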
5 Related Work
Gradual argumentation has become a very active research area and found appli-
cations in areas like information retrieval [24], decision support [9,22] and social
media analysis [1,12,16]. Our selection of semantics followed the selection in [8].
One difference is that we did not consider social abstract argumentation [16]
here. The reason is that social abstract argumentation has been formulated in a
very abstract form, which makes it difficult to formulate interesting conditions
under which Open-Mindedness is guaranteed. Instead, we added the α-burden-
semantics from [5] because it gives a nice example for a more general semantics
that neither uses strength values from a compact interval nor regards larger
values as stronger.
The authors in [8] also view ranking-based semantics [11] as gradual argu-
mentation frameworks. In their most general form, ranking-based semantics just
order arguments qualitatively, so that our notion of Open-Mindedness is not very
meaningful. A variant that may be interesting, though, demands that in every
argumentation graph, every argument can become first or last in the order if
only a sufficient number of supporters or attackers is added to this argument.
However, in many cases, this notion of Open-Mindedness may be entailed by
other properties already. For example, Cardinality Precedence [11] states that if
argument a1 has more attackers than a2 , then a1 must be weaker than a2 . In
finite argumentation graphs, this already implies that a1 will be last in the order
if we add a sufficient number of attackers.
There are other quantitative argumentation frameworks like probabilis-
tic argumentation frameworks [14,15,17,19,23]. In this area, Open-Mindedness
would simply state that the probability of an argument must go to 0 (1) as we
keep adding attackers (supporters). It may be interesting to perform a similar
analysis for probabilistic argumentation frameworks.
An operational definition of Open-Mindedness for the class of modular seman-
tics [18] for weighted bipolar argumentation frameworks has been given in [21].
The DF-QuAD semantics [22] and the quadratic energy semantics [20] satisfy
this notion of open-mindedness [21]. However, in the case of DF-QuAD and some
other semantics, this is actually counterintuitive because they cannot move the
strength of an argument towards 0 if there is a supporter with non-zero strength.
Indeed, DF-QuAD does not satisfy Open-Mindedness as defined here (every
QBAF with a non-zero strength supporter provides a counterexample). How-
ever, the quadratic energy model from [21] still satisfies the more restrictive
definition of Open-Mindedness that we considered here.
Another interesting property for bipolar QBAFs that is not captured by
Balance and Monotonicity is Duality [20]. Duality basically states that attack
and support should behave in a symmetrical manner. Roughly speaking, when
we convert an attack relation into a support relation or vice versa, the effect
of the relation should just be inverted. Duality is satisfied by the DF-QuAD
semantics [22] and the quadratic energy semantics [20], but not by the Euler-
based semantics [4]. A formal analysis can be found in [20,21].
6 Conclusions
moment. Some well-defined semantics for general QBAFs have been presented
recently in [18], but they are not open-minded. I am indeed unaware of any
semantics for general QBAFs that is generally well-defined and open-minded.
It is actually possible to define for every k ∈ N, an open-minded semantics
that is well-defined for all QBAFs where arguments have at most k parents.
One example is the 1-max(k) semantics, see Corollary 3.5 in [21]. However,
as k grows, these semantics become more and more conservative even though
they remain open-minded. More precisely, every single argument can change the
strength value of another argument by at most k1 , so that at least k arguments
are required to move the strength all the way from 0 to 1 and vice versa. A
better way to improve convergence guarantees may be to define strength values
not by discrete iterative procedures, but to replace them with continuous pro-
cedures that maintain the strength values in the limit, but improve convergence
guarantees [20,21]. However, while I find this approach promising, I admit that
it requires further analysis.
In conclusion, I think that Open-Mindedness is an interesting property that
is important for many applications. It is indeed satisfied by many semantics from
the literature. For others, like the weighted max-based semantics, we may be able
to adapt the definition. One interesting open question is whether we can define
semantics for general QBAFs that are generally well-defined and open-minded.
References
1. Alsinet, T., Argelich, J., Béjar, R., Fernández, C., Mateu, C., Planes, J.: Weighted
argumentation for analysis of discussions in Twitter. Int. J. Approximate Reason-
ing 85, 21–35 (2017)
2. Amgoud, L., Ben-Naim, J.: Axiomatic foundations of acceptability semantics. In:
International Conference on Principles of Knowledge Representation and Reason-
ing (KR), pp. 2–11 (2016)
3. Amgoud, L., Ben-Naim, J.: Evaluation of arguments from support relations:
axioms and semantics. In: International Joint Conferences on Artificial Intelligence
(IJCAI), p. 900 (2016)
4. Amgoud, L., Ben-Naim, J.: Evaluation of arguments in weighted bipolar graphs.
In: Antonucci, A., Cholvy, L., Papini, O. (eds.) ECSQARU 2017. LNCS (LNAI),
vol. 10369, pp. 25–35. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-61581-3_3
5. Amgoud, L., Ben-Naim, J., Doder, D., Vesic, S.: Ranking arguments with
compensation-based semantics. In: International Conference on Principles of
Knowledge Representation and Reasoning (KR) (2016)
6. Amgoud, L., Ben-Naim, J., Doder, D., Vesic, S.: Acceptability semantics for
weighted argumentation frameworks. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 56–62 (2017)
7. Amgoud, L., Cayrol, C., Lagasquie-Schiex, M.C., Livet, P.: On bipolarity in argu-
mentation frameworks. Int. J. Intell. Syst. 23(10), 1062–1093 (2008)
8. Baroni, P., Rago, A., Toni, F.: How many properties do we need for gradual argu-
mentation? In: AAAI Conference on Artificial Intelligence (AAAI), pp. 1736–1743.
AAAI (2018)
9. Baroni, P., Romano, M., Toni, F., Aurisicchio, M., Bertanza, G.: An
argumentation-based approach for automatic evaluation of design debates. In:
Leite, J., Son, T.C., Torroni, P., van der Torre, L., Woltran, S. (eds.) CLIMA
2013. LNCS (LNAI), vol. 8143, pp. 340–356. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40624-9_21
10. Besnard, P., Hunter, A.: A logic-based theory of deductive arguments. Artif. Intell.
128(1–2), 203–235 (2001)
11. Bonzon, E., Delobelle, J., Konieczny, S., Maudet, N.: A comparative study of
ranking-based semantics for abstract argumentation. In: AAAI Conference on Arti-
ficial Intelligence (AAAI), pp. 914–920 (2016)
12. Cocarascu, O., Rago, A., Toni, F.: Extracting dialogical explanations for review
aggregations with argumentative dialogical agents. In: International Conference on
Autonomous Agents and MultiAgent Systems (AAMAS), pp. 1261–1269. Interna-
tional Foundation for Autonomous Agents and Multiagent Systems (2019)
13. Dung, P.M.: On the acceptability of arguments and its fundamental role in non-
monotonic reasoning, logic programming and n-person games. Artif. Intell. 77(2),
321–357 (1995)
14. Hunter, A., Polberg, S., Potyka, N.: Updating belief in arguments in epistemic
graphs. In: International Conference on Principles of Knowledge Representation
and Reasoning (KR), pp. 138–147 (2018)
15. Hunter, A., Thimm, M.: Probabilistic reasoning with abstract argumentation
frameworks. J. Artif. Intell. Res. 59, 565–611 (2017)
16. Leite, J., Martins, J.: Social abstract argumentation. In: International Joint Con-
ferences on Artificial Intelligence (IJCAI), vol. 11, pp. 2287–2292 (2011)
17. Li, H., Oren, N., Norman, T.J.: Probabilistic argumentation frameworks. In: Mod-
gil, S., Oren, N., Toni, F. (eds.) TAFA 2011. LNCS (LNAI), vol. 7132, pp. 1–16.
Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-29184-5_1
18. Mossakowski, T., Neuhaus, F.: Modular semantics and characteristics for bipolar
weighted argumentation graphs. arXiv preprint arXiv:1807.06685 (2018)
19. Polberg, S., Doder, D.: Probabilistic abstract dialectical frameworks. In: Fermé,
E., Leite, J. (eds.) JELIA 2014. LNCS (LNAI), vol. 8761, pp. 591–599. Springer,
Cham (2014). https://doi.org/10.1007/978-3-319-11558-0_42
20. Potyka, N.: Continuous dynamical systems for weighted bipolar argumentation. In:
International Conference on Principles of Knowledge Representation and Reason-
ing (KR), pp. 148–157 (2018)
21. Potyka, N.: Extending modular semantics for bipolar weighted argumentation.
In: International Conference on Autonomous Agents and MultiAgent Systems
(AAMAS), pp. 1722–1730. International Foundation for Autonomous Agents and
Multiagent Systems (2019)
22. Rago, A., Toni, F., Aurisicchio, M., Baroni, P.: Discontinuity-free decision support
with quantitative argumentation debates. In: International Conference on Princi-
ples of Knowledge Representation and Reasoning (KR), pp. 63–73 (2016)
23. Rienstra, T., Thimm, M., Liao, B., van der Torre, L.: Probabilistic abstract argu-
mentation based on SCC decomposability. In: International Conference on Princi-
ples of Knowledge Representation and Reasoning (KR), pp. 168–177 (2018)
24. Thiel, M., Ludwig, P., Mossakowski, T., Neuhaus, F., Nürnberger, A.: Web-
retrieval supported argument space exploration. In: ACM SIGIR Conference on
Human Information Interaction and Retrieval (CHIIR), pp. 309–312. ACM (2017)
Approximate Querying on Property
Graphs
1 Introduction
A tremendous amount of information stored in the LOD can be inspected by
leveraging the already mature query capabilities of SPARQL, relational, and
graph databases [14]. However, arbitrarily complex queries [2,3,7], entailing
rather intricate, possibly recursive, graph patterns prove difficult to evaluate,
even on small-sized graph datasets [4,5]. On the other hand, the usage of these
queries has radically increased in real-world query logs, as shown by recent empir-
ical studies on SPARQL queries from large-scale Wikidata and DBPedia corpuses
[8,17]. As a tangible example of this growth, the percentage of SPARQL property
paths in user-specified Wikidata queries has increased from 15% to 40% between
2017 and early 2018 [17]. In this paper, we focus on regular path queries
(RPQs) that identify paths labeled with regular expressions and aim to offer
an approximate query evaluation solution. In particular, we consider counting
queries with regular paths, which are a notable fragment of graph analytical
queries. The exact evaluation of counting queries on graphs is #P-complete
[21] and is based on a related result on the enumeration of simple graph paths.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 250–265, 2019.
https://doi.org/10.1007/978-3-030-35514-2_19
In Sect. 2, we revisit the property graph model and query language. We present
related work in Sect. 6 and conclude the paper in Sect. 7.
[Figure: a sample property graph with Person (P1–P10), Message (M1–M6), Reply (R1–R7), and Forum (F1, F2) vertices. Edge-label legend: knows (l0), follows (l1), moderates (l2), contains (l3), authors (l4), replies (l5), reshares (l6). Example counting queries:
Q1(l5): Ans(count(_)) ← l5(_, _)
Q2(l2): Ans(count(_)) ← l2?(_, _)
Q3(l0): Ans(count(_)) ← l0+(_, _)
Q4(l0): Ans(count(_)) ← l0*(_, _)
Q5(l4, l1): Ans(count(_)) ← l4 + l1(_, _)
Q6(l4, l5): Ans(count(_)) ← l4⁻ · l5(_, _)
Q7(l4, l5): Ans(count(x)) ← l4⁻ · l5(x, _), ≥(x.age, 18), ≤(x.age, 24)]
2 Preliminaries
Graph Model. We take the property graph model (PGM) [7] as our founda-
tion. Graph instances are multi-edge digraphs; their objects are represented by
typed data vertices and their relationships by typed, labeled edges. Vertices
and edges can have any number of properties (key/value pairs). Let LV and LE
be disjoint sets of vertex (edge) labels and G = (V, E), with E ⊆ V × LE × V , a
graph instance. Vertices v ∈ V have an id label, lv , and a set of property labels
(attributes, li ), each with a (potentially undefined) term value. For e ∈ E, we
use the binary notation e = le (v1 , v2 ) and abbreviate v1 , as e.1, and v2 , as e.2.
We denote the number of occurrences of le , as #le , and the set of all edge labels
in G, as Λ(G). Other key notations henceforth used are given in Table 1.
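For concreteness, the notations above can be sketched over a toy instance in Python (illustrative only; the vertex and property names are made up):

```python
from collections import Counter

# Toy property graph: vertices with property maps, edges e = l_e(v1, v2).
V = {"P1": {"age": 20}, "P2": {"age": 30}, "M1": {}}
E = [("P1", "knows", "P2"), ("P1", "authors", "M1"), ("P2", "authors", "M1")]

label_counts = Counter(l for _, l, _ in E)   # the number of occurrences #l_e
labels = set(label_counts)                   # the edge-label set Lambda(G)
e = E[0]
assert (e[0], e[2]) == ("P1", "P2")          # e.1 and e.2 in the paper's notation
assert label_counts["authors"] == 2 and labels == {"knows", "authors"}
```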
Clauses C ::= A ← A1, . . . , An | Q ← A1, . . . , An
Queries Q ::= Ans(count(_)) | Ans(count(lv)) | Ans(count(lv1, lv2))
Atoms A ::= π(lv1, lv2) | op(lv1.li, lv2.lj) | op(lv1.li, k), op ∈ {<, ≤, >, ≥}, k ∈ ℝ
Paths π ::= ε | le | le? | le⁻ | le* | le1 · le2 | π + π
Graph Query Language. To query the above property graph model, we rely
on an RPQ [10,11] fragment with aggregate operators (see Fig. 3). RPQs cor-
respond to SPARQL 1.1 property paths and are a well-studied query class
tailored to express graph patterns of one or more label-constrained reachabil-
ity paths. For labels lei and vertices vi , the labeled path π, corresponding to
v1 −le1→ v2 . . . vk−1 −lek→ vk, is the concatenation le1 · . . . · lek. In their full gen-
erality, RPQs allow one to select vertices connected via such labeled paths in
a regular language over LE . We restrict RPQs to handle atomic paths – bi-
directional, optional, single-labeled (le , le ?, and le− ) and transitive single-labeled
(le∗ ) – and composite paths – conjunctive and disjunctive composition of atomic
paths (le · le and π + π). While not as general as SPARQL, our fragment already
captures more than 60% of the property paths found in practice in SPARQL
query logs [8]. Moreover, it captures property path queries, as found in the large
Wikidata corpus studied in [9]. Indeed, almost all the property paths in the con-
sidered logs contain Kleene-star expressions over single labels. In our work, we
enrich the above query classes with the count operator and support basic graph
reachability estimates.
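As an illustration of such counting queries over Kleene-star paths, the following Python sketch (not from the paper) evaluates Ans(count(_)) ← l+(_, _) exactly, via a Warshall-style transitive closure of the label-restricted edge relation:

```python
from itertools import product

def count_plus(edges, label, nodes):
    """Number of vertex pairs connected by a non-empty path of label-edges."""
    reach = {(u, v) for u, l, v in edges if l == label}
    for k, i, j in product(nodes, nodes, nodes):     # k varies slowest: Warshall
        if (i, k) in reach and (k, j) in reach:
            reach.add((i, j))
    return len(reach)

nodes = ["a", "b", "c"]
edges = [("a", "l0", "b"), ("b", "l0", "c"), ("a", "l4", "c")]
assert count_plus(edges, "l0", nodes) == 3           # (a,b), (b,c), (a,c)
```

The summaries of Sect. 3 are then used to estimate such counts without materializing the closure.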
3 Graph Summarization
We introduce a novel algorithm that summarizes any property graph into one
tailored for approximately counting reachability queries. The key idea is that,
as nodes and edges are compressed, informative properties are iteratively added
to the corresponding newly formed structures, to enable accurate estimations.
The grouping phase (Sect. 3.1) computes Φ, a label-driven G-partitioning
into subgroupings, following the connectivity on the most frequent labels in G. A
first summarization collapses the vertices and inner-edges of each subgrouping
into s-nodes and the edges connecting s-nodes, into s-edges. The merge phase
(Sect. 3.2), based on further label-reachability conditions, specified by a heuristic
mode m, collapses s-nodes into h-nodes and s-edges into h-edges.
Algorithm 1. GROUPING(G)
Input: G – a graph; Output: Φ – a graph partitioning
1: n ← |Λ(G)|, Λ⃗(G) ← [l1, . . . , ln], Φ ← ∅, i ← 1    ▷ Descending-frequency label list Λ⃗(G)
2: for all li ∈ Λ⃗(G) do    ▷ Label-driven partitioning computation
3:   Φ ← Φ ∪ {Gk* = (Vk*, Ek*) ⊆ G | λ(Gk*) = li}    ▷ Maximally li-connected subgraphs
4:   V ← V \ {v ∈ Vk* | k ∈ ℕ}    ▷ Discard already considered nodes
5:   i ← i + 1
6: Φ ← Φ ∪ {Gi = (Vi*, Ei*) ⊆ G | Vi* = V \ V*}    ▷ Collect remains in final subgroup
7: return Φ
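A simplified executable sketch of the grouping phase (illustrative; it approximates "maximally li-connected subgraphs" by undirected connected components on li-labeled edges, which is an assumption about the intended connectivity notion):

```python
from collections import Counter, defaultdict

def grouping(V, E):
    """Simplified GROUPING (Algorithm 1): peel off connected subgroups per
    edge label, labels taken in descending frequency order."""
    freq = Counter(l for _, l, _ in E)
    remaining, phi = set(V), []
    for label, _ in freq.most_common():              # descending-frequency labels
        adj = defaultdict(set)
        for u, l, w in E:
            if l == label and u in remaining and w in remaining:
                adj[u].add(w)
                adj[w].add(u)                        # undirected connectivity
        while adj:                                   # one component per subgroup
            group, todo = set(), [next(iter(adj))]
            while todo:
                v = todo.pop()
                if v not in group:
                    group.add(v)
                    todo.extend(adj.pop(v, ()))
            phi.append((label, group))
            remaining -= group
    if remaining:                                    # final "remains" subgroup
        phi.append((None, remaining))
    return phi

V = ["P1", "P2", "P3", "M1"]
E = [("P1", "l0", "P2"), ("P2", "l0", "P3"), ("P1", "l4", "M1")]
phi = grouping(V, E)
assert ("l0", {"P1", "P2", "P3"}) in phi and (None, {"M1"}) in phi
```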
[Fig. 4: (a) grouping phase — the sample graph is partitioned into label-driven subgroupings; (b) evaluation phase — the resulting s-nodes v1*–v9* and their s-edges; (c) source and target merge — s-nodes are further collapsed into h-nodes v̂1–v̂4.]
its number of compressed edges, e.g., in Fig. 4b, all s-edges have weight 1, except
e∗ (v4∗ , v1∗ ), with weight 2. To every s-node, v ∗ , we attach properties concerning:
(1) Compression. VWeight and EWeight store its number of inner vertices/edges.
(2) Inner-Connectivity. The percentage of its l-labeled inner edges is LPercent
and the number of its vertex pairs, connected with an l-labeled edge, is LReach.
These first two types of properties will be useful in Sect. 4, for estimating Kleene
paths, as the labels of inner-edges in s-nodes are not unique, e.g., both l0 and
l1 appear in v1∗ . (3) Outer-Connectivity. For pairs of labels and direction indices
with respect to v ∗ (d = 1, for incoming edges, and d = 2, for outgoing ones), we
compute cross-connectivity, CReach, as the number of binary cross-edge paths
that start/end in v ∗ . Analogously, we record that of binary traversal paths, i.e.,
formed of an inner v* edge and of a cross-edge, as TReach. Also, for a label l
and a given direction, we store, as VF, the number of frontier vertices on l, i.e.,
that of v* nodes at either endpoint of an l-labeled s-edge.
We can thus record traversal connectivity information, LPart, dividing the
number of traversal paths by that of the frontier vertices on the cross-edge label.
Intuitively, this is due to the fact that traversal connectivity, as opposed to cross-
connectivity, also needs to account for the "dispersion" of the inner-edge label
of the path, within the s-node it belongs to. For example, for a traversal path
lc · li , formed of a cross-edge, lc , and an inner one, li , not all frontier nodes lc
are endpoints of li labeled inner-edges, as we will see in the example below.
VWeight: v1* → 10; v2*, v3*, v5*, v6*, v7* → 2; v4* → 3; v8*, v9* → 1
EWeight: v1* → 14; v2*, v3*, v5*, v6*, v7* → 1; v4* → 3; v8*, v9* → 0
LReach: (v1*, l0) → 11; (v1*, l1) → 3
LPercent: (v1*, l0) → 79; (v1*, l1) → 21
We take as input the graph computed by Algorithm 1 and a label set, and output
a compressed graph, Ĝ = (V̂, Ê). During this phase, sets of h-nodes, V̂,
and h-edges, Ê, are created. At each step, as previously, Ĝ is enriched with
approximation-relevant precomputed properties (see Sect. 4).
Each h-node, v̂, merges all s-nodes, vi∗ , vj∗ ∈ V ∗ , that are maximally label
connected on the same label, i.e., λ(vi∗ ) = λ(vj∗ ), and that have either the
same set of incoming (source-merge) or outgoing (target-merge) edge labels, i.e.,
Λd (vi∗ ) = Λd (vj∗ ), d ∈ {1, 2} (see Algorithm 2). Each h-edge, ê, merges all s-edges
in E ∗ with the same label and orientation, i.e., e∗i .d = e∗j .d, for d ∈ {1, 2}.
Algorithm 2. MERGE(V ∗ , Λ, m)
Input: V ∗ – s-nodes; Λ – labels; m – heuristic mode; Output: V̂ – h-nodes
1: for all v* ∈ V* do
2:   Λd(v*) ← {l ∈ Λ | ∃e* = l(_, _) ∈ E* ∧ e*.d = v*}    ▷ Labels incoming/outgoing v*
3: for all v1*, v2* ∈ V* do    ▷ Pair-wise s-node inspection
4:   bλ ← (λ(v1*) = λ(v2*)), bd ← (Λd(v1*) = Λd(v2*)), d ∈ {1, 2}    ▷ Boolean conditions
5:   if m = true then v̂ ← {v1*, v2* | bλ ∧ b1 = true}    ▷ Target-merge
6:   else v̂ ← {v1*, v2* | bλ ∧ b2 = true}    ▷ Source-merge
7: V̂ ← {v̂k | k ∈ [1, |V*|]}    ▷ H-node computation
8: return V̂
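A compact Python sketch of the merge phase (illustrative; it keys h-nodes on the pair of inner label and d-directed edge-label set, which is our reading of the merge condition):

```python
from collections import defaultdict

def merge(s_nodes, s_edges, d=1):
    """Simplified MERGE (Algorithm 2): collapse s-nodes with the same inner
    label lambda and the same set of d-directed s-edge labels into h-nodes
    (d = 1: incoming, d = 2: outgoing)."""
    lab = defaultdict(set)                      # Lambda_d(v*) per s-node
    for v1, l, v2 in s_edges:
        lab[v2 if d == 1 else v1].add(l)
    groups = defaultdict(set)
    for v, lam in s_nodes.items():              # s_nodes maps v* to lambda(v*)
        groups[(lam, frozenset(lab[v]))].add(v)
    return list(groups.values())

s_nodes = {"v1": "l0", "v2": "l0", "v3": "l3"}
s_edges = [("v3", "l4", "v1"), ("v3", "l4", "v2")]
h_nodes = merge(s_nodes, s_edges, d=1)          # v1, v2 share label and in-labels
assert {"v1", "v2"} in h_nodes and {"v3"} in h_nodes
```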
A valid summary is thus obtained from G by collapsing vertices with the same
χΛ into h-nodes and edges with the same (depending on the heuristic, incoming
or outgoing) label into h-edges. We illustrate this below.
We study our summarization’s optimality, i.e., the size of the obtained com-
pressed graph, to graphs its tractability. Specifically, we investigate the follow-
ing MinSummary problem, to establish whether one can always minimize the
number of nodes of an input graph, when constructing its valid summary.
Problem 1 (Minimal Summary). Let MinSummary be the problem that, for a
graph G and an integer k ≥ 2, decides if there exists a label-driven partitioning
Φ of G, |Φ| ≤ k, such that χΛ is a valid summarization.
Each MinSummary h-node is thus intended to regroup as many nodes from
the original graph as possible, while ensuring these are connected by frequently
occurring labels. This condition (see Definition 2) reflects the central idea of our
framework, namely that the connectivity of such prominent labels can serve to
both compress a graph and to approximately evaluate label-constrained reacha-
bility queries. Next, we establish the difficulty of solving MinSummary.
⁴ Proof given at: http://web4.ensiie.fr/~stefania.dumbrava/SUM19_appx.pdf.
258 S. Dumbrava et al.
5 Experimental Analysis
Fig. 7. Datasets: no. of vertices |V|, edges |E|, vertex labels |LV|, and edge labels |LE|.
the size parameter and schema constraints, the resulting sizes vary (especially for
the very dense graphs social and shop). Next, on the same datasets, we generated
workloads of varying sizes, for each type in Sect. 2. These datasets and related
query workloads have been chosen since they provide the most recent benchmarks
for recursive graph queries and also to ensure a comparison with SumRDF [19]
(as shown next) on a subset of those supported by the latter. Studies [8,17] have
shown that practical graph pattern queries formulated by users in online query
endpoints are often small: 56.5% of real-life SPARQL queries consist of a single
edge (RDF triple), whereas 90.8% use 6 edges at most. Hence, we select small-
sized template queries with frequently occurring topologies, such as chains [8],
and formulate them on our datasets, for workloads of ∼600 queries.
Experiments ran on a cloud VM with Intel Xeon E312xx, 4 cores, 1.80 GHz
CPU, 128 GB RAM, and Ubuntu 16.04.4 64-bit. Each data point corresponds to
repeating an experiment 6 times, removing the first value from the average.
Summary Compression Ratios. First, we evaluate the effect that using the
source-merge and target-merge heuristics has on the summary construction time
(SCT). We also assess the compression ratio (CR) on the original graph's vertices and edges, by measuring (1 − |V̂|/|V|) ∗ 100 and, respectively, (1 − |Ê|/|E|) ∗ 100.
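The two ratios follow directly from the vertex and edge counts; a trivial helper (function and parameter names are ours):

```python
def compression_ratios(n_vertices, n_edges, n_hnodes, n_hedges):
    """Vertex and edge compression ratios (in %) of a summary graph:
    (1 - |V_hat|/|V|) * 100 and (1 - |E_hat|/|E|) * 100."""
    cr_v = (1 - n_hnodes / n_vertices) * 100
    cr_e = (1 - n_hedges / n_edges) * 100
    return cr_v, cr_e
```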
Next, we compare the results for source and target merge. In Fig. 8(a-d), the
most homogeneous datasets, bib and uniprot, achieve very high CR (close to
100%) and steadily maintain it with varying graph sizes. As heterogeneity significantly grows for shop and social, the CR becomes more sensitive to the dataset size, starting with low values for smaller graphs and stabilizing
between 85% and 90%, for larger ones. Notice also that the most heterogeneous
datasets, shop and social, although similar, display a symmetric behavior for the
vertex and edge CRs: the former better compresses vertices, while the latter,
edges. Concerning the SCT runtime in Fig. 8(e-f), all datasets maintain reasonable performance at larger sizes, even the most heterogeneous one, shop. The runtime is, in fact, not affected by heterogeneity, but is rather sensitive, for larger sizes, to |E| variations (up to 450K and 773K, for uniprot and social). Also, while
the source and target merge SCT runtimes are similar, the latter achieves better
CRs for social. Overall, the dataset with the worst CR for the two heuristics is
shop, with the lowest CR for smaller sizes. This is also due to the high number of
labels in the initial shop instances, and, hence, to the high number of properties
its summary needs: on average, for all considered sizes, 62.33 properties, against
17.67, for social graph, 10.0, for bib, and 14.0, for uniprot. These experiments
Fig. 8. CRs for vertices and edges, along with SCT runtime for various dataset sizes (1K–200K nodes), for both source-merge (a, c, e) and target-merge (b, d, f), on the Shop, Social, Bib, and Uniprot datasets.
show that, despite its high complexity, our summarization provides high CRs
and low SCT runtimes, even for large, heterogeneous graphs.
Approximate Evaluation Accuracy. We assess the accuracy and efficiency
of our engine with the relative error and time gain measures, respectively. The relative error (per query Qi) is 1 − min(Qi(G), QTi(Ĝ)) / max(Qi(G), QTi(Ĝ)) (in %), where Qi(G) computes (with PGX) the counting query Qi on the original graph, and QTi(Ĝ) computes (with our engine) the translated query QTi on the summary. The time gain is (tG − tĜ) / max(tG, tĜ) (in %), where tG and tĜ are the query evaluation times of Qi on the original graph and on the summary.
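A small Python rendering of the two measures (function names are ours; the guard for a zero count is our addition):

```python
def relative_error(q_full, q_summary):
    """Relative error (in %) between the counting-query answer on the
    original graph and on the summary: 1 - min(a, b) / max(a, b)."""
    lo, hi = min(q_full, q_summary), max(q_full, q_summary)
    return 0.0 if hi == 0 else (1 - lo / hi) * 100

def time_gain(t_full, t_summary):
    """Time gain (in %) of approximate evaluation on the summary."""
    return (t_full - t_summary) / max(t_full, t_summary) * 100
```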
For the Disjunction, Kleene-plus, Kleene-star, Optional and Single Label
query types, we have generated workloads of different sizes, bound by the num-
ber of labels in each dataset. For the concatenation workloads, we considered
binary conjunctive queries (CQs) without disjunction, recursion, or optionality.
Note that, currently, our summaries do not support compositionality.
Figure 9(a) and (b) show the relative error and average time gain for the
Disjunction, Kleene-plus, Kleene-star, Optional and Single Label workloads. In
Fig. 9(a), we note that the avg. relative error is kept low in all cases, bounded by 5.5% for the Kleene-plus and Kleene-star workloads of the social dataset.
In all the other cases, including the Kleene-plus and Kleene-star workloads of
the shop dataset, the error is relatively small (near 0%). This confirms the effec-
tiveness of our graph summaries for approximate evaluation of graph queries. In
Fig. 9(b), we studied the efficiency of approximate evaluation on our summaries
by reporting the time gain (in %) compared with the query evaluation on the
original graphs for the four datasets. We notice a positive time gain (≥75%)
in most cases, but for disjunction. While the relative approximation error is
still advantageous for disjunction, disjunctive queries are time-consuming for
Fig. 9. Rel. Error (a), Time Gain (b) per Workload, per Dataset, 200K nodes.
Fig. 10. Performance comparison: SumRDF vs. APP (our approach): approx. eval. of binary CQs, SELECT COUNT(*) MATCH Qi, on the summaries of a shop graph instance (31K nodes, 56K edges); comparing estimated cardinality (no. of computed answers), rel. error w.r.t. the original graph results, and query runtime.
SumRDF (see Fig. 10): we recorded an average relative error of estimation of only 0.15% vs. 2.5%, and an average query runtime of only 27.55 ms vs. 427.53
ms. As SumRDF does not support disjunctions, Kleene-star/plus queries and
optional queries, further comparisons were not possible.
6 Related Work
7 Conclusion
Our paper focuses on a novel graph summarization method that is suitable for
property graph querying. As the underlying MinSummary decision problem is NP-complete, this technique builds on a heuristic that compresses label frequency information in the nodes of the graph summary. We show the practical
effectiveness of our approach, in terms of compression ratios, error rates and
query evaluation time. As future work, we plan to investigate the feasibility of
our graph summary for other query classes, such as those described in [22]. Also,
we aim to apply formal methods, as described in [6], to ascertain the correctness
of our approximation algorithm, with provably tight error bounds.
References
1. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF
data management systems. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796,
pp. 197–212. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_13
2. Angles, R., et al.: G-CORE: a core for future graph query languages. In: SIGMOD,
pp. 1421–1432 (2018)
3. Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J.L., Vrgoc, D.: Founda-
tions of modern query languages for graph databases. ACM Comput. Surv. 50(5),
68:1–68:40 (2017)
4. Arenas, M., Conca, S., Pérez, J.: Counting beyond a Yottabyte, or how SPARQL
1.1 property paths will prevent adoption of the standard. In: WWW, pp. 629–638
(2012)
5. Bagan, G., Bonifati, A., Ciucanu, R., Fletcher, G.H.L., Lemay, A., Advokaat, N.:
gMark: schema-driven generation of graphs and queries. IEEE Trans. Knowl. Data
Eng. 29(4), 856–869 (2017)
6. Bonifati, A., Dumbrava, S., Arias, E.J.G.: Certified graph view maintenance with
regular datalog. TPLP 18(3–4), 372–389 (2018)
7. Bonifati, A., Fletcher, G., Voigt, H., Yakovets, N.: Querying Graphs. Synthesis
Lectures on Data Management. Morgan & Claypool Publishers (2018)
8. Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query
logs. PVLDB 11(2), 149–161 (2017)
9. Bonifati, A., Martens, W., Timm, T.: Navigating the maze of Wikidata query logs.
In: WWW, pp. 127–138 (2019)
10. Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: Rewriting of regular
expressions and regular path queries. J. Comput. Syst. Sci. 64(3), 443–465 (2002)
11. Cruz, I.F., Mendelzon, A.O., Wood, P.T.: A graphical query language supporting
recursion. In: SIGMOD, pp. 323–330 (1987)
12. Erling, O., et al.: The LDBC social network benchmark: interactive workload. In:
SIGMOD, pp. 619–630 (2015)
13. Fan, W., Li, J., Wang, X., Wu, Y.: Query preserving graph compression. In: SIG-
MOD, pp. 157–168 (2012)
14. Hernández, D., Hogan, A., Riveros, C., Rojas, C., Zerega, E.: Querying Wikidata:
comparing SPARQL, relational and graph databases. In: Groth, P., et al. (eds.)
ISWC 2016. LNCS, vol. 9982, pp. 88–103. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46547-0_10
15. Iyer, A.P., et al.: Bridging the GAP: towards approximate graph analytics. In:
GRADES, pp. 10:1–10:5 (2018)
16. Khan, A., Bhowmick, S.S., Bonchi, F.: Summarizing static and dynamic big graphs.
PVLDB 10(12), 1981–1984 (2017)
17. Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the
most out of Wikidata: semantic technology usage in Wikipedia’s knowledge graph.
In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137, pp. 376–394. Springer,
Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_23
18. Rudolf, M., Voigt, H., Bornhövd, C., Lehner, W.: SynopSys: foundations for multi-
dimensional graph analytics. In: Castellanos, M., Dayal, U., Pedersen, T.B., Tatbul,
N. (eds.) BIRTE 2013-2014. LNBIP, vol. 206, pp. 159–166. Springer, Heidelberg
(2015). https://doi.org/10.1007/978-3-662-46839-5_11
19. Stefanoni, G., Motik, B., Kostylev, E.V.: Estimating the cardinality of conjunctive
queries over RDF data using graph summarisation. In: WWW, pp. 1043–1052
(2018)
20. Tian, Y., Hankins, R.A., Patel, J.M.: Efficient aggregation for graph summariza-
tion. In: SIGMOD, pp. 567–580. ACM (2008)
21. Valiant, L.G.: The complexity of enumeration and reliability problems. SIAM J.
Comput. 8(3), 410–421 (1979)
22. Valstar, L.D.J., Fletcher, G.H.L., Yoshida, Y.: Landmark indexing for evaluation
of label-constrained reachability queries. In: SIGMOD, pp. 345–358 (2017)
23. Wood, P.T.: Query languages for graph databases. SIGMOD Rec. 41(1), 50–60
(2012)
24. Wu, Y., Yang, S., Srivatsa, M., Iyengar, A., Yan, X.: Summarizing answer graphs
induced by keyword queries. PVLDB 6(14), 1774–1785 (2013)
Learning from Imprecise Data:
Adjustments of Optimistic and
Pessimistic Variants
1 Introduction
Superset learning is a specific type of learning from weak supervision, in which
the outcome (response) associated with a training instance is only characterized
in terms of a set of possible candidates. There are numerous applications in which
supervision is partial in that sense [9]. Correspondingly, the superset learning
problem has received increasing attention in recent years, and has been studied
under various names, such as learning from ambiguously labelled examples or
learning from partial labels [2,10]. The contributions so far also differ with regard
to their assumptions on the incomplete information being provided, and how it
has been produced. In this paper, we only assume the actual outcome to be covered by the observed set of candidates; hence the name superset learning.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 266–279, 2019.
https://doi.org/10.1007/978-3-030-35514-2_20
Optimistic and Pessimistic Learning from Imprecise Data 267
In spite of the ambiguous, set-valued training data, the goal that is commonly considered in superset learning is to induce a unique model, or a set of models that are all deemed optimal (in the sense of fitting the observed data equally
for a set of incomparable, undominated models, resulting for instance from the
interval order induced by set-valued loss functions [3], or by the application of
conservative, imprecise Bayesian updating rules [11].
In this paper, we reconsider the principle of generalized loss minimization
based on the so-called optimistic superset loss (OSL) as introduced in [7]. To
better understand its nature and possible deficiencies, we contrast the latter with another, in a sense diametrically opposed, approach based on a "pessimistic" inference
principle. Moreover, to compensate for a bias that might be caused by an overly
optimistic attitude, we propose an adjustment of the OSL, which can be seen as
a counterpart of a corresponding modification of the pessimistic approach [6].
Presenting the various methods within a common framework of loss minimization
in supervised learning allows us to highlight some important properties and
differences through illustrative examples.
2 Preliminaries
2.1 Setting and Notation
The OSL was introduced in a standard setting of supervised learning with an
input (instance) space X and an output space Y. The goal is to learn a mapping
from X to Y that captures, in one way or the other, the dependence of outputs
(responses) on inputs (predictors). The learning problem essentially consists of
choosing an optimal model (hypothesis) h∗ from a given model space (hypothesis
space) H, based on a set of training data

D = {(x_n, y_n)}_{n=1}^N ∈ (X × Y)^N. (1)

In superset learning, however, the learner does not have direct access to the (precise) data (1), but only to the (imprecise, coarse, ambiguous) observations

O = {(x_n, Y_n)}_{n=1}^N ∈ (X × 2^Y)^N. (3)
The optimistic approach evaluates a model hθ by

R_emp^OPT(θ) := min_{y ∈ Y_1 × · · · × Y_N} (1/N) Σ_{n=1}^N L(y_n, hθ(x_n)), (4)

i.e., in terms of the empirical risk of hθ in the case of a most favourable selection of the outcomes y_n. Moreover, given a loss L that is decomposable (over examples), the "optimism" can be moved into the loss:

θ∗ := argmin_{θ ∈ Θ} R_emp^OPT(θ) = argmin_{θ ∈ Θ} (1/N) Σ_{n=1}^N L_O(Y_n, hθ(x_n)), (5)
Analogously, the pessimistic approach evaluates hθ by

R_emp^PESS(θ) := max_{y ∈ Y_1 × · · · × Y_N} (1/N) Σ_{n=1}^N L(y_n, hθ(x_n)), (7)

which, for a decomposable loss, leads to

θ∗ := argmin_{θ ∈ Θ} R_emp^PESS(θ) = argmin_{θ ∈ Θ} (1/N) Σ_{n=1}^N L_P(Y_n, hθ(x_n)) (8)
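For a decomposable loss, (5) and (8) amount to plugging the per-example minimum (OSL) or maximum (PSL) over the candidate outcomes into the empirical risk. A minimal Python sketch (all names are ours):

```python
def osl(loss, Y, y_hat):
    """Optimistic superset loss: best case over the candidate outcomes."""
    return min(loss(y, y_hat) for y in Y)

def psl(loss, Y, y_hat):
    """Pessimistic superset loss: worst case over the candidate outcomes."""
    return max(loss(y, y_hat) for y in Y)

def emp_risk(superset_loss, loss, data, predict):
    """Empirical risk on imprecise data [(x_1, Y_1), ..., (x_N, Y_N)];
    for decomposable losses, the min/max over instantiations moves
    inside the sum, as in (5) and (8)."""
    return sum(superset_loss(loss, Y, predict(x)) for x, Y in data) / len(data)
```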
3 Illustrative Examples
Which of the two approaches to superset learning is more reasonable, the opti-
mistic or the pessimistic one? This question is difficult (or actually impossible)
to answer without further assumptions on the coarsening process, i.e., the pro-
cess that turns precise data into imprecise observations. In the following, to get
a better idea of the nature of the two approaches, we illustrate them by some
simple examples. We shall refer to the optimistic approach (based on the OSL) as OPT and to the pessimistic one (based on the PSL) as PESS.
Thus, the loss is 0 if the prediction is inside the interval, i.e., if the regression
function intersects with the interval, and grows quadratically with the distance
from the interval outside. A small one-dimensional example of a set of interval-
valued data together with a regression line minimizing (5) is shown in Fig. 2
(left).
The PSL version (9) of the squared error loss is given as follows (cf. Fig. 1):

L_P([y_min, y_max], ŷ) = (y_max − ŷ)²  if ŷ < (y_min + y_max)/2,
                         (ŷ − y_min)²  if ŷ ≥ (y_min + y_max)/2. (11)
270 E. Hüllermeier et al.
Fig. 1. The OSL (solid line in blue) and PSL (dashed line in red) as extensions of the
squared error loss (gray line) in the case of an interval-valued observation (here the
interval [−1, 1], indicated by the vertical lines). (Color figure online)
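The two interval extensions of the squared error loss shown in Fig. 1 can be written down directly; a small sketch (function names are ours), with the OSL case following the description in the text (zero inside the interval, squared distance to the nearest endpoint outside):

```python
def osl_interval(y_min, y_max, y_hat):
    """Optimistic superset loss for an interval observation under squared
    error: zero inside the interval, squared distance to the nearest
    endpoint outside of it."""
    if y_hat < y_min:
        return (y_min - y_hat) ** 2
    if y_hat > y_max:
        return (y_hat - y_max) ** 2
    return 0.0

def psl_interval(y_min, y_max, y_hat):
    """Pessimistic superset loss (11): squared distance to the farther
    endpoint; minimized at the interval midpoint."""
    if y_hat < (y_min + y_max) / 2:
        return (y_max - y_hat) ** 2
    return (y_hat - y_min) ** 2
```

For the interval [−1, 1] of Fig. 1, the OSL vanishes on the whole interval, while the PSL is minimized only at its midpoint 0.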
As can be seen in Fig. 1, the PSL targets the midpoint of the interval as an optimal "compromise value"; this point minimizes the maximal possible prediction error, and hence the loss function. Moreover, the larger the interval, the more strongly the loss function increases. Therefore, PESS is very similar to weighted linear
regression, where the weight of an example increases with the width of the cor-
responding interval. The OSL behaves in a quite different way: the larger the
interval, the smaller the loss function. Moreover, OSL does not prefer any values
inside the interval (e.g., the midpoint) to any other values. Note that, if the data
is completely coherent with a (noise-free) linear model, i.e., if there is a regres-
sion function intersecting all intervals, then any such function will be optimal
for OPT, while this is not necessarily the case for PESS, as PESS may prefer
a function not intersecting all intervals (see Fig. 2 (right) for an illustration).
Obviously, since the OSL is no longer strictly convex (in contrast with PSL), the
optimisation problem solved by OPT may no longer have a unique solution.
Fig. 2. Left: Linear regression with interval-valued data. Right: Comparison between
PESS and OPT for linear regression.
We can also compare OPT and PESS from the point of view of model updating
or revision in the case where new data is observed. Imagine, for example, that
a new data point (xN +1 , YN +1 ) is added to the data seen so far. OPT will
check for how compatible its current model is with the interval YN +1 and make
adjustments only if necessary. In particular, if ŷN +1 = hθ (xN +1 ) ∈ YN +1 , i.e.,
the interval includes the current prediction, the model will not be changed at all,
as it is considered fully coherent with the new observation. This also implies that
an extremely wide interval will be ignored as being completely uninformative.
PESS, on the other hand, will always change its current estimate θ, unless ŷN+1 =
hθ (xN +1 ) corresponds exactly to the midpoint of YN +1 ; this is because any
deviation from this “perfect” prediction is considered as a mistake (or at least a
suboptimal choice) that ought to be mitigated.
From the above comments, it is clear that the two strategies may behave
quite differently on the same data. OPT assumes that Yn is a set of candidate
values, one of which corresponds to the true measurement. Therefore, fitting
one of these candidates, namely the one that is maximally coherent with the
model assumption and the rest of the data, is enough. As opposed to this, PESS
seeks to fit all values yn ∈ Yn simultaneously, i.e., to find a good compromise
prediction ŷn that is not in conflict with any of the candidates.
It appears that OPT proceeds from a disjunctive interpretation of the set
Yn , and considers that the true data will not be chosen so as to systematically
put the assumed model in default. In contrast, PESS is more in line with a
conjunctive interpretation, which makes sense if all the candidates are indeed
guaranteed to be possible measurements. One could imagine, for example, that
xn actually characterizes a whole set of entities, and that Yn is the collection of
outputs associated with these entities. As an illustration, suppose we would like
to learn a control rule that prescribes an autonomous car the strength of braking
depending on its current speed x. Since the optimal strength will also depend
on other factors (such as weather conditions), which are ignored (or “integrated
out”) here, training examples might be interval-valued. For example, depending
on further unknown conditions, the optimal strength could be in-between ymin
and ymax for a speed of x km/h. Adopting a "cautious" model, which minimizes
the worst mistake it can make, may look like a reasonable strategy then.
In the case of precise data, empirical risk minimization yields

θ∗ = argmin_{θ ∈ Θ} Σ_{n=1}^N L(y_n, hθ(x_n))
with

L(y, p) = − log( py + (1 − p)(1 − y) ) =
    − log(p)      if y = 1,
    − log(1 − p)  if y = 0.
Using the representation (12) for the probability p, and the class encoding Y =
{−1, +1} instead of Y = {0, 1}, the loss can also be written as follows:
L(y, s) = log(1 + exp(−ys)),

where s = ⟨θ, x⟩ is the predicted score and ys is the margin, i.e., the distance from the decision boundary (to the right side); see Fig. 3.
Fig. 3. OSL (blue, solid line) and PSL (red, dashed line) for the logistic loss function.
(Color figure online)
Since Y = {−1, +1} contains only two elements, there is only one genuinely imprecise observation that can be made, namely Y = {−1, +1} (the whole output space), and the setting reduces to so-called semi-supervised learning (with a part of the data being precisely labeled, and another part without any supervision). Thus, the OSL is
given by
L_O(Y, s) = L(−1, s)                  if Y = {−1},
            L(+1, s)                  if Y = {+1},
            min{L(−1, s), L(+1, s)}   if Y = {−1, +1},
and the pessimistic version LP by the same expression with min in the third case
replaced by max. As a consequence, if an imprecise observation is made, OPT
will try to disambiguate, i.e., to choose θ such that ys = y⟨θ, x⟩ is large (and
hence p is close to 0 or close to 1); this is in line with a large margin approach,
i.e., the learner tries to move the decision boundary away from the data points.
Indeed, the generalized loss LO can be seen as the logistic version of the “hat
loss” that is used in semi-supervised learning of support vector machines [1].
As opposed to this, PESS will try to choose θ such that s ≈ 0 and hence p ≈ 1/2. Obviously, this may lead to drastically different solutions. An example
is shown in Fig. 4, where a few labeled training examples are given (positive
in blue and negative in red) and many unlabeled. OPT seeks to maximize the
margin of the decision boundary, and hence puts it in-between the two clusters.
This is in line with the goal of disambiguation: ideally, the unlabeled examples
are far from the decision boundary, which means they are clearly identified as
positive or negative. PESS is doing exactly the opposite and tries to have the
unlabeled examples close to the decision boundary.
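The two behaviours can be sketched directly from the logistic loss (names are ours); OSL is small for large margins, while PSL is minimized at score zero, i.e., on the decision boundary:

```python
import math

def log_loss(y, s):
    """Logistic loss for a label y in {-1, +1} and a predicted score s."""
    return math.log(1 + math.exp(-y * s))

def osl_logistic(Y, s):
    """Optimistic superset loss; for Y = {-1, +1} this is the logistic
    analogue of the semi-supervised 'hat loss'."""
    return min(log_loss(y, s) for y in Y)

def psl_logistic(Y, s):
    """Pessimistic superset loss; for Y = {-1, +1} it is minimized at
    s = 0, i.e. on the decision boundary."""
    return max(log_loss(y, s) for y in Y)
```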
Fig. 4. Logistic regression in a semi-supervised setting: Solutions for OPT and PESS.
(Color figure online)
This example suggests that PESS is not really appropriate for tackling dis-
criminative learning tasks. To be fair, however, one has to acknowledge that
PESS may produce more reasonable results in other scenarios. For example, if
the unlabeled examples are not chosen arbitrarily but indeed correspond to those
cases that are very close to the true decision boundary, i.e., for which the posterior probability is indeed close to 1/2, and which could hence be hard to label,
then PESS is just doing the right thing.
As another rather extreme example, suppose that the precise observations
in Fig. 4 are just the “noisy” cases, whereas all “normal” cases are hidden (the
blue class is actually in the upper right and the red class in the lower left). One
can imagine, for example, an “adversarial” coarsening process that coarsens all
normal cases and only reveals the noise in the data. In this scenario, it is clear
that OPT will be completely misled and produce exactly the opposite of the
right model. In such adversarial settings [8], PESS (and more generally minimax
approaches) may indeed be considered a more reasonable strategy, as it may
provide some guarantees in terms of protection with regard to the coarsening
process. Anyway, what all these examples are showing is that the reasonableness
of an approach strongly depends on which assumptions about the coarsening
process can be considered as plausible.
Consider, as an example, the sequence of observations

1, 0, ?, 0, ?, 1, 1, 1, ?, ?,
with p positive outcomes indicated by a 1 (e.g., a coin toss landing heads up),
n negative outcomes indicated by a 0, and u unknowns indicated by a ?. One
can check that, in the case where p > n, OPT will produce the estimate θ∗ = (p + u)/(p + u + n), based on a corresponding disambiguation in which each unknown is resolved as a positive outcome when minimizing the (negative) log-likelihood

L(θ) = − Σ_{i=1}^N [ X_i log(θ) + (1 − X_i) log(1 − θ) ].
Such an estimate may appear somewhat implausible. Why should all the
unknowns be positive? Of course, one may not exclude that the coarsening pro-
cess is such that only positives are hidden. In that case, OPT will exactly do the
right thing. Still, the estimate remains rather extreme and hence arguable.
In contrast, PESS would try to maximize the entropy of the estimated dis-
tribution [4, Corollary 1], which is equivalent to having θ∗ = 1/2 in the example
given above. While such an estimate may seem less extreme and more reason-
able, there is again no compelling reason to consider it more (or less) legitimate
than the one obtained by OPT, unless further assumptions are made about the coarsening process. Finally, note that neither OPT nor PESS can produce
the estimate obtained by the classical coarsening-at-random (CAR) assumption,
which would give θ∗ = 2/3.
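The three estimates can be computed directly; a small sketch (function name ours), using p = 4, n = 2, u = 4 for the sequence above:

```python
def coin_estimates(p, n, u):
    """Estimates of the success probability from p positives, n negatives
    and u unknowns (assuming p > n): optimistic (all unknowns resolved
    as positives), pessimistic (maximum entropy), and
    coarsening-at-random (unknowns simply ignored)."""
    opt = (p + u) / (p + u + n)
    pess = 1 / 2
    car = p / (p + n)
    return opt, pess, car
```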
As a first remark, let us repeat that generalized loss minimization based
on OSL was actually not intended, or at least not motivated, by this sort of
problem. To explain this point, let us compare the above (statistical estimation)
example of coin tossing with the previous (machine learning) example of logistic
regression. In fact, the former can be seen as a special case of the latter, with
an instance space X = {x0 } consisting of a single instance, such that θ = p(y =
1 | x0 ). Correspondingly, since X has no structure, it is impossible to leverage
any structural assumptions about the sought model h : X −→ Y, which is the
basis of the idea of data disambiguation as performed by OPT.
In particular, in the case of coin flipping, each ? can be replaced by any (hypo-
thetical) outcome, independently of all others and without violating any model
assumptions. In other words, every instantiation of the coarse data is as plausible
as any other. This is in sharp contrast with the case of logistic regression, where
the assumption of a linear model, i.e., the assumption that the probability of
success for an input x depends on the spatial position of that point, lets many
disambiguations appear implausible. For example, in Fig. 5, the instantiation in
Fig. 5. Coarse data (left) together with two instantiations (middle and right).
the middle, where half of the unlabeled examples are disambiguated as positive
and the other half as negative, is clearly more coherent with the assumption of
(almost) linearly separable classes than the instantiation on the right, where all
unknowns are assigned to the positive class.
In spite of this, examples like the one of coin tossing are indeed suggesting
that OSL might be overly optimistic in certain cases. Even in discriminative
learning, OSL makes the assumption that the chosen model class is the right
one, which may lead to overly confident results should the model choice be
wrong. This motivates a reconsideration of the optimistic inference principle
and perhaps a suitable adjustment.
A noticeable property of the previous coin tossing example is a bias of the estima-
tion (or learning) process, which is caused by the fact that a higher likelihood can
principally be achieved with a more extreme θ. For example, with θ ∈ {0, 1}, the
probability of an "ideal" sample is 1, whereas for θ = 1/2, the highest probability achievable on a sample of size N is (1/2)^N. Thus, it seems that, from the very
beginning, the candidate estimate θ = 1/2 is put at a systematic disadvantage.
This can also be seen as follows: Consider any sample produced by θ = 1,
i.e., a sequence of tosses with heads up. When coarsening the data by covering
a subset of the sample, OPT will still produce θ = 1 as an estimate. Roughly
speaking, θ = 1 is “robust” toward coarsening. As opposed to this, when coars-
ening a sample produced with θ = 1/2, OPT will diverge and either produce a
smaller or a larger estimate.
When learning from imprecise data², one may start with a prior π on θ and look at the highest posterior³

max_{y ∈ Y_1 × · · · × Y_N}  p(y | θ) π(θ) / p(y),

or, equivalently,

max_y [ log p(y | θ) − H(θ, y) ] = max_y [ Σ_{n=1}^N log p(y_n | θ) − H(θ, y) ] (13)

with

H(θ, y) := log p(y) − log π(θ). (14)
At the level of loss minimization, when ignoring the role of y in (14), this app-
roach essentially comes down to adding a regularization term to the empirical
risk, and hence to minimizing the regularized OSL
R_reg^OPT(θ) := (1/N) Σ_{n=1}^N L_O(Y_n, hθ(x_n)) + F(hθ), (15)
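For coin tossing, the maximization in (13) can be carried out by brute force: for a fixed θ, the best instantiation resolves each unknown to the likelier outcome, so every ? contributes log max(θ, 1 − θ). A grid-search sketch under this simplification (ignoring p(y), as discussed in the text; the names and the Beta-prior parametrization are ours):

```python
import math

def beta_log_pdf(theta, a, b):
    """Unnormalized log-density of a Beta(a, b) prior; the normalizing
    constant does not affect the argmax."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def regularized_osl_estimate(p, n, u, a, b, grid=10001):
    """Grid search for the maximizer of (13) in the coin-tossing case:
    for a fixed theta, the best instantiation resolves each unknown to
    the likelier outcome, contributing log max(theta, 1 - theta)."""
    best_theta, best_val = None, -math.inf
    for k in range(1, grid):
        theta = k / grid
        val = (p * math.log(theta) + n * math.log(1 - theta)
               + u * math.log(max(theta, 1 - theta))
               + beta_log_pdf(theta, a, b))
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta
```

With a flat Beta(1,1) prior and the sample above (p = 4, n = 2, u = 4), the grid search recovers the OPT estimate 0.8; a Beta(5,5) prior pulls the estimate toward 1/2.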
In this case, F(θ) can again be moved inside the loss function L_O in (15):

R_reg^OPT(θ) := (1/N) Σ_{n=1}^N L̃_O(Y_n, hθ(x_n)) (17)

with the adjusted loss

L̃_O(Y, ŷ) := min_{y ∈ Y} L(y, ŷ) − min_{y ∈ 𝒴} L(y, ŷ), (18)

where the second minimum ranges over the entire output space 𝒴.
For some losses, such as squared error loss in regression, the adjustment (18)
has no effect, because L(y, ŷ) = 0 can always be achieved for at least one y ∈ Y.
For others, however, the adjusted loss L̃_O may indeed differ from L_O. For the log-loss in binary
² We assume the x_n in the data {(x_n, y_n)}_{n=1}^N to be fixed.
³ The obtained bounds are similar to the upper expectation bound obtained by the updating rule discussed by Zaffalon and Miranda [11] in the case of a completely unknown coarsening process and precise prior information. However, Zaffalon and Miranda discussed generic robust updating schemes leading to sets of probabilities or sets of models, which is not the intent of the methods discussed in this paper.
Fig. 6. The adjusted OSL version (19) of the logistic loss (black line) compared to the
original version (red line). (Color figure online)
classification, for example, the normalizing term in (18) is min{L(0, p), L(1, p)},
which means that
L̃_O(Y, p) = log(1 − p) − log(p)   if Y = {1}, p < 1/2,
            log(p) − log(1 − p)   if Y = {0}, p > 1/2, (19)
            0                     otherwise.
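The piecewise form (19) can be checked directly against definition (18); a small sketch (names are ours):

```python
import math

def log_loss(y, p):
    """Log-loss for a Bernoulli prediction p of the event y = 1."""
    return -math.log(p) if y == 1 else -math.log(1 - p)

def adjusted_osl(Y, p):
    """Adjusted OSL (18) for binary log-loss: the optimistic loss minus
    the normalizing term min{L(0, p), L(1, p)} taken over the full
    label space."""
    return min(log_loss(y, p) for y in Y) - min(log_loss(0, p), log_loss(1, p))
```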
which now depends on y but not on θ (whereas the F in (15) depends on θ but
not on y). Obviously, the min-max regret principle is less pessimistic than the
original PSL, and leads to an adjustment of PESS that is even somewhat compa-
rable to OPT: The loss of a candidate θ on an instantiation y is corrected by the
Fig. 7. Loss functions and optimal predictions of θ (minima of the losses indicated by vertical lines) in the case of coin tossing with observations 0, 0, 1, 1, 1, 1, ?, ?, ?: solid blue line for OSL, dashed blue for the regularized OSL version (14) with π the Beta(5,5) distribution, solid red for PSL, and dashed red for the adjusted PSL (20). (Color figure online)
minimal loss F (y) that can be achieved on this instantiation. Obviously, by doing
so, the influence of instantiations that necessarily cause a high loss is reduced.
But these instantiations are exactly those that are considered as “implausible”
and down-weighted by OPT (cf. Sect. 3.3). See Fig. 7 for an illustrative compar-
ison in the case of coin tossing as discussed in Sect. 3.3. Note that (20) does not
permit an additive decomposition into losses on individual training examples,
because the regret is defined on the entire set of data. Instead, a generalization
of (20) to loss functions other than log-loss suggests evaluating each θ in terms
of the maximal regret
MReg(θ) := max_{y ∈ Y_1 × · · · × Y_N} [ R_emp(θ, y) − min_{θ̂} R_emp(θ̂, y) ], (21)
where Remp (θ, y) denotes the empirical risk of θ on the data obtained for the
instantiation y. Computing the maximal regret (21), let alone finding the mini-
mizer θ∗ = argminθ MReg(θ), appears to be intractable except for trivial cases.
In particular, the problem will be hard in cases like logistic regression, where
the empirical risk minimizer minθ̂ Remp (θ̂, y) cannot be obtained analytically,
because then even the evaluation of a single candidate θ on a single instantia-
tion y requires the solution of a complete learning task—not to mention that
the minimization over all instantiations y comes on top of this.
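For intuition, the maximal regret (21) can be brute-forced in the simple coin-tossing setting of Fig. 7, where the inner minimisation is just a one-dimensional search. The grid, variable names and observation encoding below are ours; this is an illustrative sketch, not the authors' implementation.

```python
import itertools
import math

# Observed coin tosses: six known outcomes and three missing ones ("?"),
# as in the coin-tossing example of Fig. 7.
known = [0, 0, 1, 1, 1, 1]
n_missing = 3

# Candidate parameters theta = P(heads), on a coarse grid.
grid = [i / 1000 for i in range(1, 1000)]

def emp_risk(theta, y):
    """Empirical log-loss risk of theta on a full instantiation y."""
    return -sum(math.log(theta) if yi == 1 else math.log(1 - theta) for yi in y)

# All 2^3 instantiations of the missing outcomes.
instantiations = [known + list(c) for c in itertools.product([0, 1], repeat=n_missing)]

# Best achievable risk on each instantiation (the inner minimisation in (21)).
best = [min(emp_risk(t, y) for t in grid) for y in instantiations]

def max_regret(theta):
    """MReg(theta): worst-case regret over all instantiations."""
    return max(emp_risk(theta, y) - b for y, b in zip(instantiations, best))

# Minimizer of the maximal regret over the grid.
theta_star = min(grid, key=max_regret)
```

The exponential enumeration of instantiations is exactly what makes the general problem intractable, as noted above.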
5 Concluding Remarks
The goal of our discussion was to provide some insight into the basic nature of
the “optimistic” and the “pessimistic” approach to learning from imprecise data.
To this end, we presented both of them in a unified framework and highlighted
important properties and differences through illustrative examples.
Optimistic and Pessimistic Learning from Imprecise Data 279
On Cautiousness and Expressiveness
in Interval-Valued Logic
1 Introduction
Logical frameworks have always played an important role in artificial intelligence,
and adding weights to logical formulas allows one to deal with a variety of
problems with which classical logic struggles [3].
Usually, such weights are assumed to be precisely given, and associated to
an aggregation function, such as the maximum in possibilistic logic [4] or the
sum in penalty logic [5]. These approaches can typically find applications in
non-monotonic reasoning [1] or preference handling [7].
However, as providing specific weights to each formula is likely to be a
cognitively demanding task, many authors have considered extensions of these
frameworks to interval-valued weights [2,6], where intervals are assumed to con-
tain the true, ill-known weights. Such approaches can also be used, for instance,
to check how robust conclusions obtained with precise weights are.
In this paper, we are interested in making cautious or robust inferences in
such interval-valued frameworks. That is, we look for inference tools that will
typically result in a partial order over the interpretations or world states, such
that any preference statement made by this partial order is made in a skeptical
way, i.e., it holds for any replacement of the weights by precise ones within the
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 280–288, 2019.
https://doi.org/10.1007/978-3-030-35514-2_21
On Cautiousness and Expressiveness in Interval-Valued Logic 281
intervals, and should not be reversed when gaining more information. We simply
assume that the weights are positive and aggregated by a quite generic function,
meaning that we include for instance possibilistic and penalty logics as special
cases.
We provide the necessary notations and basic material in Sect. 2. In Sect. 3,
we introduce different ways to obtain partial orders over interpretations, and
discuss different properties that corresponding cautious inference tools could or
should satisfy. Namely, that reducing the intervals will provide more informative
and non-contradictory inferences, and that if an interpretation falsifies a subset
of the formulas falsified by another one, then it should be at least as good as this
latter one. Section 4 shows which of the introduced inference tools satisfy which
property.
2 Preliminaries
We consider a finite propositional language L. We denote by Ω the space of all
interpretations of L, and by ω an element of Ω. Given a formula φ, ω is a model
of φ if it satisfies it, denoted ω |= φ.
A weighted formula is a pair ⟨φ, α⟩, where α represents the importance
of the rule, and the penalty incurred if it is not satisfied. This weight may be
understood in various ways: as a degree of certainty, as a degree of importance of
an individual preference, etc. We assume that α takes its values in an interval
of R+ , possibly extended to include ∞ (e.g., to represent formulas that cannot
be falsified). In this paper, a formula with α = 0 is understood as a totally
unimportant formula that can be ignored, while a formula with maximal α is a
formula that must be satisfied.
A (precisely) weighted knowledge base K = {⟨φi , αi ⟩ : i = 1, . . . , n} is a set of
distinct weighted formulas. Since these formulas are weighted, an interpretation
can (and sometimes must, if K without weights is inconsistent) falsify some of
them, and still be considered as valid. In order to determine an ordering between
different interpretations, we introduce two new notations:
Given an interpretation ω, FK (ω) denotes the set of formulas of K falsified by ω,
and FK (ω \ ω′) := FK (ω) \ FK (ω′) the set of formulas falsified by ω but not by ω′.
Weights are aggregated by a function ag, typical instances being possibilistic logic
(weights are in [0, 1] and ag = max) or penalty logic (weights are positive reals
and ag = Σ). Based on this aggregation function, we define for a given K the two
following complete orderings between interpretations when weights are precise:
– ω ≼K_All ω′ iff ag({αi : φi ∈ FK (ω)}) ≤ ag({αi : φi ∈ FK (ω′)});
– ω ≼K_Diff ω′ iff ag({αi : φi ∈ FK (ω \ ω′)}) ≤ ag({αi : φi ∈ FK (ω′ \ ω)}).
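As an illustrative sketch (the knowledge base, weights and names below are ours), both orderings can be computed directly from the falsified-formula sets when weights are precise and ag = Σ, as in penalty logic:

```python
from itertools import product

# A toy weighted knowledge base over variables (p, q), with weights
# aggregated by the sum as in penalty logic (formulas and weights are ours).
kb = [
    (lambda p, q: p,        1.0),  # phi1 = p
    (lambda p, q: p and q,  2.0),  # phi2 = p ∧ q
    (lambda p, q: not q,    0.5),  # phi3 = ¬q
]

ag = sum  # aggregation function of penalty logic

def falsified(omega):
    """Indices of the formulas of kb falsified by interpretation omega."""
    return {i for i, (phi, _) in enumerate(kb) if not phi(*omega)}

def leq_all(o1, o2):
    """o1 ≼_All o2: compare the aggregated weights of all falsified formulas."""
    return ag(kb[i][1] for i in falsified(o1)) <= ag(kb[i][1] for i in falsified(o2))

def leq_diff(o1, o2):
    """o1 ≼_Diff o2: compare only the formulas not falsified by both."""
    f1, f2 = falsified(o1), falsified(o2)
    return ag(kb[i][1] for i in f1 - f2) <= ag(kb[i][1] for i in f2 - f1)

interpretations = list(product([False, True], repeat=2))
```

For ag = Σ the two orderings coincide on such examples, since the weights of commonly falsified formulas cancel out on both sides.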
Proposition 1. Given a knowledge base K, ω ≼K_All ω′ ⇔ ω ≼K_Diff ω′.
Proof. Let us denote the sets {αi : φi ∈ FK (ω)} and {αi : φi ∈ FK (ω′)} as real-
valued vectors a = (x1 , . . . , xn , yn+1 , . . . , yn_a ) and b = (x1 , . . . , xn , zn+1 , . . . ,
zn_b ), where x1 , . . . , xn are the weights associated to the formulas that both
interpretations falsify. Showing the equivalence of Proposition 1 then comes down
to show
These two orderings can then be applied either to All or Diff, resulting in
four different extensions: ≼All,L , ≼All,S , ≼Diff,L , ≼Diff,S . A first remark is that
strict comparisons are stronger than lattice ones, as the former imply the latter;
that is, if [a, b] ≼S [c, d], then [a, b] ≼L [c, d]. In order to decide which of these
orderings are the most adequate, let us first propose some properties they should
follow when one wants to perform cautious inferences.
ω ≺K2 ω′ =⇒ ω ≺K1 ω′.
That is, the more we gain information, the better we become at differenti-
ating and ranking interpretations. If ω is strictly preferred to ω′ before getting
more precise assessments, it should remain so after the assessments become more
284 S. Destercke and S. Lagrue
This principle is quite intuitive: if we are sure that ω falsifies the same
formulas as ω′ in addition to some others, then certainly ω should be less
preferable/certain than ω′.
Let us now discuss the different partial orders in light of these properties, starting
with the lattice orderings and then proceeding to interval orderings.
Let us first show that All,L , Dif f,L do not satisfy Property 1 in general, by
considering the following example:
Example 1. Consider the case where ai , bi ∈ R and ag = Σ, with the following
knowledge base on the propositional variables {p, q}:
φ1 = p, φ2 = p ∧ q, φ3 = ¬q.
       p  q   ag (K1 )      ag (K2 )      ag (K3 )
  ω0   0  0   [2.5, 6.5]    [6.5, 6.5]    [6.5, 6.5]
  ω1   0  1   [3.5, 11.5]   [7.5, 11.5]   [7.5, 7.5]
  ω2   1  0   [0, 4]        [4, 4]        [4, 4]
  ω3   1  1   [1, 5]        [1, 5]        [1, 1]
(Figure omitted: Hasse diagrams of ≼K1_All,L , ≼K2_All,L and ≼K3_All,L over
ω0 , . . . , ω3 ; the relative position of ω2 and ω3 changes as the intervals tighten.)
It should be noted that what happens to ω2 , ω3 for ≼All,L is also true for
≼Diff,L . Indeed, FK (ω2 ) = {p ∧ q} and FK (ω3 ) = {¬q}, hence FK (ω2 \ ω3 ) =
FK (ω2 ) and FK (ω3 \ ω2 ) = ∅. However, we can show that the two orderings
based on lattice do satisfy subset monotonicity.
Proposition 2. Given a knowledge base K, the two orderings ≼K_All,L and
≼K_Diff,L satisfy subset monotonicity.
Proof. For ≼Diff,L , it is sufficient to notice that if FK (ω) ⊆ FK (ω′), then
FK (ω \ ω′) = ∅. This means that ag({αi : φi ∈ FK (ω \ ω′)}) = [0, 0], hence we
necessarily have ω ≼K_Diff,L ω′.
For ≼All,L , the fact that FK (ω) ⊆ FK (ω′) means that the vectors a and a′
of lower values associated to {αi : φi ∈ FK (ω)} and {αi : φi ∈ FK (ω′)} will be
of the kind a′ = (a, a′1 , . . . , a′m ), hence we will have ag(a) ≤ ag(a′ ). The same
reasoning applied to upper bounds means that we will also have ag(ā) ≤ ag(ā′ ),
meaning that ω ≼K_All,L ω′.
From this, we deduce that lattice orderings will tend to be too informative
for our purpose4 , i.e., they will induce preferences between interpretations that
should be absent if we want to make only those inferences that are guaranteed
(i.e., hold whatever the value chosen within the intervals Ii ).
Proof. Assume that [a, b] and [c, d] are the intervals obtained from K2 respec-
tively for ω and ω′ after aggregation has been performed, with b ≤ c, hence
ω ≼K2_•,S ω′ with • ∈ {All, Diff }.
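The two interval comparisons at work in these proofs can be sketched as follows (a minimal sketch; the componentwise reading of ≼L is our assumption, consistent with the fact that ≼S implies ≼L):

```python
def leq_lattice(i1, i2):
    """[a, b] ≼_L [c, d]: componentwise (lattice) comparison of the bounds."""
    (a, b), (c, d) = i1, i2
    return a <= c and b <= d

def leq_strict(i1, i2):
    """[a, b] ≼_S [c, d]: every value of [a, b] lies below every value of [c, d]."""
    (a, b), (c, d) = i1, i2
    return b <= c
```

Whenever b ≤ c with a ≤ b and c ≤ d, we also get a ≤ c and b ≤ d, so any strict comparison is also a lattice comparison, while the converse may fail (e.g. [0, 3] ≼L [2, 4] but not [0, 3] ≼S [2, 4]).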
Let us now look at the property of subset monotonicity. From the knowledge
base K1 in Example 1, one can immediately see that ≼All,S is not subset mono-
tonic, as FK (ω3 ) ⊆ FK (ω1 ) and FK (ω2 ) ⊆ FK (ω0 ) ⊆ FK (ω1 ), yet all intervals in
Table 1 overlap, meaning that all interpretations are incomparable. Hence ≼All,S
will usually not be as informative as we would like a cautious ranking procedure
to be. This is mainly due to the presence of redundant variables, or common
formulas, in the comparison of interpretations. In contrast, ≼Diff,S does not
suffer from the same defect, as the next proposition shows.
Proposition 4. Given a knowledge base K, the ordering ≼Diff,S satisfies subset
monotonicity.
Hence, the ordering ≼Diff,S satisfies all the properties we have considered desir-
able in our framework. It does not add unwanted comparisons, while not losing
information that could be deduced without knowing the weights.
⁴ Which does not prevent them from being suitable for other purposes.
(Figure omitted: Hasse diagram relating ω0 , ω1 , ω2 , ω3 under ≼Diff,S on Example 1.)
5 Conclusions
In this paper, we have looked at the problem of making cautious inferences in
weighted logics when weights are interval-valued, and have made first proposals
to make such inferences. There is of course a lot that remains to be done, such
as studying expressivity, representational or computational issues.
It should also be noted that our approach can easily be extended to cases
where weights are given by other uncertainty models. If Ii is an uncertain quan-
tity (modelled by a fuzzy set, a belief function, a probability, . . . ), we would then
need to specify how to propagate them to obtain ag(F ), and how to compare
these uncertain quantities.
References
1. Benferhat, S., Dubois, D., Prade, H.: Possibilistic and standard probabilistic seman-
tics of conditional knowledge bases. J. Logic Comput. 9(6), 873–895 (1999)
2. Benferhat, S., Hué, J., Lagrue, S., Rossit, J.: Interval-based possibilistic logic. In:
Twenty-Second International Joint Conference on Artificial Intelligence (2011)
3. Dubois, D., Godo, L., Prade, H.: Weighted logics for artificial intelligence - an intro-
ductory discussion. Int. J. Approx. Reason. 9(55), 1819–1829 (2014)
4. Dubois, D., Prade, H.: Possibilistic logic: a retrospective and prospective view. Fuzzy
Sets Syst. 144(1), 3–23 (2004)
5. Dupin De Saint-Cyr, F., Lang, J., Schiex, T.: Penalty logic and its link with
Dempster-Shafer theory. In: Uncertainty Proceedings 1994, pp. 204–211. Elsevier
(1994)
6. Gelain, M., Pini, M.S., Rossi, F., Venable, K.B., Wilson, N.: Interval-valued soft
constraint problems. Ann. Math. Artif. Intell. 58(3–4), 261–298 (2010)
7. Kaci, S., van der Torre, L.: Reasoning with various kinds of preferences: logic, non-
monotonicity, and algorithms. Ann. Oper. Res. 163(1), 89–114 (2008)
Preference Elicitation with Uncertainty:
Extending Regret Based Methods
with Belief Functions
1 Introduction
Beyond the choice of a model, the expert also needs to collect or elicit prefer-
ences that are specific to the DM, and that she could not have guessed according
to a priori assumptions. Information regarding preferences that are specific to
the DM can be collected by asking them to answer questions in several form
such as the ranking of a subset of alternatives from best to worst or the choice
of a preferred candidate among a subset of alternatives.
Example 2 (choosing the best course (continued)). In our example, directly ask-
ing for weights would make little sense (as our model may be wrong, and as the
Preference Elicitation with Uncertainty: Extending Regret Based Methods 291
Provided we have made some preference model assumptions (our case here),
it is possible to look for efficient elicitation methods, in the sense that they solve
the decision problem we want to solve in a small enough, if not minimal, number
of questions. A lot of work has been specifically directed towards active elicitation
methods, in which the set of questions to ask the DM is not given in advance
but determined on the fly. In robust methods, this preferential information is
assumed to be given with full certainty which leads to at least two issues. The
first one is that elicitation methods thus do not account for the fact that the
DM might doubt her own answers, and that they might not reflect her actual
preferences. The second one, that is somehow implied by the first one, is that
most robust active elicitation methods will never put the DM in a position
where she could contradict either herself or assumptions made by the expert, as
new questions will be built on the basis that previous answers are correct and
hence should not be doubted. This is especially problematic when inaccurate
preferences are given early on, or when the preference model is based on wrong
assumptions.
This paper presents an extension of the Current Solution Strategy [3] that
includes uncertainty in the answers of the DM by using the framework based on
belief functions presented in [5]. Section 2 will present necessary preliminaries on
both robust preference elicitation based on regret and uncertainty management
based on belief functions. Section 3 will present our extension and some of the
associated theoretical results and guarantees. Finally Sect. 4 will present some
first numerical experiments that were made in order to test the method and its
properties in simulations.
2 Preliminaries
2.1 Formalization
Alternatives and Models: We will denote X the space of possible alternatives,
and X ⊆ X the subset of available alternatives at the disposal of our DM
and about which a recommendation needs to be made. In this paper we will
consider alternatives summarised by q real values corresponding to criteria, hence
X ⊆ Rq . For any x ∈ X and 1 ≤ i ≤ q, we denote by xi ∈ R the evaluation
of alternative x according to criterion i. We also assume that for any x, y ∈ X
292 P.-L. Guillot and S. Destercke
x ≽ω y ⇐⇒ fω (x) ≥ fω (y)    (1)
which means that if the model ω is known, Pω (X) is a total preorder over
X, the set of existing alternatives. Note that Pω (X) can be determined using
the pairwise relations ≽ω . Weighted averages are a key model of preference learning
whose linearity usually allows the development of efficient methods, especially in
regret-based elicitation [2]. It is therefore an ideal starting point to explore other
more complex functions, such as those that are linear in their parameters once
alternatives are known (i.e., Choquet integrals, Ordered weighted averages).
In theory, obtaining a unique true preference model requires both unlimited time
and unbounded cognitive abilities. This means that in practice, the best we can
do is to collect information identifying a subset Ω of possible models, and act
¹ In principle, our methods apply to any value function with the same properties,
but may have to solve computational issues that depend on the specific chosen
hypothesis.
Preference Elicitation with Uncertainty: Extending Regret Based Methods 293
x ≽Ω y ⇐⇒ ∀ω ∈ Ω, fω (x) ≥ fω (y).    (2)
The research question we address here is to find elicitation strategies that reduce
Ω as quickly as possible, obtaining at the limit an order PΩ (X) having only
one maximal element2 . In practice, one may have to stop collecting information
before that point, explaining the need for heuristic indicators of the fitness of
competing alternatives as potential choices.
From this regret and a set Ω′ of possible models, we can then define the pairwise
max regret as
PMR(x, y, Ω′ ) = max_{ω∈Ω′} (fω (y) − fω (x)),    (4)
that corresponds to the maximum possible regret of choosing x over y for any
model in Ω′ . The max regret for an alternative x, defined as
MR(x, Ω′ ) = max_{y∈X} PMR(x, y, Ω′ ),    (5)
then corresponds to the worst possible regret one can have when choosing x.
Finally the min max regret over a subset of models Ω′ is
mMR(Ω′ ) = min_{x∈X} MR(x, Ω′ ) = min_{x∈X} max_{y∈X} max_{ω∈Ω′} (fω (y) − fω (x)).    (6)
Picking as choice x∗ = arg min_{x∈X} MR(x, Ω′ ) is then a robust choice, in the sense
that it is the one giving the minimal regret in a worst-case scenario (the one
leading to max regret).
Example 3 (choosing the best course (continued)). Let X = [0, 10]3 be the set of
valid alternatives composed of 3 grades from 0 to 10 in respectively pedagogy,
usefulness and interest. Let X = {x1 , x2 , x3 , x4 } be the set of available alterna-
tives in which x1 corresponds to the Machine learning course, x2 corresponds
² Or in some cases a maximal set {x1 , . . . , xp } of equally preferred elements s.t.
x1 ≃ . . . ≃ xp .
Table 2. Values of PMR(x, y, Ω) (rows: x, columns: y) and MR(x, Ω):

        y = x1   y = x2   y = x3   y = x4   MR
  x1    0        4        3.5      0.5      4
  x2    8        0        4        4        8
  x3    4.5      0.5      0        0.5      4.5
  x4    7.5      3.5      6        0        7.5
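Once the PMR values are known, the remaining regret indicators reduce to maxima and minima over them; a minimal sketch using the PMR values of Table 2 (the dictionary encoding is ours):

```python
# Pairwise max regrets PMR(x, y, Ω) of Table 2: rows are the chosen
# alternative x, columns the adversary y.
pmr = {
    "x1": {"x1": 0.0, "x2": 4.0, "x3": 3.5, "x4": 0.5},
    "x2": {"x1": 8.0, "x2": 0.0, "x3": 4.0, "x4": 4.0},
    "x3": {"x1": 4.5, "x2": 0.5, "x3": 0.0, "x4": 0.5},
    "x4": {"x1": 7.5, "x2": 3.5, "x3": 6.0, "x4": 0.0},
}

def MR(x):
    """MR(x, Ω): worst regret of choosing x, over all adversaries y."""
    return max(pmr[x].values())

def mMR():
    """mMR(Ω): the minimax regret, attained by the robust choice x*."""
    return min(MR(x) for x in pmr)

x_star = min(pmr, key=MR)  # the robust recommendation
```

On these values, the robust choice is x1 with a minimax regret of 4, consistent with the example.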
Regret indicators are also helpful for making the elicitation strategy efficient
and helping the expert ask relevant questions to the DM. Let Ω′ and Ω′′ be
two sets of models such that mMR(Ω′ ) < mMR(Ω′′ ). In the worst case, we are
certain that x∗Ω′ , the optimal choice for Ω′ , is less regretted than x∗Ω′′ , the optimal
choice for Ω′′ , which means that we would rather have Ω′ be our set of models
than Ω′′ . Let I, I′ be two pieces of preferential information and ΩI , ΩI′ the sets
obtained by integrating this information. Finding which of the two is the most
helpful statement in the progress towards a robust choice can therefore be done
by comparing mMR(ΩI ) and mMR(ΩI′ ). An optimal elicitation process (w.r.t.
minimax regret) would then choose the question for which the worst possible
answer gives us a restriction on Ω that is the most helpful in providing a robust
choice. However, computing such a question can be difficult, and the heuristic
we present next aims at picking a nearly optimal question in an efficient and
tractable way.
The Current Solution Strategy: Let us assume that Ω is the subset of deci-
sion models that is consistent with every piece of information available so far to
the expert. Let us restrict ourselves to questions that consist in comparing pairs
x, y of alternatives in X. The DM can only answer with I1 = “x ≽ y” or
I2 = “y ≽ x”. A pair helpful in finding a robust solution as fast as possible can
be computed as a solution to the following optimization problem, which consists
in finding the pair minimizing the worst-case min max regret:
min_{(x,y)∈X²} WmMR({x, y}) = min_{(x,y)∈X²} max ( mMR(Ω ∩ Ω_{x≽y} ), mMR(Ω ∩ Ω_{y≽x} ) )    (7)
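A rough sketch of criterion (7) can be obtained by discretising Ω into a finite grid of weight vectors (the alternatives, grid step and names below are ours, for illustration only; the actual computation works over the continuous Ω):

```python
import itertools

# Alternatives summarised by q = 3 criteria (illustrative values, ours).
X = {"x1": (9, 2, 8), "x2": (3, 8, 6), "x3": (7, 4, 5), "x4": (2, 9, 1)}

# Ω approximated by a grid on the weight simplex.
step = 0.05
models = [(w1, w2, 1 - w1 - w2)
          for w1 in [i * step for i in range(21)]
          for w2 in [i * step for i in range(21)]
          if w1 + w2 <= 1 + 1e-9]

def f(w, x):
    """Weighted-sum value of alternative x under model w."""
    return sum(wi * xi for wi, xi in zip(w, x))

def mMR(omega):
    """Minimax regret over a finite set of models omega."""
    if not omega:  # an impossible answer leaves no regret in this sketch
        return 0.0
    return min(max(f(w, X[y]) - f(w, X[x]) for y in X for w in omega)
               for x in X)

def wmMR(x, y, omega):
    """Worst-case minimax regret over the two possible answers to 'x vs y'."""
    o_xy = [w for w in omega if f(w, X[x]) >= f(w, X[y])]
    o_yx = [w for w in omega if f(w, X[y]) >= f(w, X[x])]
    return max(mMR(o_xy), mMR(o_yx))

pairs = list(itertools.combinations(X, 2))
best_pair = min(pairs, key=lambda p: wmMR(p[0], p[1], models))
```

Since either answer restricts the model set, the worst-case minimax regret of any question never exceeds the current mMR, which is what makes the criterion a sensible question selector.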
Example 4 (Choosing the best course (continued).). Using the same example,
according to Table 2, we have mMR(Ω) = MR(x1 , Ω) = PMR(x1 , x2 , Ω), mean-
ing that x1 is the least regretted alternative in the worst case and x2 is the one it
is most regretted to. The CSS heuristic consists in asking the DM to compare x1
and x2 , respectively the Machine learning course and the Optimization course.
Two key assumptions behind the methods we just described are that (1) the
initial chosen set Ω of models can perfectly describe the DM’s choices and (2)
the DM is an oracle, in the sense that any answer she provides truly reflects
her preferences, no matter how difficult the question. This certainly makes CSS
an efficient strategy, but also an unrealistic one. This means in particular that
if the DM makes a mistake, we will just pursue with this mistake all along the
process and will never question what was said before, possibly ending up with
sub-optimal recommendations.
questions will only ever restrict Ω, and the expert will never get quite close to
modelling ω∗ .
A similar point could be made if ω ∗ , the model according to which the DM
makes her decision, does not even belong to Ω the set of weighted sums models
that the expert chose.
(Figure omitted: the set Ω of weighted-sum models in the (ω¹ , ω² ) plane, with
the true model ω∗ and the restricted set Ω′ .)
Such mass assignments are usually called simple support [13] and represent ele-
mentary pieces of uncertain information. A confidence level of 0 will correspond
to a vacuous knowledge about the true model ω ∗ , and will in no way imply that
the answer is wrong (as would have been the case in a purely probabilistic frame-
work). A confidence level of 1 will correspond to the case of certainty putting a
hard constraint on the subset of models to consider.
Remark 1. Note that values of α do not necessarily need to come from the DM,
but can just be chosen by the analyst (in the simplest case as a constant) to
weaken the assumptions of classical models. We will see in the experiments of
Sect. 4 that such a strategy may indeed lead to interesting behaviours, without
requiring the DM to provide confidence degrees if she thinks the task is too
difficult, or if the analyst thinks such self-assessed confidence is meaningless.
m0 = mΩ ,   mk = mk−1 +∩ m^{Ωk}_{αk} .    (8)
This rule, also known as TBM conjunctive rule, is meant to combine distinct
pieces of information. It is central to the Transferable Belief Model, that intends
to justify belief functions without using probabilistic arguments [14].
Note that an information fusion setting and the interpretation of the TBM
fit our problem particularly well as it assumes the existence of a unique true
model ω ∗ underlying the DM’s decision process, that might or might not be in
our predefined set of models Ω. Allowing for an open world is a key feature of
the framework. Let us nevertheless recall that non-normalized Dempster’s rule
+∩ can also be justified without resorting to the TBM [8,9,11].
In our case, this independence of sources associated with two mass assign-
ments m^{Ωi}_{αi} and m^{Ωj}_{αj} means that even though both pieces of preferential
information account for preferences of the same DM, the answer a DM gives to the ith
question does not directly impact the answer she gives to the jth question: she
would have answered the same thing had her ith answer been different for some
reason. This seems reasonable, as we do not expect the DM to have a clear intu-
ition about the consequences of her answers over the set of models, nor to even
be aware that such a set – or axioms underlying it – exists. One must however
be careful not to ask the exact same question twice in a short time range.
Since the masses to be combined are all possibility distributions, an alternative to
assuming independence would be to assume complete dependence, simply using
the minimum rule [6], which among other consequences would imply a loss of
expressivity³ but a gain in computation⁴.
As said before, one of the key interests of using this rule (rather than its
normalised version) is to allow m(∅) > 0, notably to detect either mistakes in
the DM's answers (considered as an unreliable source) or a bad choice of model
(under an open world assumption). Determining where the conflict mainly comes
from, and acting upon it, will be the topic of future works. Note that in the specific
case of simple support functions, we have the following result:
Proposition 1. If the m^{Ωk}_{αk} are simple support functions combined through
Dempster's rule, then
m(∅) = 0 ⇔ ∃ω, Pl({ω}) = 1,
with Pl({ω}) = Σ_{E⊆Ω, ω∈E} m(E) the plausibility measure of model ω.
Proof (Sketch). The ⇐ part is obvious given the properties of the plausibility
measure. The ⇒ part follows from the fact that if m(∅) = 0, then all focal elements
are supersets of ∩_{i∈{1,...,k}} Ωi , hence all contain at least one common element.
This in particular shows that m(∅) can, in this specific case, be used as an
estimate of the logical consistency of the provided information pieces.
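The unnormalised conjunctive combination of simple support functions is easy to reproduce on a toy frame (the three "atoms" below are ours; with confidence degrees 0.7 and 0.6 on two disjoint answer sets, one recovers the mass assignment shown later in Example 7):

```python
from itertools import product

# Focal sets are frozensets of "atomic" model classes; OMEGA is the frame.
OMEGA = frozenset({"a", "b", "c"})

def simple_support(focal, alpha):
    """Mass alpha on `focal` (a strict subset of OMEGA), 1 - alpha on OMEGA."""
    return {focal: alpha, OMEGA: 1 - alpha}

def conj(m1, m2):
    """TBM conjunctive rule (8): intersect focal sets, multiply masses."""
    out = {}
    for (f1, v1), (f2, v2) in product(m1.items(), m2.items()):
        inter = f1 & f2
        out[inter] = out.get(inter, 0.0) + v1 * v2
    return out

def plausibility(m, w):
    """Pl({w}): sum of the masses of focal sets containing w."""
    return sum(v for f, v in m.items() if w in f)

# Two answers whose consistent model sets are disjoint, as in Example 7:
m = conj(simple_support(frozenset({"a"}), 0.7),   # first answer, alpha = 0.7
         simple_support(frozenset({"b"}), 0.6))   # second answer, alpha = 0.6
```

When the two focal sets are disjoint, some mass falls on the empty set and no model is fully plausible; when they are consistent, m(∅) = 0 and some model keeps plausibility 1, as Proposition 1 states.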
3
For instance, no new values of confidence would be created when using a finite set
{α1 , . . . , αM } for elicitation.
4
The number of focal sets increasing only linearly with the number of information
pieces.
mk (ΩI ) = 1. Combining a set I1 , . . . , Ik of such certain pieces of information will
end up in the combined mass
mk ( ∩_{i∈{1,...,k}} Ωi ) = 1,
which is simply the intersection of all provided constraints, that may turn up
either empty or non-empty, meaning that inconsistency will be a Boolean
notion, i.e.,
mk (∅) = 1 if ∩_{i∈{1,...,k}} Ωi = ∅, and mk (∅) = 0 otherwise.
Recall that in the usual CSS or minimax regret strategies, such a situation can
never happen.
We now present our proposed extension of the Current Solution Strategy inte-
grating confidence degrees and uncertain answers. Note that in the two first
Sects. 3.1 and 3.2, we assume that the mass on the empty set is null in order
to parallel our approach with the usual one not including uncertainties. We will
then consider the problem of conflict in Sect. 3.3.
and we can easily see that in the case of certain answers (α = 1), we do have
EPMR(x, y, mk ) = PMR( x, y, ∩_{i∈{1,...,k}} Ωi )    (10)
hence formally extending Eq. (4). When interpreting m(Ω′ ) as the probability
that ω∗ belongs to Ω′ , EPMR could be seen as an expectation of PMR when
randomly picking a set in 2^Ω .
However EMR seems to be a better option to assess the max regret of an alter-
native, as under the assumption that the true model ω∗ is within the focal set
Ω′ , it makes more sense to compare x to its worst opponent within Ω′ , which
may well be different for two different focal sets. Indeed, if ω∗ the true model
does in fact belong to Ω′ , decision x is only as bad as how big the regret can get
for any adversarial counterpart yΩ′ ∈ X.
Extending mMR: we propose to extend it as
mEMR(m) = min_{x∈X} EMR(x, m),
which minimizes for each x ∈ X the expectation of the max regret, and is differ-
ent from the expectation of the minimal max regret for whichever alternative
x is optimal, described by EmMR(m) = Σ_{Ω′} m(Ω′ ) min_{x∈X} MR(x, Ω′ ). Again,
these two options with certain answers boil down to mMR, as we have
mEMR(m) = EmMR(m) = mMR( ∩_{i∈{1,...,k}} Ωi ).    (14)
The problem with EmMR is that it would allow for multiple possible best alter-
natives, leaving us with an unclear answer as to what the best choice option is,
arg min EmMR not being defined. It indicates how robust, in the sense of regret,
we expect any best answer to the choice problem to be, assuming there can be
an optimal alternative for each focal set. In contrast, mEMR minimizes the max
regret while restricting the optimal alternative x to be the same in all of them,
hence providing a unique argument and allowing our recommendation system
and elicitation strategy to give an optimal recommendation.
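As a numerical sketch (ours), the evidential indicators can be computed directly from per-focal-set PMR tables, here using the values of the running example as we read them from Tables 2 and 3, with mass 0.7 on Ω ∩ Ω1 and 0.3 on Ω:

```python
# EMR averages, over the focal sets of m, the max regret of x within each
# focal set; mEMR then keeps the same x across all focal sets.
ALTS = ["x1", "x2", "x3", "x4"]

pmr_omega = {            # PMR(x, y, Ω), Table 2
    "x1": [0, 4, 3.5, 0.5],
    "x2": [8, 0, 4, 4],
    "x3": [4.5, 0.5, 0, 0.5],
    "x4": [7.5, 3.5, 6, 0],
}
pmr_omega1 = {           # PMR(x, y, Ω ∩ Ω1), after the answer x2 ≽ x1
    "x1": [0, 4, 3.5, 0.5],
    "x2": [0, 0, 53 / 38, -1],
    "x3": [-5 / 6, 0.5, 0, -11 / 6],
    "x4": [109 / 38, 3.5, 81 / 19, 0],
}
mass = [(0.7, pmr_omega1), (0.3, pmr_omega)]  # m1(Ω ∩ Ω1) = 0.7, m1(Ω) = 0.3

def EMR(x):
    """EMR(x, m): mass-weighted max regret of x over the focal sets."""
    return sum(w * max(pmr[x]) for w, pmr in mass)

def mEMR():
    return min(EMR(x) for x in ALTS)

x_star = min(ALTS, key=EMR)  # unique recommendation of mEMR
```

Here the recommendation switches to x3, whose evidential max regret mixes a small regret on Ω ∩ Ω1 with the larger one on Ω.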
Finally, recommend x∗ = arg mEMR(mk ). Thanks to Eqs. (10), (12) and (14),
it is easy to see that we retrieve CSS as the special case in which all answers are
completely certain.
Example 6. Starting with the initial mass function m0 such that m0 (Ω) = 1, the
choice of CSS coincides with the choice of ECSS (all evidence we have is com-
mitted to ω ∈ Ω). With the values of PMR reported in Table 2, the alterna-
tives the DM is asked to compare are x1 , the least regretted alternative, and x2 ,
its most regretted counterpart. In accordance with her true preference model
ω∗ = (0.1, 0.8, 0.1), the DM states that x2 ≽ x1 , i.e., she prefers the Optimiza-
tion course over the Machine learning course, with confidence degree α = 0.7.
Let Ω1 be the set of WS models in which x2 can be preferred to x1 , which in
symmetry with Example 5 can be defined as Ω1 = {ω ∈ Ω : ω² ≥ 2/3 − (5/24) ω¹ },
as represented in Fig. 2.
(Fig. 2, figure omitted: the set Ω1 within Ω in the (ω¹ , ω² ) plane, with the true
model ω∗ .)
Table 3. Values of PMR(x, y, Ω ∩ Ω1 ) (rows: x, columns: y) and MR(x, Ω ∩ Ω1 ):

        y = x1           y = x2   y = x3           y = x4           MR
  x1    0                4        3.5              0.5              4
  x2    0                0        53/38 ≈ 1.39     −1               53/38 ≈ 1.39
  x3    −5/6 ≈ −0.83     0.5      0                −11/6 ≈ −1.83    0.5
  x4    109/38 ≈ 2.87    3.5      81/19 ≈ 4.26     0                81/19 ≈ 4.26
Table 4. Values of EPMR(x, y, m1 ) (rows: x, columns: y) and EMR(x, m1 ):

        y = x1            y = x2   y = x3            y = x4            EMR
  x1    0                 4        3.5               0.5               4
  x2    2.4               0        827/380 ≈ 2.18    0.5               1283/380 ≈ 3.38
  x3    23/30 ≈ 0.77      0.5      0                 −68/60 ≈ −1.13    1.7
  x4    809/190 ≈ 4.26    3.5      909/190 ≈ 4.78    0                 5.23
Proposition 2. Let mk−1 and m^{Ωk}_{αk} be two mass functions on Ω issued from
ECSS, such that mk (∅) = (mk−1 +∩ m^{Ωk}_{αk} )(∅) = 0. Then:
1. EPMR(x, y, mk ) ≤ EPMR(x, y, mk−1 );
2. EMR(x, mk ) ≤ EMR(x, mk−1 );
3. mEMR(mk ) ≤ mEMR(mk−1 ).
Proof (sketch). The two first items are simply due to the combined facts that
on one hand we know [15] that applying +∩ means that mk is a specialisation
of mk−1 , and on the other hand that for any Ω′′ ⊆ Ω′ we have f (x, y, Ω′′ ) ≤
f (x, y, Ω′ ) for any f ∈ {PMR, MR}. The third item is implied by the second, as
it consists in taking a minimum over a set of values of EMR that are all smaller.
Note that the above argument applies to any combination rule producing
a specialisation of the two combined masses, including possibilistic minimum
rule [6], Denoeux’s family of w-based rules [4], etc. We can also show that the
evidential approach, if we provide it with questions computed through CSS, is
actually more cautious than CSS:
Proposition 3. Consider the subsets of models Ω1 , . . . , Ωk issued from the
answers of the CSS strategy, and some values α1 , . . . , αk provided a posteri-
ori by the DM. Let mk−1 and m^{Ωk}_{αk} be two mass functions issued from ECSS on
Ω such that mk (∅) = 0. Then we have
1. EPMR(x, y, mk ) ≥ PMR(x, y, ∩_{i∈{1,...,k}} Ωi );
2. EMR(x, mk ) ≥ MR(x, ∩_{i∈{1,...,k}} Ωi );
3. mEMR(mk ) ≥ mMR( ∩_{i∈{1,...,k}} Ωi ).
Proof (sketch). The first two items are due to the combined facts that on one
hand all focal elements are supersets of ∩_{i∈{1,...,k}} Ωi , and on the other hand
that for any Ω′′ ⊆ Ω′ we have f (x, y, Ω′′ ) ≤ f (x, y, Ω′ ) for any f ∈ {PMR, MR}.
Any value of EPMR or EMR is a weighted average over terms all greater than
their robust counterpart on ∩_{i∈{1,...,k}} Ωi , and is therefore greater itself. The
third item is implied by the second, as the biggest value of EMR is thus necessarily
bigger than all the values of MR.
This simply shows that, if anything, our method is even more cautious than
CSS. It is in that sense probably slightly too cautious in an idealized scenario –
especially as unlike robust indicators our evidential extensions will never reach
0 – but provides guarantees that are at least as strong.
While we find the two first properties appealing, one goal of including uncer-
tainties in the DM answers is to relax the third property, whose underlying
assumptions (perfectness of the DM and of the chosen model) are quite strong.
In Sects. 3.3 and 4, we show that ECSS indeed satisfies this requirement, respec-
tively on an example and in experiments.
The following example simply demonstrates that, in practice, ECSS can lead
to questions that are possibly conflicting with each other, a feature CSS does
not have. This conflict is only a possibility: no conflict will appear should the
DM provide answers completely consistent with the set of models and what she
previously stated, and in that case at least one model will be fully plausible⁵
(see Proposition 1).
Example 7 (Choosing the best course (continued)). At step 2 of our example, the
DM is asked to compare x1 to x3 in accordance with Table 4. Even though it
conflicts with ω∗, the model underlying her decision, the DM has the option to
state that x1 ≽ x3 with confidence degree α > 0, putting weight on Ω2, the set
of consistent models defined by Ω2 = {ω ∈ Ω : Σ_{i=1}^{q} ω^i (x1^i − x3^i) ≥ 0} = {ω ∈
Ω : ω^2 ≤ 16/9 − (8/3) ω^1}. However, as represented in Fig. 3, Ω1 ∩ Ω2 = ∅.
[Fig. 3: the sets Ω1 and Ω2 represented in the (ω^1, ω^2) model space, together
with the resulting combined mass function
m2 : Ω → 0.12, Ω2 → 0.18, Ω1 → 0.28, ∅ → 0.42.]
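The combined mass m2 above can be reproduced with the unnormalised conjunctive rule. The sketch below is illustrative, not the authors' code; it uses abstract labels for the two disjoint answer sets and assumes, consistently with the figure, the input masses {Ω1 : 0.7, Ω : 0.3} and {Ω2 : 0.6, Ω : 0.4}:

```python
from itertools import product

def conjunctive(m1, m2):
    """Unnormalised conjunctive combination: intersect focal sets and
    multiply masses; conflict accumulates on the empty set."""
    out = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        c = a & b
        out[c] = out.get(c, 0.0) + wa * wb
    return out

# Abstract labels: only the fact that Omega_1 and Omega_2 are disjoint
# matters here, not their geometry in Example 7.
OMEGA = frozenset({"w1", "w2"})   # the whole model space
O1 = frozenset({"w1"})            # stands for Omega_1
O2 = frozenset({"w2"})            # stands for Omega_2, disjoint from Omega_1

m_a1 = {O1: 0.7, OMEGA: 0.3}      # answer 1, held with confidence 0.7
m_a2 = {O2: 0.6, OMEGA: 0.4}      # answer 2, held with confidence 0.6

m2 = conjunctive(m_a1, m_a2)
# m2: OMEGA -> 0.12, O2 -> 0.18, O1 -> 0.28, empty set -> 0.42
```

The conflict mass m2(∅) = 0.7 × 0.6 = 0.42 comes precisely from the pair of focal sets with empty intersection.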
Hence, computing PMR on the partition is sufficient to retrieve the global PMR
through a simple max. Let us now show that, in our case, the size of this partition
only increases polynomially. Let Ω1, . . . , Ωn be the sets of models consistent
with, respectively, the first to the nth answer, and Ω1^C, . . . , Ωn^C their respective
complements in Ω.
Due to the nature of the conjunctive rule ∩, every focal set Ω' of m^k =
m^{Ω1}_{α1} ∩ . . . ∩ m^{Ωn}_{αn} is the union of elements of the partition PΩ
induced by the sets Ωi and their complements.
This means that each Ω''s PMR can be computed using the PMRs of its
corresponding partition elements. This still does not help much, as there is a total of 2^n
possible values of Ω̃k. Yet, in the case of convex domains cut by linear constraints,
which holds for the weighted sum, the following theorem shows that the total
number of elementary subsets in Ω only increases polynomially.
This means that at most Λnq of the FUH subsets have a non-empty intersection
with E.
In the above theorem (the proof of which can be found in [12], or in [10] for
the specific case of E ⊂ R³), the subsets BH are equivalent to the Ω̃k, whose
number only grows according to a polynomial whose degree increases with q.
Computing Rω(x, y) for all ω ∈ E, x, y ∈ X can also be done once for each
ω ∈ E, and need not be repeated in each subset Ω' s.t. ω ∈ E_{Ω'}. These results
indicate that when q (the model-space dimension) is reasonably low and questions
correspond to cutting hyper-planes over a convex set, ECSS can be performed
efficiently. This will be the case for several models such as OWA or k-additive
Choquet integrals with low k, but not for others such as full Choquet integrals,
whose dimension for k criteria is 2^k − 2. In these cases, it seems inevitable
that one would resort to approximations having a low numerical impact (e.g.,
merging or forgetting focal elements having a very low mass value).
4 Experiments
To test our strategy and its properties, we ran simulated experiments,
in which the confidence degree was always constant. Such experiments therefore
also show what would happen if we did not ask the DM for confidence degrees,
but nevertheless assumed that she could make mistakes, with a very simple noise
model.
The first experiment, reported in Fig. 4, compares the extra cautiousness of
EMR with that of MR. To do so, simulations were made for several fixed
degrees of confidence – including 1, in which case EMR coincides with MR – in
which a virtual DM states her preferences with the given degree of confidence,
and the value of EMR at each step is divided by the initial value so as to observe
its evolution. These EMR ratios were then averaged over 100 simulations for
each degree. Results show that while high confidence degrees have a limited
impact, low confidence degrees (< 0.7) may greatly slow down convergence.
[Fig. 4 plot: the ratio of the current to the initial min max regret (MEMR/Initial
MMR) against the number of questions, for α ∈ {0, 0.3, 0.5, 0.7, 0.8, 0.9, 1}.]
Fig. 4. Average evolution of min max regret with various degrees of confidence
The second experiment, reported in Fig. 5, aims at determining whether ECSS
and CSS truly generate different question strategies. To do so, we monitored the two
strategies for a given confidence degree, and identified the first step k at which
the two questions differ. Those values were averaged over 300 simulations
for several confidence degrees. Results show that even for a high confidence
degree (α = 0.9) it takes on average only 3 questions to see a difference. This
shows that the methods are truly different in practice.
[Fig. 5 plot: average number of questions until the first difference (between
roughly 2.0 and 3.2) against the confidence degree (0 to 1).]
Fig. 5. Average position of the first differing question in the elicitation process, for
various degrees of confidence
[Fig. 6 plot: average value of m(∅) against the number of questions (0 to 14),
for α ∈ {0.7, 0.9} and the two virtual DMs RAND and WS.]
308 P.-L. Guillot and S. Destercke

The third experiment, reported in Fig. 6, is meant to observe how good m(∅),
our measure of inconsistency, is in practice as an indicator that something is
wrong with the answers given by a DM. In order to do so, simulations were made
in which one of two virtual DMs answers with a fixed confidence degree and
the value of m(∅) is recorded at each step. These values were then averaged over 100
simulations for each confidence degree. The two virtual DMs behaved, respectively,
completely randomly (RAND) and in accordance with a fixed weighted
sum model (WS) with probability α and randomly with probability 1 − α. So
the first one is highly inconsistent with our model assumption, while the second
is consistent with this assumption but makes mistakes.
Results are quite encouraging: the inconsistency of the random DM with the
model assumption is quickly identified, especially for high confidence degrees.
For the DM that follows our model assumptions but makes mistakes, the results
are similar, except that the conflict increase is not especially higher
for lower confidence degrees. This can easily be explained: in the case of low
confidence degrees, we have more mistakes but those are assigned a lower weight,
while in the case of high confidence degrees the occasional mistake is quite impactful,
as it has a high weight.
5 Conclusion
References
1. Benabbou, N., Gonzales, C., Perny, P., Viappiani, P.: Incremental elicitation of
choquet capacities for multicriteria choice, ranking and sorting problems. Artif.
Intell. 246, 152–180 (2017)
2. Benabbou, N., Gonzales, C., Perny, P., Viappiani, P.: Minimax regret approaches
for preference elicitation with rank-dependent aggregators. EURO J. Decis. Pro-
cesses 3(1–2), 29–64 (2015)
3. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Constraint-based opti-
mization and utility elicitation using the minimax decision criterion. Artif. Intell.
170(8–9), 686–713 (2006)
4. Denœux, T.: Conjunctive and disjunctive combination of belief functions induced
by nondistinct bodies of evidence. Artif. Intell. 172(2–3), 234–264 (2008)
1 Introduction
Alternatively, software agents can have access to online data as well as share
data with other agents.
The efficacy of combining these two types of evidence in multi-agent systems
has been studied from a number of different perspectives. In social epistemology,
it has been argued [6] that agent-to-agent communication has an important role
to play in propagating locally held information widely across a population. For
example, interaction between scientists facilitates the sharing of experimental
evidence. Simulation results are then presented which show that a combination
of direct evidence and agent interaction, within the Hegselmann-Krause opinion
dynamics model [10], results in faster convergence to the true state than updat-
ing based solely on direct evidence. A probabilistic model combining Bayesian
updating and probability pooling of beliefs in an agent-based system has been
proposed in [13]. In this context it is shown that combining updating and pooling
leads to faster convergence and better consensus than Bayesian updating alone.
An alternative methodology exploits three-valued logic to combine both types
of evidence [2] and has been effectively applied to distributed decision-making
in swarm robotics [3].
In this current study we exploit the capacity of Dempster-Shafer theory
(DST) to fuse conflicting evidence in order to investigate how direct evidence
can be combined with a process of iterative belief aggregation in the context
of the best-of-n problem. The latter refers to a general class of problems in
distributed decision-making [16,22] in which a population of agents must col-
lectively identify which of n alternatives is the correct, or best, choice. These
alternatives could correspond to physical locations as, for example, in a search
and rescue scenario, different possible states of the world, or different decision-
making or control strategies. Agents receive direct but limited feedback in the
form of quality values associated with each choice, which then influence their
beliefs when combined with those of other agents with whom they interact. It is
not our intention to develop new operators in DST nor to study the axiomatic
properties of particular operators at the local level (see [7] for an overview of
such properties). Instead, our motivation is to study the macro-level conver-
gence properties of several established operators when applied iteratively by a
population of agents, over long timescales, and in conjunction with a process of
evidential updating, i.e., updating beliefs based on evidence.
An outline of the remainder of the paper is as follows. In Sect. 2 we give a
brief introduction to the relevant concepts from DST and summarise its previous
application to dynamic belief revision in agent-based systems. Section 3 intro-
duces a version of the best-of-n problem exploiting DST measures and combi-
nation operators. In Sect. 4 we then give the fixed point analysis of a dynamical
system employing DST operators so as to provide insight into the convergence
properties of such systems. In Sect. 5 we present the results from a number of
agent-based simulation experiments carried out to investigate consensus forma-
tion in the best-of-n problem under varying rates of evidence and levels of noise.
Finally, Sect. 6 concludes with some discussion.
312 M. Crosscombe et al.
and hence where Pl(A) = 1 − Bel(A^c).
A number of operators have been proposed in DST for combining or fusing
mass functions [20]. In this paper we will compare in a dynamic multi-agent set-
ting the following operators: Dempster’s rule of combination (DR) [19], Dubois
& Prade’s operator (D&P) [8], Yager’s rule (YR) [25], and a simple averaging
operator (AVG). The first three operators all make the assumption of inde-
pendence between the sources of the evidence to be combined but then employ
different techniques for dealing with the resulting inconsistency. DR uniformly
reallocates the mass associated with non-intersecting pairs of sets to the overlap-
ping pairs, D&P does not re-normalise in such cases but instead takes the union
of the two sets, while YR reallocates all inconsistent mass values to the universal
set S. These four operators were chosen based on several factors: the operators
are well established and have been well studied, they require no additional infor-
mation about individual agents, and they are computationally efficient at scale
(within the limits of DST).
Definition 2. Combination operators
Let m1 and m2 be mass functions on 2^S. Then the combined mass function
m1 ⊙ m2 is a function m1 ⊙ m2 : 2^S → [0, 1] such that, for ∅ ≠ A, B, C ⊆ S:

(DR)  m1 ⊙ m2 (C) = (1 / (1 − K)) Σ_{A∩B=C≠∅} m1(A) · m2(B),
      where K = Σ_{A∩B=∅} m1(A) · m2(B) is the mass assigned to inconsistent pairs;

(D&P) m1 ⊙ m2 (C) = Σ_{A∩B=C≠∅} m1(A) · m2(B) + Σ_{A∩B=∅, A∪B=C} m1(A) · m2(B);

(YR)  m1 ⊙ m2 (C) = Σ_{A∩B=C≠∅} m1(A) · m2(B) for C ≠ S, and
      m1 ⊙ m2 (S) = m1(S) · m2(S) + K;

(AVG) m1 ⊙ m2 (C) = (m1(C) + m2(C)) / 2.
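For illustration, the four operators can be sketched as follows, with mass functions represented as dicts from frozensets to masses (an illustrative sketch, not an implementation from the paper):

```python
from itertools import product

def combine(m1, m2, rule, S):
    """Combine two mass functions on 2^S (dicts frozenset -> mass)
    with one of the four operators of Definition 2."""
    conj, conflict, union_part = {}, 0.0, {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        w = wa * wb
        c = a & b
        if c:
            conj[c] = conj.get(c, 0.0) + w
        else:                              # inconsistent pair
            conflict += w
            u = a | b
            union_part[u] = union_part.get(u, 0.0) + w
    if rule == "DR":                       # renormalise the conflict K away
        if conflict == 1.0:
            raise ValueError("total conflict: K = 1")
        return {c: w / (1 - conflict) for c, w in conj.items()}
    if rule == "D&P":                      # conflict moved to the unions
        out = dict(conj)
        for u, w in union_part.items():
            out[u] = out.get(u, 0.0) + w
        return out
    if rule == "YR":                       # conflict moved to S
        out = dict(conj)
        out[S] = out.get(S, 0.0) + conflict
        return out
    if rule == "AVG":                      # simple mixture of the two masses
        keys = set(m1) | set(m2)
        return {c: 0.5 * (m1.get(c, 0.0) + m2.get(c, 0.0)) for c in keys}
    raise ValueError(rule)
```

For example, combining m1 = {{s1} : 0.6, S : 0.4} and m2 = {{s2} : 0.5, S : 0.5} over S = {s1, s2, s3} gives a conflict K = 0.3, which DR renormalises away, D&P moves to {s1, s2}, and YR moves to S.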
Here we present a formulation of the best-of-n problem within the DST frame-
work. We take the n choices to be the states S. Each state si ∈ S is assumed to
have an associated quality value qi ∈ [0, 1] with 0 and 1 corresponding to min-
imal and maximal quality, respectively. Alternatively, we might interpret qi as
quantifying the level of available evidence that si corresponds to the true state
of the world.
In the best-of-n problem agents explore their environment and interact with
each other with the aim of identifying which is the highest quality (or true)
state. Agents sample states and receive evidence in the form of the quality qi ,
so that in the current context evidence Ei regarding state si takes the form of
the following mass function:
mEi = {si} : qi, S : 1 − qi.
Hence, qi is taken as quantifying both the evidence directly in favour of si provided
by Ei, and also the evidence directly against any other state sj for j ≠ i.
Given evidence Ei an agent updates its belief by combining its current mass
function m with mEi using a combination operator, so as to obtain the new mass
function given by m ⊙ mEi.
A summary of the process by which an agent might obtain direct evidence
in this model is then as follows. Based on its current mass function m, an agent
stochastically selects a state si ∈ S to investigate1 , according to the pignistic
probability distribution for m as given in Definition 3. More specifically, it will
update m to m ⊙ mEi with probability P(si |m) × r for i = 1, . . . , n and leave
its belief unchanged with probability (1 − r), where r ∈ [0, 1] is a fixed evidence
rate quantifying the probability of finding evidence about the state that it is
currently investigating. In addition, we also allow for the possibility of noise
in the evidential updating process. This is modelled by a random variable ε ∼
N(0, σ²) associated with each quality value. In other words, in the presence of
noise the evidence Ei received by an agent has the form:
mEi = {si} : qi + ε, S : 1 − qi − ε,
where if qi + ε < 0 then it is set to 0, and if qi + ε > 1 then it is set to 1. Overall,
the process of updating from direct evidence is governed by the two parameters,
r and σ, quantifying the availability of evidence and the level of associated noise,
respectively.
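A single agent's evidential-updating iteration can then be sketched as follows (illustrative names, not the authors' code; the pignistic distribution is computed in the usual transferable-belief-model way, and the combination operator is passed as a parameter):

```python
import random

def pignistic(m):
    """Pignistic probability BetP(s): split each focal set's mass equally
    among its elements (m assumed normalised, with m(empty set) = 0)."""
    p = {}
    for A, w in m.items():
        for s in A:
            p[s] = p.get(s, 0.0) + w / len(A)
    return p

def evidence_mass(i, q, S, sigma=0.0):
    """Noisy evidence E_i about state i: {s_i}: q_i + eps, S: 1 - q_i - eps,
    with q_i + eps clipped to [0, 1]."""
    qi = min(1.0, max(0.0, q[i] + random.gauss(0.0, sigma)))
    return {frozenset({i}): qi, S: 1.0 - qi}

def update_step(m, q, S, r, sigma, combine):
    """One evidential-updating iteration: pick a state by roulette-wheel
    selection on BetP, then with probability r combine with the evidence."""
    p = pignistic(m)
    states = sorted(p)
    i = random.choices(states, weights=[p[s] for s in states])[0]
    if random.random() < r:
        return combine(m, evidence_mass(i, q, S, sigma))
    return m
```

With r = 0 the belief is left untouched; with σ = 0 the evidence mass is exactly the noiseless one from the text.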
In addition to receiving direct evidence we also include belief combination
between agents in this model. This is conducted in a pairwise symmetric manner
in which two agents are selected at random to combine their beliefs, with both
agents then adopting this combination as their new belief, i.e., if the two agents
have beliefs m1 and m2, respectively, then they both replace these with m1 ⊙ m2.
However, in the case that agents are combining their beliefs under Dempster's
rule and their beliefs are completely inconsistent, i.e., when K = 1 (see
Definition 2), then they do not form consensus and the process moves on to the
next iteration.
In summary, during each iteration both processes of evidential updating and
consensus formation take place2 . However, while every agent in the population
¹ We utilise roulette wheel selection; a proportionate selection process.
² Due to the possibility of rounding errors occurring as a result of the multiplication
of small numbers close to 0, we renormalise the mass function that results from each
process.
Evidence Propagation and Consensus Formation in Noisy Environments 315
has the potential to update its own belief, provided that it successfully receives a
piece of evidence, the consensus formation is restricted to a single pair of agents
for each iteration. That is, we assume that only two agents in the whole popu-
lation are able to communicate and combine their beliefs during each iteration.
(m1^t, . . . , mk^t) → (m1^t ⊙ m2^t, m1^t ⊙ m2^t, m3^t, . . . , mk^t).

The fixed points of this mapping are those for which m1^t = m1^t ⊙ m2^t and m2^t =
m1^t ⊙ m2^t. This requires that m1^t = m2^t, and hence the fixed points of the mapping
are the fixed points of the operator, i.e. those mass functions m for which
m ⊙ m = m.
Let us analyse in detail the fixed points for the case in which there are 3 states
S = {s1 , s2 , s3 }. Let m = {s1 , s2 , s3 } : x7 , {s1 , s2 } : x4 , {s1 , s3 } : x5 , {s2 , s3 } :
x6 , {s1 } : x1 , {s2 } : x2 , {s3 } : x3 represent a general mass function defined on
this state space and where without loss of generality we take x7 = 1 − x1 − x2 −
x3 − x4 − x5 − x6. For Dubois & Prade's operator, the constraint that m ⊙ m = m
generates a set of simultaneous equations whose only stable solutions are {s1} : 1,
{s2} : 1 and {s3} : 1. In other words, the only stable fixed points are those for
which agents’ beliefs are both certain and precise. That is where for some state
si ∈ S, Bel({si }) = P l({si }) = 1. The stable fixed points for Dempster’s rule
and Yager’s rule are also of this form. The averaging operator is idempotent and
all mass functions are unstable fixed points.
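A quick numerical check of these fixed points, using Dubois & Prade's rule as an example (an illustrative sketch, not the authors' code):

```python
from itertools import product

def dubois_prade(m1, m2):
    """Dubois & Prade's rule: conjunctive combination, with the mass of
    each fully conflicting pair moved to the union of the two sets."""
    out = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        c = a & b if a & b else a | b
        out[c] = out.get(c, 0.0) + wa * wb
    return out

s1 = frozenset({"s1"})
certain = {s1: 1.0}
# A certain, precise belief is a stable fixed point: m (combined with) m = m.
assert dubois_prade(certain, certain) == certain

# A mixed belief is not: self-combination moves the conflict onto {s1, s2}.
mixed = {s1: 0.5, frozenset({"s2"}): 0.5}
combined = dubois_prade(mixed, mixed)
assert combined != mixed
```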
The above analysis concerns agent-based systems applying a combination operator
in order to reach consensus. However, we have yet to incorporate evidential
updating into this model. As outlined in Sect. 3, it is proposed that each agent
investigates a particular state si chosen according to its current beliefs using
the pignistic distribution. With probability r this will result in an update of its
beliefs from m to m ⊙ mEi. Hence, for convergence it is also required that agents
only choose to investigate states for which m ⊙ mEi = m. Assuming qi > 0,
there is only one such fixed point, corresponding to m = {si} : 1. Hence,
the consensus driven by belief combination as characterised by the above fixed
point analysis will result in convergence of individual agent beliefs if we also
incorporate evidential updating. That is, an agent with beliefs close to a fixed
point of the operator, i.e., m = {si } : 1, will choose to investigate state si with
very high probability and will therefore tend to be close to a fixed point of the
evidential updating process.
5 Simulation Experiments
In this section we describe experiments conducted to understand the behaviour
of the four belief combination operators in the context of the dynamic multi-
agent best-of-n problem introduced in Sect. 3. We compare their performance
under different evidence rates r, noise levels σ, and their scalability for different
numbers of states n.
Fig. 1. Average Bel({s3 }) plotted against iteration t with r = 0.05 and σ = 0.1.
Comparison of all four operators with error bars displaying the standard deviation.
(a) Low evidence rates r ∈ [0, 0.01]. (b) All evidence rates r ∈ [0, 1].
Fig. 2. Average Bel({s3 }) for evidence rates r ∈ [0, 1]. Comparison of all four operators
both with and without belief combination between agents.
Fig. 3. Standard deviation for different evidence rates r ∈ [0, 0.5]. Comparison of all
four operators both with and without belief combination between agents.
higher evidence rates and for r > 0.3 it converges to average values for Bel({s3 })
of less than 0.8. At r = 1, when every agent is receiving evidence at each time
step, there is failure to reach consensus when applying Dempster’s rule. Indeed,
there is polarisation with the population splitting into separate groups, each cer-
tain that a different state is the best. In contrast, both Dubois & Prade’s operator
and Yager’s rule perform well for higher evidence rates and for all r > 0.02 there
is convergence to an average value of Bel({s3 }) = 1. Meanwhile the averaging
operator appears to perform differently for increasing evidence rates and instead
maintains similar levels of performance for r > 0.1. For all subsequent figures
showing steady state results, we do not include error bars as this impacts nega-
tively on readability. Instead, we show the standard deviation plotted separately
against the evidence rate in Fig. 3. As expected, standard deviation is high for
low evidence rates in which the sparsity of evidence results in different runs of
the simulation converging to different states. This then declines rapidly with
increasing evidence rates.
The dashed lines in Figs. 2a and b show the values of Bel({s3 }) obtained
at steady state when there is only updating based on direct evidence. In
most cases the performance is broadly no better than, and indeed often worse
than, the results which combine evidential updating with belief combination
between agents. For low evidence rates where r < 0.1 the population does not
tend to fully converge to a steady state since there is insufficient evidence avail-
able to allow convergence. For higher evidence rates under Dempster’s rule,
Dubois & Prade’s operator and Yager’s rule, the population eventually converges
on a single state with complete certainty. However, since the average value of
Bel({s3}) in both cases is approximately 0.6 for r > 0.002, convergence is clearly
often not to the best state. The averaging operator is not affected by the
combined updating method and performs the same under evidential updating
alone as it does in conjunction with consensus formation.
Overall, it is clear then that in this formulation of the best-of-n problem
combining both updating from direct evidence and belief combination results in
much better performance than obtained by using evidential updating alone for
all considered operators except the averaging operator.
Fig. 4. Average Bel({s3 }) for all four operators plotted against σ ∈ [0, 0.3] for different
evidence rates r. Left: r = 0.01. Centre: r = 0.05. Right: r = 0.1.
In contrast, for the evidence rates of r = 0.05 and r = 0.1, Fig. 4 (centre) and
(right), respectively, we see that both Dubois & Prade’s operator and Yager’s
rule are the most robust combination operators to increased noise. Specifically,
for r = 0.05 and σ = 0, they both converge to an average value of Bel({s3 }) = 1
and for σ = 0.3 they only decrease to 0.99. On the other hand, the presence
of noise at this evidence rate has a much higher impact on the performance of
Dempster’s rule and the averaging operator. For σ = 0 Dempster’s rule converges
to an average value of Bel({s3 }) = 0.95 but this decreases to 0.78 for σ = 0.3, and
for the averaging operator the average value of Bel({s3 }) = 0.41 and decreases
to 0.29. The contrast between the performance of the operators in the presence
of noise is even greater for the evidence rate r = 0.1 as seen in Fig. 4 (right).
However, both Dubois & Prade’s operator and Yager’s rule differ in this context
since, for both evidence rates r = 0.05 and r = 0.1, their average values of
Bel({s3 }) remain constant at approximately 1.
In the swarm robotics literature most best-of-n studies are for n = 2 (see for
example [17,23]). However, there is a growing interest in studying larger numbers
of choices in this context [3,18]. Indeed, for many distributed decision-making
applications the size of the state space, i.e., the value of n in the best-of-n
problem, will be much larger. Hence, it is important to investigate the scalability
of the proposed DST approach to larger values of n.
Having up to now focused on the n = 3 case, in this section we present
additional simulation results for n = 5 and n = 10. As proposed in Sect. 5.1,
the quality values are allocated so that qi = i/(n + 1) for i = 1, . . . , n. Here, we
only consider Dubois & Prade’s operator and Yager’s rule due to their better
performance when compared with the other two combination operators.
Fig. 5. Average Bel({sn }) for n ∈ {3, 5, 10} plotted against σ for r = 0.05.
Figure 5 shows the average values of Bel({sn }) at steady state plotted against
noise σ ∈ [0, 0.3] for evidence rate r = 0.05, where Bel({sn }) is the belief in the
best state for n = 3, 5 and 10. For Dubois & Prade’s operator, Fig. 5a shows
the steady state values of Bel({s3}) = 1 independent of the noise level, followed
closely by the values of Bel({s5}) = 0.94 at σ = 0 for the n = 5 case. However, for
n = 10 the value of Bel({s10}) is 0.61 when σ = 0, corresponding to a significant
decrease in performance. At the same time, from Fig. 5b, we can see that for
Yager’s rule performance declines much less rapidly with increasing n than for
Dubois & Prade’s operator. So at σ = 0 and n = 5 the average value at steady
state for Yager’s rule is almost the same as for n = 3, i.e. Bel({s5 }) = 0.98, with
a slight decrease in the performance Bel({s10 }) = 0.92 for n = 10. As expected
the performance of both operators decreases as σ increases, with Yager’s rule
being much more robust to noise than Dubois & Prade’s operator for large values
of n.
In this way, the results support only limited scalability for the DST approach
to the best-of-n problem, at least as far as uniquely identifying the best state is
concerned. Furthermore, as n increases so does sensitivity to noise. This reduced
performance may in part be a feature of the way quality values have been allocated.
Notice that as n increases, the difference between successive quality values
qi+1 − qi = 1/(n + 1) decreases, and a given noise standard deviation σ results
in a more inaccurate ordering of the quality values the closer those values are to
each other. Both effects are likely to make it difficult for a population of agents
to distinguish between the best state and those which have increasingly similar
quality values.
Further work will investigate the issue of scalability in more detail, including
whether alternatives to the updating process may be applicable in a DST model,
such as that of negative updating in swarm robotics [14]. We must also consider
the increasing computational cost of DST as the size of the state space increases
and investigate other representations such as possibility theory [9] as a means
of avoiding exponential increases in the cost of storing and combining mass
functions. Finally, we hope to adapt our method to be applied to a network, as
opposed to a complete graph, so as to study the effects of limited or constrained
communications on convergence.
References
1. Cho, J.H., Swami, A.: Dynamics of uncertain opinions in social networks. In: 2014
IEEE Military Communications Conference, pp. 1627–1632 (2014)
2. Crosscombe, M., Lawry, J.: A model of multi-agent consensus for vague and uncer-
tain beliefs. Adapt. Behav. 24(4), 249–260 (2016)
3. Crosscombe, M., Lawry, J., Hauert, S., Homer, M.: Robust distributed decision-
making in robot swarms: exploiting a third truth state. In: 2017 IEEE/RSJ Inter-
national Conference on Intelligent Robots and Systems (IROS), pp. 4326–4332.
IEEE (September 2017). https://doi.org/10.1109/IROS.2017.8206297
4. Dabarera, R., Núñez, R., Premaratne, K., Murthi, M.N.: Dynamics of belief theo-
retic agent opinions under bounded confidence. In: 17th International Conference
on Information Fusion (FUSION), pp. 1–8 (2014)
5. Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping.
Ann. Math. Stat. 38(2), 325–339 (1967)
6. Douven, I., Kelp, C.: Truth approximation, social epistemology, and opinion
dynamics. Erkenntnis 75, 271–283 (2011)
7. Dubois, D., Liu, W., Ma, J., Prade, H.: The basic principles of uncertain infor-
mation fusion. An organised review of merging rules in different representation
frameworks. Inf. Fusion 32, 12–39 (2016). https://doi.org/10.1016/j.inffus.2016.02.006
8. Dubois, D., Prade, H.: Representation and combination of uncertainty with belief
functions and possibility measures. Comput. Intell. 4(3), 244–264 (1988). https://
doi.org/10.1111/j.1467-8640.1988.tb00279.x
9. Dubois, D., Prade, H.: Possibility theory, probability theory and multiple-valued
logics: a clarification. Ann. Math. Artif. Intell. 32(1), 35–66 (2001). https://doi.
org/10.1023/A:1016740830286
10. Hegselmann, R., Krause, U.: Opinion dynamics and bounded confidence: models,
analysis and simulation. J. Artif. Soc. Soc. Simul. 5, 2 (2002)
11. Jøsang, A.: The consensus operator for combining beliefs. Artif. Intell. 141(1–2),
157–170 (2002). https://doi.org/10.1016/S0004-3702(02)00259-X
12. Kanjanatarakul, O., Denœux, T.: Distributed data fusion in the Dempster-Shafer
framework. In: 2017 12th System of Systems Engineering Conference (SoSE), pp.
1–6. IEEE (June 2017). https://doi.org/10.1109/SYSOSE.2017.7994954
13. Lee, C., Lawry, J., Winfield, A.: Combining opinion pooling and evidential updat-
ing for multi-agent consensus. In: Proceedings of the Twenty-Seventh International
Joint Conference on Artificial Intelligence (IJCAI-2018) Combining, pp. 347–353
(2018)
14. Lee, C., Lawry, J., Winfield, A.: Negative updating combined with opinion pooling
in the best-of-n problem in swarm robotics. In: Dorigo, M., Birattari, M., Blum,
C., Christensen, A.L., Reina, A., Trianni, V. (eds.) ANTS 2018. LNCS, vol. 11172,
pp. 97–108. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00533-7 8
15. Lu, X., Mo, H., Deng, Y.: An evidential opinion dynamics model based on het-
erogeneous social influential power. Chaos Solitons Fractals 73, 98–107 (2015).
https://doi.org/10.1016/j.chaos.2015.01.007
16. Parker, C.A.C., Zhang, H.: Cooperative decision-making in decentralized multiple-
robot systems: the best-of-n problem. IEEE/ASME Trans. Mechatron. 14(2), 240–
251 (2009). https://doi.org/10.1109/TMECH.2009.2014370
17. Reina, A., Bose, T., Trianni, V., Marshall, J.A.R.: Effects of spatiality on value-
sensitive decisions made by robot swarms. In: Groß, R., et al. (eds.) Distributed
Autonomous Robotic Systems. SPAR, vol. 6, pp. 461–473. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-73008-0 32
18. Reina, A., Marshall, J.A.R., Trianni, V., Bose, T.: Model of the best-of-n nest-site
selection process in honeybees. Phys. Rev. E 95, 052411 (2017). https://doi.org/
10.1103/PhysRevE.95.052411
19. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press,
Princeton (1976)
20. Smets, P.: Analyzing the combination of conflicting belief functions. Inf. Fusion
8(4), 387–412 (2007). https://doi.org/10.1016/j.inffus.2006.04.003
21. Smets, P., Kennes, R.: The transferable belief model. Artif. Intell. 66, 191–234
(1994)
22. Valentini, G., Ferrante, E., Dorigo, M.: The best-of-n problem in robot swarms:
formalization, state of the art, and novel perspectives. Front. Robot. AI 4, 9 (2017).
https://doi.org/10.3389/frobt.2017.00009
23. Valentini, G., Hamann, H., Dorigo, M.: Self-organized collective decision making:
the weighted voter model. In: Proceedings of the 2014 International Conference
on Autonomous Agents and Multi-agent Systems, pp. 45–52. AAMAS 2014. Inter-
national Foundation for Autonomous Agents and Multiagent Systems, Richland
(2014)
24. Wickramarathne, T.L., Premaratine, K., Murthi, M.N., Chawla, N.V.: Conver-
gence analysis of iterated belief revision in complex fusion environments. IEEE
J. Sel. Top. Signal Process. 8(4), 598–612 (2014). https://doi.org/10.1109/JSTSP.
2014.2314854
25. Yager, R.R.: On the specificity of a possibility distribution. Fuzzy Sets Syst. 50(3),
279–292 (1992). https://doi.org/10.1016/0165-0114(92)90226-T
Order-Independent Structure Learning
of Multivariate Regression Chain Graphs
1 Introduction
Chain graphs were introduced by Lauritzen, Wermuth and Frydenberg [5,9]
as a generalization of both undirected graphs and directed acyclic
graphs (DAGs). Later, Andersson, Madigan and Perlman introduced an alternative
Markov property for chain graphs [1]. In 1993 [3], Cox and Wermuth
introduced multivariate regression chain graphs (MVR CGs). The different inter-
pretations of CGs have different merits, but none of the interpretations subsumes
another interpretation [4].
Acyclic directed mixed graphs (ADMGs), also known as semi-Markov(ian)
models [12], contain directed (→) and bidirected (↔) edges, subject to the restriction
that there are no directed cycles [15]. An ADMG that has no partially
directed cycle is called a multivariate regression chain graph. Cox and Wermuth
represented these graphs using directed edges and dashed edges, but we fol-
low Richardson [15] because bidirected edges allow the m-separation criterion
Supported by AFRL and DARPA (FA8750-16-2-0042).
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 324–338, 2019.
https://doi.org/10.1007/978-3-030-35514-2_24
Order-Independent Structure Learning of MVR CGS 325
the order in which the variables are given. The PC-like algorithm for learn-
ing MVR CGs under the faithfulness assumption is formally described in
Algorithm 1.
Fig. 2. (a) The DAG G, (b) the skeleton returned by Algorithm 1 with order1 (V ), (c)
the skeleton returned by Algorithm 1 with order2 (V ).
We see that the skeletons are different, and that both are incorrect as the edge c − e is missing. The skeleton for order2 (V ) contains an additional error, as there is an additional edge a − e. We now go through Algorithm 1 to see
what happened. We start with a complete undirected graph on V . When i = 0,
variables are tested for marginal independence, and the algorithm correctly does
not remove any edge. Also, when i = 1, the algorithm correctly does not remove
any edge. When i = 2, there is a pair of vertices that is thought to be condition-
ally independent given a subset of size two, and the algorithm correctly removes
the edge between a and d. When i = 3, there are two pairs of vertices that are
thought to be conditionally independent given a subset of size three. Table 1
shows the trace table of Algorithm 1 for i = 3 and order1 (V ) = (d, e, a, c, b).
Table 1. The trace table of Algorithm 1 for i = 3 and order1 (V ) = (d, e, a, c, b).
Ordered pair (u, v) | adH(u)       | Suv       | Is Suv ⊆ adH(u) \ {v}? | Is u − v removed?
(e, a)              | {a, b, c, d} | {b, c, d} | Yes                    | Yes
(e, c)              | {b, c, d}    | {a, b, d} | No                     | No
(c, e)              | {a, b, d, e} | {a, b, d} | Yes                    | Yes
Table 2. The trace table of Algorithm 1 for i = 3 and order2 (V ) = (d, c, e, a, b).
Ordered pair (u, v) | adH(u)       | Suv       | Is Suv ⊆ adH(u) \ {v}? | Is u − v removed?
(c, e)              | {a, b, d, e} | {a, b, d} | Yes                    | Yes
(e, a)              | {a, b, d}    | {b, c, d} | No                     | No
(a, e)              | {b, c, e}    | {b, c, d} | No                     | No
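The two traces can be reproduced with a small sketch (Python; function and variable names are ours, and the conditional-independence "oracle" hard-codes only the two independencies judged to hold at this level of the example):

```python
# CI oracle for the example: at level i = 3 the judged independencies are
# e ⊥ a | {b, c, d} and c ⊥ e | {a, b, d}, with these separating sets.
SEPSETS = {frozenset("ea"): frozenset("bcd"),
           frozenset("ce"): frozenset("abd")}

def skeleton_level3(pair_order):
    """One level (i = 3) of the PC-like skeleton phase with eager edge removal."""
    nodes = "abcde"
    # Start from the complete graph minus a-d (removed at level i = 2).
    edges = {frozenset((u, v)) for u in nodes for v in nodes
             if u < v and {u, v} != {"a", "d"}}
    for u, v in pair_order:
        if frozenset((u, v)) not in edges:
            continue
        adj_u = {w for w in nodes if frozenset((u, w)) in edges} - {v}
        S = SEPSETS.get(frozenset((u, v)))
        if S is not None and S <= adj_u:
            edges.discard(frozenset((u, v)))  # eager removal shrinks later adjacencies
    return edges

order1 = [("e", "a"), ("e", "c"), ("c", "e")]
order2 = [("c", "e"), ("e", "a"), ("a", "e")]
print(skeleton_level3(order1) == skeleton_level3(order2))  # → False
```

As in Tables 1 and 2, order1 removes both e − a and c − e while order2 removes only c − e, so the two skeletons differ.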
Fig. 3. (a) The DAG G, (b) the CG returned after the v -structure recovery phase of
Algorithm 1 with order1 (V ), (c) the CG returned after the v -structure recovery phase
of Algorithm 1 with order3 (V ).
as the edge c − d is oriented, all edges are oriented and no further orientation
rules are applied. These examples illustrate that the essential graph recovery
phase of the PC-like algorithm can be order-dependent regardless of the output
of the previous steps.
Fig. 4. Possible mixed graph after v -structure recovery phase of the sample version of
the PC-like algorithm.
Theorem 2. The skeleton resulting from the sample version of the stable PC-
like algorithm is order-independent.
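The key idea behind Theorem 2 is to freeze the adjacency sets before each level starts, so removals within a level cannot influence later tests. A minimal sketch (Python, reusing the level-3 example above; names are ours):

```python
SEPSETS = {frozenset("ea"): frozenset("bcd"),
           frozenset("ce"): frozenset("abd")}
EDGES = {frozenset(p) for p in
         [("a","b"), ("a","c"), ("a","e"), ("b","c"), ("b","d"),
          ("b","e"), ("c","d"), ("c","e"), ("d","e")]}  # complete minus a-d

def stable_level(edges, pair_order):
    """One level of the stable skeleton phase: adjacency sets are computed
    once up front, so the result cannot depend on pair_order."""
    adj = {u: {w for e in edges if u in e for w in e - {u}} for u in "abcde"}
    marked = set()
    for u, v in pair_order:
        S = SEPSETS.get(frozenset((u, v)))
        if S is not None and S <= adj[u] - {v}:
            marked.add(frozenset((u, v)))  # mark now, remove only after the level
    return edges - marked

o1 = [("e", "a"), ("e", "c"), ("c", "e")]
o2 = [("c", "e"), ("e", "a"), ("a", "e")]
print(stable_level(EDGES, o1) == stable_level(EDGES, o2))  # → True
```

Both orderings now mark the same edges (e − a and c − e), illustrating why the stable variant is order-independent.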
Theorem 4. The decisions about v-structures in the sample version of the stable CPC/MPC-like algorithm are order-independent.
Example 5 (Order-independent decisions about v-structures). We consider the
sample versions of the stable CPC/MPC-like algorithm, using the same input
as in Example 2. In particular, we assume that all conditional independencies
induced by the MVR CG in Fig. 3(a) are judged to hold except c ⊥⊥ d. Suppose
that c ⊥⊥ d|b and c ⊥⊥ d|e are thought to hold. Let α = β = 50.
Denote the skeleton after the skeleton recovery phase by H. We consider
the unshielded triple (c, e, d). First, we compute aH (c) = {a, d, e} and aH (d) =
{a, b, c, e}, when i = 1. We now consider all subsets S of these adjacency sets,
and check whether c ⊥⊥ d|S. The following separating sets are found: {b}, {e},
and {b, e}. Since e is in some but not all of these separating sets, the stable CPC-
like algorithm determines that the triple is ambiguous, and no orientations are
performed. Since e is in more than half of the separating sets, stable MPC-like
determines that the triple is unambiguous and not a v -structure. The output of
both algorithms is given in Fig. 3(c).
At this point it should be clear why the modified PC-like algorithm is labeled
“conservative”: it is more cautious than the (stable) PC-like algorithm in drawing
unambiguous conclusions about orientations. As we showed in Example 5, the
output of the (stable) CPC-like algorithm may not be collider equivalent to the true MVR CG G, if the resulting CG contains an ambiguous triple.
Theorem 6. The sample versions of stable CPC-like and stable MPC-like algo-
rithms are fully order-independent.
5 Evaluation
In this section, we compare the performance of our algorithms (Table 3) with the original PC-like learning algorithm by running them on randomly generated MVR chain graphs in low-dimensional and high-dimensional settings, respectively. We report on the Gaussian case only because of space limitations.
We evaluate the performance of the proposed algorithms in terms of the six
measurements that are commonly used [2,8,10,20] for constraint-based learning
algorithms: (a) the true positive rate (TPR) (also known as sensitivity, recall, and
hit rate), (b) the false positive rate (FPR) (also known as fall-out), (c) the true
discovery rate (TDR) (also known as precision or positive predictive value), (d)
accuracy (ACC) for the skeleton, (e) the structural Hamming distance (SHD)
(this is the metric described in [20] to compare the structure of the learned
and the original graphs), and (f) run-time for the LCG recovery algorithms. In
principle, large values of TPR, TDR, and ACC, and small values of FPR and
SHD indicate good performance. All of these six measurements are computed on
the essential graphs of the CGs, rather than the CGs directly, to avoid spurious
differences due to random orientation of undirected edges.
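On the skeleton, the first four measures reduce to simple counts over unordered vertex pairs; a sketch (Python; our own naming, restricted to the undirected skeleton, with SHD and run time omitted):

```python
def skeleton_metrics(true_edges, learned_edges, n_nodes):
    """TPR = TP/(TP+FN), FPR = FP/(TN+FP), TDR = TP/(TP+FP),
    ACC = (TP+TN)/#pairs, counted over all unordered vertex pairs."""
    pairs = n_nodes * (n_nodes - 1) // 2
    tp = len(true_edges & learned_edges)   # edges present in both
    fp = len(learned_edges - true_edges)   # spurious edges
    fn = len(true_edges - learned_edges)   # missing edges
    tn = pairs - tp - fp - fn              # correctly absent edges
    return {"TPR": tp / (tp + fn), "FPR": fp / (tn + fp),
            "TDR": tp / (tp + fp), "ACC": (tp + tn) / pairs}

true_sk = {frozenset("ab"), frozenset("bc"), frozenset("cd")}
learned = {frozenset("ab"), frozenset("bc"), frozenset("bd")}
print(skeleton_metrics(true_sk, learned, 4))
```

With 4 nodes there are 6 pairs; here TP = 2, FP = FN = 1, TN = 2, giving TPR = TDR = ACC = 2/3 and FPR = 1/3.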
Figure 5 shows that: (a) as we expected [8,10], all algorithms work well on
sparse graphs (N = 2), (b) for all algorithms, typically the TPR, TDR, and
ACC increase with sample size, (c) for all algorithms, typically the SHD and FPR
decrease with sample size, (d) a large significance level (α = 0.05) typically yields
336 M. A. Javidian et al.
[Fig. 5 plots (images not recoverable): panels of TDR, TPR, FPR, ACC, SHD, and run time for the original (OPC) and stable (SPC) PC-like algorithms against sample size (500–10000) and, for P = 1000, N = 2, against p-values (0.05–0.001), including the stable MPC, stable LCPC, and stable LMPC variants.]
Fig. 5. The first two rows show the performance of the original (OPC) and stable PC-like (SPC) algorithms for randomly generated Gaussian chain graph models: averages over 30 repetitions with 50 variables (corresponding to N = 2) and significance level α = 0.001. The last two rows show the performance of the original (OPC) and stable PC-like (SPC) algorithms for randomly generated Gaussian chain graph models: averages over 30 repetitions with 1000 variables (corresponding to N = 2), sample size S = 50, and significance levels α = 0.05, 0.01, 0.005, 0.001.
large TPR, FPR, and SHD, (e) while the stable PC-like algorithm has a better TDR and FPR in comparison with the original PC-like algorithm, the original PC-like algorithm has a better TPR (as observed in the case of DAGs [2]); this can be explained by the fact that the stable PC-like algorithm tends to perform more tests than the original PC-like algorithm, and (f) while the original PC-like algorithm has a (slightly) better SHD in comparison with the stable PC-like algorithm on low-dimensional data, the stable PC-like algorithm has a better SHD on high-dimensional data. Also, (very) small variances indicate that the order-independent versions of the PC-like algorithm are stable on high-dimensional data. When considering average running times versus sample sizes, as shown in Fig. 5, we observe that: (a) the average run time increases with sample size; (b) generally, the average run time of the original PC-like algorithm is (slightly) better than that of the stable PC-like algorithm in both low- and high-dimensional settings.
In summary, empirical simulations show that our algorithms achieve com-
petitive results with the original PC-like learning algorithm; in particular, in the
Gaussian case the order-independent algorithms achieve output of better qual-
ity than the original PC-like algorithm, especially in high-dimensional settings.
Since we know of no score-based learning algorithms for MVR chain graphs (and,
in fact, for any kind of chain graphs), we plan to investigate the feasibility of a
scalable algorithm of this kind.
References
1. Andersson, S.A., Madigan, D., Perlman, M.D.: An alternative Markov property
for chain graphs. In: Proceedings of UAI Conference, pp. 40–48 (1996)
2. Colombo, D., Maathuis, M.H.: Order-independent constraint-based causal struc-
ture learning. J. Mach. Learn. Res. 15(1), 3741–3782 (2014)
3. Cox, D.R., Wermuth, N.: Linear dependencies represented by chain graphs. Stat.
Sci. 8(3), 204–218 (1993)
4. Drton, M.: Discrete chain graph models. Bernoulli 15(3), 736–753 (2009)
5. Frydenberg, M.: The chain graph Markov property. Scand. J. Stat. 17(4), 333–353
(1990)
6. Golumbic, M.: Algorithmic Graph Theory and Perfect Graphs. Academic Press,
New York (1980)
7. Javidian, M.A., Valtorta, M.: On the properties of MVR chain graphs. In: Work-
shop Proceedings of PGM Conference, pp. 13–24 (2018)
8. Kalisch, M., Bühlmann, P.: Estimating high-dimensional directed acyclic graphs
with the PC-algorithm. J. Mach. Learn. Res. 8, 613–636 (2007)
9. Lauritzen, S., Wermuth, N.: Graphical models for associations between variables,
some of which are qualitative and some quantitative. Ann. Stat. 17(1), 31–57
(1989)
10. Ma, Z., Xie, X., Geng, Z.: Structural learning of chain graphs via decomposition.
J. Mach. Learn. Res. 9, 2847–2880 (2008)
11. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1988)
12. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University
Press, Cambridge (2009)
13. Peña, J.M.: Alternative Markov and causal properties for acyclic directed mixed
graphs. In: Proceedings of UAI Conference, pp. 577–586 (2016)
14. Ramsey, J., Spirtes, P., Zhang, J.: Adjacency-faithfulness and conservative causal
inference. In: Proceedings of UAI Conference, pp. 401–408 (2006)
15. Richardson, T.S.: Markov properties for acyclic directed mixed graphs. Scand. J.
Stat. 30(1), 145–157 (2003)
16. Sonntag, D., Peña, J.M.: Learning multivariate regression chain graphs under faith-
fulness. In: Proceedings of PGM Workshop, pp. 299–306 (2012)
17. Sonntag, D., Peña, J.M.: Chain graph interpretations and their relations revisited.
Int. J. Approx. Reason. 58, 39–56 (2015)
18. Sonntag, D., Peña, J.M., Gómez-Olmedo, M.: Approximate counting of graphical
models via MCMC revisited. Int. J. Intell. Syst. 30(3), 384–420 (2015)
19. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction and Search, 2nd edn.
MIT Press, Cambridge (2000)
20. Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing Bayesian
network structure learning algorithm. Mach. Learn. 65(1), 31–78 (2006)
21. Wermuth, N., Sadeghi, K.: Sequences of regressions and their independences. Test
21, 215–252 (2012)
Comparison of Analogy-Based Methods
for Predicting Preferences
1 Introduction
Predicting preferences has become a challenging topic in artificial intelligence,
e.g., [9]. The idea of applying analogical proportion-based inference to this prob-
lem has been recently proposed [11] and different approaches have been suc-
cessfully tested [1,7], following previous studies that obtained good results in
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 339–354, 2019.
https://doi.org/10.1007/978-3-030-35514-2_25
340 M. Bounhas et al.
a1_{−i}α ≻ a2_{−i}β
a1_{−i}γ ≻ a2_{−i}δ
c1_{−j}α ≻ c2_{−j}β
one cannot have:
c1_{−j}γ ≺ c2_{−j}δ
where x_{−i} denotes the (n − 1)-dimensional vector made of the evaluations of x on all criteria except the ith one, for which the Greek letter denotes the substituted value. This property ensures that the differences of preference between α and β, on the one hand, and between γ and δ, on the other hand, can consistently be compared.
Thus, when applying the first pattern, one may also make sure that no con-
tradictory trade-offs are introduced by the prediction mechanism. In the first
pattern, analogical reasoning amounts here to finding triples of pairs of com-
pared items (a, b, c) appropriate for inferring the missing value(s) in d. When
there exist several suitable triples, possibly leading to different conclusions, one
may use a majority vote for concluding.
Analogical proportions can be extended to numerical values, once the values are renormalized to the scale [0, 1], by a multiple-valued logic expression. The main option, which agrees with the Boolean case and where truth is a matter of degree [6], is:

A(a, b, c, d) = 1 − |(a − b) − (c − d)|   if a ≥ b and c ≥ d, or a ≤ b and c ≤ d,
A(a, b, c, d) = 1 − max(|a − b|, |c − d|) otherwise.

Note that A(a, b, c, d) = 1 iff a − b = c − d.
We can then compute to what extent an analogical proportion holds between
vectors:
A(a, b, c, d) = (1/n) Σ_{i=1}^{n} A(a_i, b_i, c_i, d_i)    (1)
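The two formulas above translate directly into code (a Python sketch; values are assumed already rescaled to [0, 1]):

```python
def A1(a, b, c, d):
    """Degree to which the proportion a : b :: c : d holds for values in [0, 1]."""
    if (a >= b and c >= d) or (a <= b and c <= d):
        return 1 - abs((a - b) - (c - d))
    return 1 - max(abs(a - b), abs(c - d))

def A(a, b, c, d):
    """Vector extension, Formula (1): componentwise average of A1."""
    return sum(A1(ai, bi, ci, di) for ai, bi, ci, di in zip(a, b, c, d)) / len(a)

print(A1(1.0, 0.5, 0.75, 0.25))  # → 1.0, since 1.0 − 0.5 = 0.75 − 0.25
print(A1(1.0, 0.0, 0.0, 1.0))    # → 0.0, maximally far from a proportion
```

As stated above, A1 reaches 1 exactly when the two differences coincide, and degrades continuously otherwise.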
Lastly, let us remark that the second, simpler pattern agrees with the view that preferences w.r.t. each criterion are represented by differences of evaluations. This includes the weighted sum, namely b ≻ a iff Σ_{i=1}^{n} w_i (b_i − a_i) ≥ 0, while
3.1 Methodology
Given a new pair of items d = (d1, d2) for which the preference is to be predicted, we present two types of algorithms in the following, corresponding respectively to the "vertical reading" (first pattern), which exploits triples of pairs of items, and to the "horizontal reading" (second pattern), where pairs of items are taken one by one. This leads to algorithms APL3 and APL1, respectively.
APL3: The basic principle of APL3 is to find triples t = (a, b, c) of examples in E³ that form with d either the non-contradictory trade-offs pattern (considered first) or the analogical proportion-based inference pattern.
For each triple t = (a, b, c), we compute an analogical score A_t(a, b, c, d) that estimates the extent to which it is in analogy with the item d, using Formula (1). Then, to guess the final preference of d, for each possible solution we first cumulate the atomic scores provided by each of these triples in favor of this solution, and finally assign to d the solution with the highest score. In case of ties, a majority vote is applied.
APL3 can be described by the following basic process:
APL1: Applying the "horizontal reading" (second pattern), we consider only one item a at a time and compare it with d in terms of pairs of vectors, rather than comparing four preferences simultaneously as with the first pattern. From a preference a : a1 ≻ a2 such that (a1, a2, d1, d2) is in analogical proportion, one extrapolates that the same preference still holds for d : d1 ≻ d2. A similar process is applied in [7], where it is called analogical transfer of preferences. A comparison
Comparison of Analogy-Based Methods for Predicting Preferences 343
3.2 Algorithms
Based on the above ideas, we propose two different algorithms for predicting preferences. Let E be a training set of examples whose preferences are known. Algorithms 1 and 2 respectively describe the two previously introduced approaches.
Algorithm 1. APL3
Input: a training set E of examples with known preferences,
       a new item d ∉ E whose preference P(d) is unknown.
SumA(p) = 0 for each p ∈ {≻, ≺}
BestA_t = 0, S = ∅, BestSol = ∅
for each triple t = (a, b, c) in E³ do
    S = FindCandidateTriples(t)
    for each candidate triple ct = (a′, b′, c′) in S do
        if (P(a′) : P(b′) :: P(c′) : x has solution p) then
            A_t = Min(A(a′1, b′1, c′1, d1), A(a′2, b′2, c′2, d2))
            if (A_t > BestA_t) then
                BestA_t = A_t
                BestSol = Sol(ct)
            end if
        end if
    end for
    SumA(BestSol) += BestA_t
end for
maxi = max{SumA}
if (maxi ≠ 0) then
    if (unique(maxi, SumA)) then
        P(d) = argmax_p {SumA}
    else
        Majority vote
    end if
else
    No Prediction
end if
return P(d)
Algorithm 2. APL1
Input: a training set E of examples with known preferences,
       a new item d ∉ E whose preference P(d) is unknown.
SumA(p) = 0 for each p ∈ {≻, ≺}
BestA = 0, BestSol = ∅
for each a in E do
    BestA = max(A(a1, a2, d1, d2), A(a2, a1, d1, d2))
    if (A(a1, a2, d1, d2) > A(a2, a1, d1, d2)) then
        BestSol = P(a)
    else
        BestSol = notP(a)
    end if
    SumA(BestSol) += BestA
end for
maxi = max{SumA}
if (maxi ≠ 0) then
    if (unique(maxi, SumA)) then
        P(d) = argmax_p {SumA}
    else
        Majority vote
    end if
else
    No Prediction
end if
return P(d)
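A runnable condensation of Algorithm 2 (Python; the encoding of examples and the tie handling are our own simplifications, and the majority-vote fallback is omitted):

```python
def A1(a, b, c, d):
    if (a >= b and c >= d) or (a <= b and c <= d):
        return 1 - abs((a - b) - (c - d))
    return 1 - max(abs(a - b), abs(c - d))

def A(a, b, c, d):
    return sum(A1(*t) for t in zip(a, b, c, d)) / len(a)

def apl1(E, d1, d2):
    """APL1 sketch. E holds ((a1, a2), pref) examples with pref in {'>', '<'},
    standing for a1 ≻ a2 or a1 ≺ a2; (d1, d2) is the pair to be predicted."""
    flip = {">": "<", "<": ">"}
    sums = {">": 0.0, "<": 0.0}
    for (a1, a2), pref in E:
        same, rev = A(a1, a2, d1, d2), A(a2, a1, d1, d2)
        if same > rev:                 # transfer a's preference as-is ...
            sums[pref] += same
        else:                          # ... or transfer the reversed preference
            sums[flip[pref]] += rev
    if max(sums.values()) == 0:
        return None                    # "No Prediction"
    return max(sums, key=sums.get)     # ties would need the majority vote

E = [(((0.9, 0.2), (0.4, 0.2)), ">")]   # one example: first item preferred
print(apl1(E, (0.8, 0.1), (0.3, 0.1)))  # → '>'
```

The single example transfers its preference to the analogous query pair; swapping d1 and d2 in the query yields '<' instead.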
A panoply of research works has been developed to deal with preference learning problems; see, e.g., [8]. The goal of most of these works is to predict a total order function that agrees with a given preference relation. See, e.g., Cohen et al. [5], who developed a greedy ordering algorithm to build a total order on the input preferences given by an expert, and who also suggest an approach to linearly combine a set of preference functions. However, in this paper we only intend to predict preference relations, not a total order on preferences.
Even if the proposed approaches for predicting preferences may look similar to recommender system models, the latter address problems that are somewhat different from ours. Observe that, in general, prediction in recommender systems is based on examples where items are associated with absolute grades (e.g., from 1 to 5); namely, examples are not made of comparisons between pairs of items as in our case.
To the best of our knowledge, the only other approach aiming at predicting preferences on an analogical proportion basis is the recent paper [7], which only investigates the "horizontal reading" of preference relations, leaving aside a preliminary version of the "vertical reading" [1]. The algorithms proposed in [1] only use the Boolean setting of analogical proportions, while the approach presented here deals with the multiple-valued setting (also applied in [7]). Moreover, Algorithm 2 in [1] used a set of preference examples completed by monotony when no triples satisfying analogical proportion could be found, while in this work this issue is simply solved by selecting triples satisfying the analogical proportions to some degree, as suggested in the presentation of the multiple-valued extension in Sect. 2. This is why we compare our proposed analogical-proportion algorithms in depth with this last work [7]. APL algorithms differ from this approach in two ways.
First, the focus of [7] is on learning to rank user preferences based on the evaluation of a loss function, while our focus is on predicting preferences (rather than obtaining a ranking), evaluated with the error rate of predictions.
Lastly, although Algorithm 1 in [7] may be used for predicting preferences in the same way as our APL1, the key difference between these two algorithms is that APL1 exploits the summation of all valid analogical proportions for each possible solution for the d to be predicted, while Algorithm 1 in [7] computes all valid analogical proportions and then considers only the N most relevant ones for prediction, i.e., those having the largest analogical scores. To select such N best scores, a total order on valid proportions is required for each item d to be predicted, which may be computationally costly for a large number of items, as noted in [7], while no ordering on analogical proportions is required in the proposed APL1.
To compare our APL algorithms to Algorithm 1 in [7], suitable for predicting preferences, we also re-implemented the latter as described in their paper (without considering their Algorithm 2). We also tuned the parameter N with the same input values as fixed by the authors [7].
In terms of complexity, due to the use of triples of pairs of items in APL3, the algorithm has a cubic complexity, while algorithm APL3(NN) is quadratic. Both APL1 and Algorithm 1 in [7] are linear, even if Algorithm 1 is slightly more computationally costly due to the ordering process.
4 Experimentations
To evaluate the proposed APL algorithms, we have developed a set of experiments
that we describe in the following.
4.1 Datasets
The experimental study is based on five datasets; the first two are synthetic data generated from different functions (weighted average, Tversky's additive difference, and Sugeno integral) described in the following. For each dataset, every possible combination of feature values over the scale S is associated with a preference relation.
SMin = min_{i=1..n} max(v_i, 6 − w_i),
where v_i refers to the value of criterion i and w_i represents its weight. In this case, we tried two different sets of weights, w1 = (5, 4, 2) and w2 = (5, 3, 3), respectively for criteria 1, 2, and 3.
– Datasets 2: we expand each preference relation to support 5 criteria, i.e., n = 5. We apply the weights 0.4, 0.3, 0.1, 0.1, 0.1 in the case of the weighted average function, and w1 = (5, 4, 3, 2, 1) and w2 = (5, 4, 4, 2, 2) in the case of the Sugeno integral functions. For generating the second dataset (TV), we used the piecewise linear functions given in Appendix A. For the two datasets, weights are fixed on an empirical basis, although other choices have been tested and led to similar results.
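The SMin aggregation above can be checked in a few lines (a Python sketch; we assume criteria are valued on the 1–5 scale implied by the 6 − w_i term):

```python
def s_min(values, weights):
    """SMin = min_i max(v_i, 6 - w_i): a weighted-minimum aggregation where
    6 - w_i bounds from below the contribution of lightly weighted criteria."""
    return min(max(v, 6 - w) for v, w in zip(values, weights))

w1, w2 = (5, 4, 2), (5, 3, 3)
print(s_min((3, 5, 2), w1))  # min(max(3,1), max(5,2), max(2,4)) = min(3,5,4) → 3
print(s_min((3, 5, 2), w2))  # min(max(3,1), max(5,3), max(2,3)) = min(3,5,3) → 3
```

Note how the weakly weighted third criterion (w = 2) cannot pull the score below 4 on its own, which is the qualitative effect of the weights here.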
used to perform the initial cross-validation. We run each algorithm (with the pre-
vious procedure) 10 times. Accuracies and parameters shown in Table 2 are the
average values over the 10 different values (one for each run).
4.3 Results
Tables 2 and 3 provide prediction accuracies for synthetic and real datasets, respectively, for the three proposed APL algorithms as well as for Algorithm 1 described in [7] (denoted here "FH18"). The best accuracies for each dataset size are highlighted in bold.
If we analyze the results in Tables 2 and 3, we can conclude that:
– For synthetic data, in the case of datasets generated from a weighted average, APL3 clearly achieves the best performance for almost all dataset sizes, with APL1 just after. Note that these two algorithms record all triple/item analogical scores for prediction. We may think that it is better to use all of the training set for prediction to be compatible with weighted average examples.
– In the case of datasets generated from a Sugeno integral, APL1 is significantly better than the other algorithms for most dataset sizes and for the two weight sets W1 and W2.
– If we compare the results for the three types of datasets (generated from a weighted average, from Tversky's additive difference, or from a Sugeno integral), globally we can see that the accuracy obtained on a Sugeno integral dataset is the best in the case of datasets with 3 criteria (see, for example, APL1). For datasets with 5 criteria, results obtained on weighted average datasets are better than on the two others, while results obtained on Tversky's additive difference datasets seem less accurate in most cases.
– For real datasets, it appears that APL3(NN) is the best predictor for most tested datasets. To predict user preferences, rather than using all of the training set for prediction, we can select a set of training examples, those where one of them is among the k-nearest neighbors.
– APL3(NN) seems less efficient in the case of synthetic datasets. This is due to the fact that synthetic data is generated randomly, and applying the NN approach is less suitable in such cases.
– If we compare the APL algorithms to Algorithm 1 ("FH18"), we can see that APL3 outperforms the latter in the case of synthetic datasets. Moreover, APL3(NN) is better than "FH18" in the case of real datasets.
– APL algorithms achieve the same accuracy as Algorithm 2 in [1] with a very small dataset size (for the dataset with 5 criteria built from a weighted average, only 200 examples are used by the APL algorithms instead of 1000 examples in [1] to achieve the best accuracy). The two algorithms have close results on the Food dataset.
– Comparing the vertical and horizontal approaches, there is no clear superiority of one view over the other (for the tested datasets).
To better investigate the difference between the two best algorithms noted in
the previous study, we also develop a pairwise comparison at the instance level
Datasets | Size | TT    | TF    | FT    | FF
D1       | 100  | 0.96  | 0.02  | 0.01  | 0.01
D1       | 200  | 0.93  | 0.02  | 0.02  | 0.03
D2       | 100  | 0.84  | 0.07  | 0.05  | 0.04
D2       | 200  | 0.9   | 0.04  | 0.01  | 0.05
Food     | 1000 | 0.528 | 0.18  | 0.106 | 0.186
Univ.    | 1000 | 0.735 | 0.158 | 0.033 | 0.074
Movie    | 1000 | 0.364 | 0.189 | 0.17  | 0.277
In Tables 5 and 6, we compare the best results obtained with our algorithms to
the accuracies obtained by finding the weighted sum that best fits the data in the
learning sets. The weights are found by using linear programming as explained in
Appendix B.
Table 5 displays the results obtained using the synthetic datasets generated according to Tversky's model. The results for datasets generated by a weighted average or a Sugeno integral are not reproduced in this table because the weighted sum (WSUM) almost always reaches an accuracy of 100%. Only in three cases out of thirty is its accuracy slightly lower, with a worst performance of 97.5%. This is not
unexpected for datasets generated by means of a weighted average, since WSUM is
the right model in this case. It is more surprising for data generated by a Sugeno
integral (even if we have only dealt here with particular cases), but we get here
some empirical evidence that the Sugeno integral can be well-approximated by a
weighted sum. The results are quite different for datasets generated by Tversky’s
model. WSUM shows the best accuracy in two cases; APL1 and APL3, also in
two cases each. Tversky’s model does not lead to transitive preference relations,
in general, and this may be detrimental to WSUM that models transitive relations.
Table 5. Prediction accuracies for artificial datasets generated by the Tversky model
Table 6 compares the accuracies obtained with the real datasets. WSUM yields
the best results for all datasets except for the “Food” dataset, size 1000.
5 Conclusion
The results presented in the previous section confirm the interest of considering
analogical proportions for predicting preferences, which was the primary goal of
this paper since such an approach has been proposed only recently. We observed
that analogical proportions yield a better accuracy as compared to a weighted sum
model for certain datasets (TV, among the synthetic datasets and Food, as a real
dataset). Determining for which datasets this tends to be the case requires further
investigation.
Analogical proportions may be a tool of interest for creating artificial examples that are useful for enlarging a training set; see, e.g., [2]. It would be worth investigating whether such enlarged datasets could benefit analogy-based preference learning algorithms as well as those based on a weighted sum.
Tversky’s additive difference model functions used in the experiments are given
below. Let d1 , d2 be a pair of alternative that have to be compared. We denote by
ηi the difference between d1 and d2 on the criterion i, i.e. ηi = d1i − d2i . For the
TV dataset in which 3 features are involved, we used the following piecewise linear
functions:
Φ1(η1) = sgn(η1) · 0.453 · 0.143 · η1                if |η1| ∈ [0, 0.25],
         sgn(η1) · 0.453 · [−0.168 + 0.815 · η1]     if |η1| ∈ [0.25, 0.5],
         sgn(η1) · 0.453 · [0.230 + 0.018 · η1]      if |η1| ∈ [0.5, 0.75],
         sgn(η1) · 0.453 · [−2.024 + 3.024 · η1]     if |η1| ∈ [0.75, 1],

Φ2(η2) = sgn(η2) · 0.053 · 2.648 · η2                if |η2| ∈ [0, 0.25],
         sgn(η2) · 0.053 · [0.371 + 1.163 · η2]      if |η2| ∈ [0.25, 0.5],
         sgn(η2) · 0.053 · [0.926 + 0.054 · η2]      if |η2| ∈ [0.5, 0.75],
         sgn(η2) · 0.053 · [0.866 + 0.134 · η2]      if |η2| ∈ [0.75, 1],

Φ3(η3) = sgn(η3) · 0.494 · 0.289 · η3                if |η3| ∈ [0, 0.25],
         sgn(η3) · 0.494 · [−0.197 + 1.076 · η3]     if |η3| ∈ [0.25, 0.5],
         sgn(η3) · 0.494 · [0.150 + 0.383 · η3]      if |η3| ∈ [0.5, 0.75],
         sgn(η3) · 0.494 · [−1.252 + 2.252 · η3]     if |η3| ∈ [0.75, 1].
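These marginals can be evaluated mechanically; below is a Python sketch for Φ1 of the 3-feature case. We assume the bracketed linear parts apply to |η| (which makes each Φ odd and continuous at the breakpoints: both branches give 0.453 · 0.03575 at |η1| = 0.25); the names and table encoding are ours:

```python
def sgn(x):
    return (x > 0) - (x < 0)

# (upper bound of |eta| sub-interval, intercept, slope) for Phi_1, scale 0.453.
PHI1_PIECES = [(0.25, 0.0, 0.143), (0.5, -0.168, 0.815),
               (0.75, 0.230, 0.018), (1.0, -2.024, 3.024)]

def phi(eta, pieces, scale):
    """Odd piecewise-linear marginal: sgn(eta) * scale * (b0 + b1 * |eta|)."""
    a = abs(eta)
    for ub, b0, b1 in pieces:
        if a <= ub:
            return sgn(eta) * scale * (b0 + b1 * a)
    raise ValueError("|eta| must lie in [0, 1]")

print(phi(0.25, PHI1_PIECES, 0.453))   # first piece: 0.453 * 0.143 * 0.25
print(phi(-0.25, PHI1_PIECES, 0.453))  # odd function: the negative of the above
```

In Tversky's additive difference model, a pair would then be compared by summing the Φ_i(η_i) over the criteria, but that combination step is our reading and not spelled out here.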
For the TV dataset in which 5 features are involved, we used the following
piecewise linear functions:
Φ1(η1) = sgn(η1) · 0.294 · 2.510 · η1                if |η1| ∈ [0, 0.25],
         sgn(η1) · 0.294 · [0.562 + 0.263 · η1]      if |η1| ∈ [0.25, 0.5],
         sgn(η1) · 0.294 · [0.645 + 0.096 · η1]      if |η1| ∈ [0.5, 0.75],
         sgn(η1) · 0.294 · [−0.130 + 1.130 · η1]     if |η1| ∈ [0.75, 1],

Φ2(η2) = sgn(η2) · 0.151 · 0.125 · η2                if |η2| ∈ [0, 0.25],
         sgn(η2) · 0.151 · [0.025 + 0.023 · η2]      if |η2| ∈ [0.25, 0.5],
         sgn(η2) · 0.151 · [−0.545 + 1.164 · η2]     if |η2| ∈ [0.5, 0.75],
         sgn(η2) · 0.151 · [−1.689 + 2.689 · η2]     if |η2| ∈ [0.75, 1],

Φ3(η3) = sgn(η3) · 0.039 · 2.388 · η3                if |η3| ∈ [0, 0.25],
         sgn(η3) · 0.039 · [0.582 + 0.057 · η3]      if |η3| ∈ [0.25, 0.5],
         sgn(η3) · 0.039 · [−0.046 + 1.314 · η3]     if |η3| ∈ [0.5, 0.75],
         sgn(η3) · 0.039 · [0.759 + 0.241 · η3]      if |η3| ∈ [0.75, 1],

Φ4(η4) = sgn(η4) · 0.425 · 0.014 · η4                if |η4| ∈ [0, 0.25],
         sgn(η4) · 0.425 · [−0.110 + 0.455 · η4]     if |η4| ∈ [0.25, 0.5],
         sgn(η4) · 0.425 · [−0.341 + 0.917 · η4]     if |η4| ∈ [0.5, 0.75],
         sgn(η4) · 0.425 · [−1.613 + 2.613 · η4]     if |η4| ∈ [0.75, 1],

Φ5(η5) = sgn(η5) · 0.091 · 3.307 · η5                if |η5| ∈ [0, 0.25],
         sgn(η5) · 0.091 · [0.697 + 0.519 · η5]      if |η5| ∈ [0.25, 0.5],
         sgn(η5) · 0.091 · [0.880 + 0.153 · η5]      if |η5| ∈ [0.5, 0.75],
         sgn(η5) · 0.091 · [0.979 + 0.021 · η5]      if |η5| ∈ [0.75, 1].
w_i ∈ [0, 1], i = 1, ..., n
δ_a ∈ [0, ∞[
with:
– n: number of features,
– E: learning set composed of pairs (a1, a2) evaluated on n features and a preference relation for each pair (a1 ≻ a2 or a1 ≺ a2),
– w_i: weight associated with feature i,
– ε: a small positive value.
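Once weights are obtained, their fit to the learning set can be checked directly; a sketch (Python; the example encoding and the ε value are our own, and the LP solve itself is left to an off-the-shelf solver):

```python
def wsum_fits(E, w, eps=1e-6):
    """Fraction of preference examples satisfied by a weighted sum:
    a1 > a2 should give sum_i w_i * (a1_i - a2_i) >= eps, and dually for <."""
    ok = 0
    for (a1, a2), pref in E:
        margin = sum(wi * (x - y) for wi, x, y in zip(w, a1, a2))
        if (pref == ">" and margin >= eps) or (pref == "<" and margin <= -eps):
            ok += 1
    return ok / len(E)

E = [(((1.0, 0.0), (0.0, 0.0)), ">"),   # first item preferred
     (((0.0, 0.0), (0.0, 1.0)), "<")]   # second item preferred
print(wsum_fits(E, (0.5, 0.5)))  # → 1.0: both preferences are reproduced
```

A weight vector satisfying all constraints with δ_a = 0 would score 1.0 here; the LP of the appendix searches for such weights (or minimizes the slack when none exist).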
References
1. Bounhas, M., Pirlot, M., Prade, H.: Predicting preferences by means of analogi-
cal proportions. In: Cox, M.T., Funk, P., Begum, S. (eds.) ICCBR 2018. LNCS
(LNAI), vol. 11156, pp. 515–531. Springer, Cham (2018). https://doi.org/10.1007/
978-3-030-01081-2 34
2. Bounhas, M., Prade, H.: An analogical interpolation method for enlarging a training
dataset. In: BenAmor, N., Theobald, M. (eds.) Proceedings of 13th International
Conference on Scalable Uncertainty Management (SUM 2019), Compiègne, 16–18
December. LNCS, Springer (2019)
3. Bounhas, M., Prade, H., Richard, G.: Analogy-based classifiers for nominal or
numerical data. Int. J. Approx. Reason. 91, 36–55 (2017)
4. Chen, S., Joachims, T.: Predicting matchups and preferences in context. In: Pro-
ceedings of the 22nd ACM SIGKDD International Conference on Knowledge. Dis-
covery and Data Mining (KDD 2016), pp. 775–784. ACM (2016)
5. Cohen, W.W., Schapire, R.E., Singer, Y.: Learning to order things. CoRR
abs/1105.5464 (2011). http://arxiv.org/abs/1105.5464
6. Dubois, D., Prade, H., Richard, G.: Multiple-valued extensions of analogical pro-
portions. Fuzzy Sets Syst. 292, 193–202 (2016)
7. Fahandar, M.A., Hüllermeier, E.: Learning to rank based on analogical reasoning.
In: Proceedings of 32nd National Conference on Artificial Intelligence (AAAI 2018),
New Orleans, 2–7 February 2018 (2018)
8. Fürnkranz, J., Hüllermeier, E. (eds.): Preference Learning. Springer, Heidelberg
(2010). https://doi.org/10.1007/978-3-642-14125-6
9. Hüllermeier, E., Fürnkranz, J.: Editorial: preference learning and ranking. Mach.
Learn. 93(2–3), 185–189 (2013)
10. Miclet, L., Bayoudh, S., Delhay, A.: Analogical dissimilarity: definition, algorithms
and two experiments in machine learning. JAIR 32, 793–824 (2008)
11. Pirlot, M., Prade, H., Richard, G.: Completing preferences by means of analogi-
cal proportions. In: Torra, V., Narukawa, Y., Navarro-Arribas, G., Yañez, C. (eds.)
MDAI 2016. LNCS (LNAI), vol. 9880, pp. 135–147. Springer, Cham (2016). https://
doi.org/10.1007/978-3-319-45656-0 12
12. Prade, H., Richard, G.: Homogeneous logical proportions: their uniqueness and their
role in similarity-based prediction. In: Brewka, G., Eiter, T., McIlraith, S.A. (eds.)
Proceedings of 13th International Conference on Principles of Knowledge Repre-
sentation and Reasoning (KR 2012), Roma, 10–14 June, pp. 402–412. AAAI Press
(2012)
13. Prade, H., Richard, G.: From analogical proportion to logical proportions. Log.
Univers. 7(4), 441–505 (2013)
14. Prade, H., Richard, G.: Analogical proportions: from equality to inequality. Int. J.
Approx. Reason. 101, 234–254 (2018)
15. Tversky, A.: Intransitivity of preferences. Psychol. Rev. 76, 31–48 (1969)
Using Convolutional Neural Network
in Cross-Domain Argumentation
Mining Framework
1 Introduction
An important sub-field of computational argumentation is Argumentation Mining (AM), which aims to detect argumentative sentences and argument components in various sources (e.g., online debates, social media, persuasive essays, forums). Over the last decade, the AM field has attracted the interest of many researchers due to its impact in several domains [1,4,19]. The AM process can be divided into sub-tasks forming a pipeline, as proposed in [3], with three sub-tasks: argumentative sentence detection, argument component boundary detection, and argument structure prediction. Argumentative sentence detection is viewed as a classification problem in which sentences are assigned to one of two classes (argumentative, not argumentative). Argument component boundary detection is treated as a segmentation problem and may be cast either as a multi-class classification problem (classify each component) or as a set of binary classification problems (one classifier per component), solved using machine learning classifiers.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 355–367, 2019.
https://doi.org/10.1007/978-3-030-35514-2_26
Units (ReLU) using stochastic gradient descent (SGD) with a mini-batch size of 128. The input of the model is a sequence of characters encoded with one-hot encoding; the encoded characters form n vectors of fixed length l0. The model proposed in [9] consists of 9 layers: 6 convolutional layers and 3 fully connected layers. Two versions were presented: (i) a small version with an input length of 256 and (ii) a large version with an input length of 1024.
Word-based CNNs, such as Kim's model [8], compute multi-dimensional convolutions (over an n × k matrix) followed by max-over-time pooling. For the representation of a sentence, two input channels are proposed (static and non-static). Kim's model is composed of a layer that performs convolutions over the embedded word vectors predefined in Word2Vec; max-pooling is then applied to the output of the convolutional layer and, as in character-level models, Rectified Linear Units (ReLU) are applied.
[Figure: the ArguWeb pipeline — web scraping → text → sentence segmentation → text segments → argumentative sentence detection → argumentative segments → claim detection and premise detection → argument components <claim, premises>]
Using web scraping techniques, we extract users' comments from several sources (e.g., social media, online forums). This step is carried out by a set of scrapers developed in Python. The extracted data then enters the core of our system, which is a set of trained models. More details about the argument component detection phase are given in the next sub-section.
Existing AM contributions treat different input granularities (paragraph, sentence, intra-sentence), and most existing work focuses on the sentence and intra-sentence levels [3]. In this work we focus on the sentence level and assume that a whole sentence coincides with an argument component (claim or premise). Therefore, after scraping users' comments from the web, text segmentation is required: collected comments are segmented into sentences using the sentence tokenizer of the Natural Language Toolkit [20].
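The sentence segmentation step can be sketched as follows. This is a simplified, regex-based stand-in for NLTK's sent_tokenize, not the toolkit's actual implementation, which handles abbreviations and many other cases.

```python
import re

def split_sentences(comment: str) -> list[str]:
    # Simplified stand-in for NLTK's sent_tokenize: split after
    # sentence-final punctuation followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', comment.strip())
    return [p for p in parts if p]

segments = split_sentences(
    "Electric cars are the future. They cut emissions! Do you agree?")
# Three segments, each a candidate argument component.
```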
Claims are the defended ideas, while premises provide the explanations, proofs, facts, etc. that back up the claims.
Figure 2 depicts the steps of the argument component detection process, which consists in separating argumentative from non-argumentative sentences, detecting claims in argumentative sentences, and then detecting the premises presented to back up each claim. For the three tasks, word-based and character-based CNNs are trained on three different corpora.
For the character-level CNN, an alphabet containing all letters, digits, and special characters is used to encode the input text; each input character is quantized using one-hot encoding. Stochastic gradient descent with mini-batches of size 32 is used as the optimizer. For the word-based CNN, we do not use the two input channels proposed in the original paper; instead, we use a single channel. Word embedding is performed by an additional first layer that embeds words into low-dimensional vectors. We use the Adam algorithm as the optimizer.
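The one-hot character quantization described above can be sketched as follows. The alphabet here is a small illustrative subset, not the full alphabet used in the paper, and the fixed input length is the large-version value of 1024 by default.

```python
import numpy as np

# Hypothetical reduced alphabet for illustration; the paper's alphabet
# covers all letters, digits, and special characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot_encode(text: str, length: int = 1024) -> np.ndarray:
    """Quantize each character as a one-hot vector; unknown characters
    and padding positions beyond the text stay all-zero."""
    x = np.zeros((length, len(ALPHABET)), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:length]):
        idx = CHAR_INDEX.get(char)
        if idx is not None:
            x[pos, idx] = 1.0
    return x

x = one_hot_encode("Hi!", length=8)  # an 8 x 41 matrix, rows 3-7 all zero
```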
4 Experimental Study
– Persuasive Essays corpus [13]: over 400 persuasive essays written by students. All essays were segmented by three expert annotators into three types of argument units (major claim, claim, and premise). In this paper we follow a claim/premise argumentation model, so to ensure comparability between data sets, each major claim is treated as a claim. This corpus contains domain-specific data (argumentative essays). The first task addressed in this paper is argumentative sentence/text detection; for this purpose, we added non-argumentative data to the persuasive essays corpus: descriptive and narrative texts extracted from Academic Help2 and descriptive short stories from The Short Story website3.
2 https://academichelp.net/.
3 https://theshortstory.co.uk.
– Web Discourse corpus [16]: contains 990 comments and forum posts labeled as persuasive or non-persuasive, and 340 documents annotated with the extended Toulmin model (claim, grounds, backing, rebuttal, and refutation). For this data set we treat grounds and backing as premises, since they back up the claim, while rebuttals and refutations are ignored. This corpus contains domain-free data.
– N-domain corpus: we construct a third corpus by combining the persuasive essays and web-discourse corpora, making the data even more heterogeneous, with the goal of investigating the performance of the different models in a multi-domain context.
To train the SVM and Naïve Bayes classifiers, a data pre-processing phase and a set of features (e.g., semantic, syntactic) are required. For this purpose, we tokenize all corpora into words and apply both lemmatization and stemming; a word such as “studies” thus yields the stem “studi” and the lemma “study”, which gives an idea of the meaning and role of the word. Before lemmatization and stemming, we use POS (part-of-speech) tagging to indicate the grammatical category of each word in a sentence (noun, verb, adjective, adverb). We also use the well-known TF-IDF technique [25], which outperforms bag-of-words techniques: TF is the term frequency of a word in a sentence, and IDF (inverse document frequency) measures a word's rarity in the vocabulary of each corpus.
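The TF-IDF weighting described above can be sketched in a few lines. This is a minimal illustration of the TF × IDF product on toy token lists, not necessarily the exact variant (smoothing, normalization) used in the experiments.

```python
import math
from collections import Counter

def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    # TF: relative frequency of a word within one tokenized sentence.
    # IDF: log of (number of documents / documents containing the word),
    # measuring the word's rarity across the corpus.
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n_docs / df[w])
                       for w in tf})
    return scores

docs = [["study", "hard"], ["study", "argument"], ["argument", "mining"]]
weights = tf_idf(docs)
# "hard" appears in one document out of three, so it gets the largest IDF.
```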
For the word-based CNN, we only pad each sentence to the maximum sentence length, in order to batch data efficiently, and build a vocabulary index used to encode each sentence as a vector of integers. For the character-level CNN, we only remove URLs and hashtags from the original data.
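The padding and vocabulary-index step can be sketched as follows; reserving index 0 for the padding token is our assumption, one common convention.

```python
def build_vocab(sentences):
    # Index 0 is reserved for padding (an assumption, not stated
    # in the paper); real words start at index 1.
    vocab = {"<pad>": 0}
    for sent in sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))
    return vocab

def pad_and_encode(sentences, vocab):
    # Encode each sentence as integers and pad to the max length,
    # so the batch is a rectangular array.
    max_len = max(len(s) for s in sentences)
    return [[vocab[w] for w in s] + [0] * (max_len - len(s))
            for s in sentences]

sents = [["cars", "pollute"], ["trains", "are", "greener"]]
vocab = build_vocab(sents)
batch = pad_and_encode(sents, vocab)  # [[1, 2, 0], [3, 4, 5]]
```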
premises. In the cross-domain case, six models of each classifier are trained for each sub-task (argument detection, claim detection, and premise detection), where each model is trained on one corpus and tested on another; this guarantees the cross-domain setting.
To train the char-level CNN models, we start with a learning rate of 0.01, halved every 3 epochs. For all corpora, the number of epochs is fixed at 10, but training stops early if the validation loss does not improve for 3 consecutive epochs. The loss is the cross-entropy loss. Two versions of the char-level CNN are used: the small version (input length 256) for in-domain training and the large version (input length 1024) for cross-domain training. For the word-based CNN models, we use the same cross-entropy loss, optimized with the Adam optimizer. We use a 128-dimensional embedding, filter sizes of 3, 4, and 5, and a mini-batch size of 128. In addition, to avoid overfitting, L2 regularization (with the constraint set to 5) and a dropout rate of 0.5 are applied. The final classification is produced by a softmax layer. In this paper, we do not use pre-trained Word2Vec word vectors; instead, the word embeddings are learned from scratch.
To train the char-based CNN, SVM, and Naïve Bayes models, we split each data set into 80% for training and 20% for validation, while for the word-based CNN models we use an 80%/10% split, following the original word-based CNN paper [8].
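The learning-rate schedule and early-stopping criterion described above can be sketched as follows. This is a minimal reading of the text; the exact bookkeeping used in the experiments may differ.

```python
def learning_rate(epoch: int, base_lr: float = 0.01) -> float:
    # Start at 0.01 and halve the learning rate every 3 epochs,
    # as in the char-level CNN training schedule.
    return base_lr * 0.5 ** (epoch // 3)

def should_stop(val_losses: list, patience: int = 3) -> bool:
    # Stop when the validation loss has not improved for
    # `patience` consecutive epochs.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])

lr5 = learning_rate(5)  # epochs 3-5 run at 0.005
```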
and

Precision = (1/n) Σᵢ₌₁ⁿ TruePositiveᵢ / (TruePositiveᵢ + FalsePositiveᵢ)   (3)
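Eq. (3), the macro-averaged precision, can be computed directly from per-class counts:

```python
def macro_precision(tp: list, fp: list) -> float:
    # Eq. (3): per-class precision TP_i / (TP_i + FP_i),
    # averaged over the n classes.
    n = len(tp)
    return sum(t / (t + f) for t, f in zip(tp, fp)) / n

# Two classes: precisions 8/10 = 0.8 and 6/8 = 0.75, macro average 0.775.
p = macro_precision(tp=[8, 6], fp=[2, 2])
```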
Table 2. The in-domain and cross-domain macro F1-scores. Each row reports the results of one model (character-level CNN, word-based CNN, SVM, and Naïve Bayes); the highest value is marked in bold. Within each “Test on” group, the three columns correspond to the training corpus: Essays (Ess.), Web-discourse (Web.), and n-domain (n-dom.).

                         Test on Essays       Test on Web-discourse  Test on n-domain
Task  Model              Ess.  Web.  n-dom.   Ess.  Web.  n-dom.     Ess.  Web.  n-dom.
AD    Char-level CNN     0.72  0.93  0.56     0.56  0.52  0.81       0.51  0.46  0.62
      Word-based CNN     0.74  0.35  0.33     0.33  0.39  0.83       0.33  0.36  0.44
      SVM                0.66  0.30  0.88     0.37  0.54  0.61       0.68  0.29  0.83
      Naïve Bayes        0.11  0.63  0.57     0.25  0.34  0.37       0.36  0.63  0.75
CD    Char-level CNN     0.83  0.59  0.57     0.28  0.54  0.88       0.45  0.49  0.82
      Word-based CNN     0.98  0.34  0.69     0.59  0.45  0.51       0.61  0.34  0.44
      SVM                0.61  0.55  0.37     0.48  0.43  0.68       0.47  0.55  0.89
      Naïve Bayes        0.50  0.10  0.12     0.26  0.49  0.30       0.24  0.11  0.59
PD    Char-level CNN     0.80  0.98  0.80     0.51  0.78  0.75       0.59  0.91  0.82
      Word-based CNN     0.43  0.35  0.44     0.37  0.60  0.33       0.33  0.85  0.78
      SVM                0.44  0.06  0.12     0.43  0.71  0.89       0.35  0.84  0.69
      Naïve Bayes        0.39  0.10  0.16     0.34  0.57  0.87       0.35  0.80  0.62
classified as a claim (resp. premise) if at least six of the CNN models labeled it as a claim (resp. premise).
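The majority-vote rule above can be sketched as follows. The threshold of six agreeing models follows the text, while the label names and the order in which labels are checked are our assumptions.

```python
def ensemble_label(votes: list, threshold: int = 6) -> str:
    # A segment is labeled "claim" (or "premise") only if at least
    # `threshold` of the CNN models agree on that label; otherwise
    # it is neither. The check order is an assumption.
    for label in ("claim", "premise"):
        if votes.count(label) >= threshold:
            return label
    return "none"

label = ensemble_label(["claim"] * 7 + ["premise"] * 2)  # "claim"
```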
Figure 3a and b show examples of comments extracted from Quora, and Fig. 3c depicts a comment extracted from the Reddit forum platform; these comments were detected as arguments. Figure 3a contains a comment extracted from an argument between several Quora users, Fig. 3b a comment of a user convincing another about Japanese cars, and Fig. 3c a comment of a user arguing about why iPhone and Samsung users hate on each other. Once arguments are detected, ArguWeb classifies each comment's components as claims, premises, or neither; Fig. 3 details the detected components (claims and premises) of these arguments. Uncolored text segments were classified neither as claims nor as premises.
Comments like “As announced by YouTube Music! Congrats, Taylor!!!” were classified from the start as non-argumentative and were not processed by the models responsible for detecting the different components.
5 Conclusion
This paper proposes ArguWeb, a cross-domain framework for argument detection on the web. The framework is based on a set of web scrapers that extract users' comments from the web (e.g., social media, online forums). Extracted data is classified (1) as argumentative or not, and (2) into claims, premises, or neither, using character-level and word-based convolutional neural networks. An experimental study is conducted where both
References
1. Stab, C., Gurevych, I.: Identifying argumentative discourse structures in persuasive
essays. In: Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pp. 46–56 (2014)
2. Cabrio, E., Villata, S.: Combining textual entailment and argumentation theory
for supporting online debates interactions. In: Proceedings of the 50th Annual
Meeting of the Association for Computational Linguistics (vol. 2: Short Papers),
pp. 208–212 (2012)
3. Lippi, M., Torroni, P.: Argumentation mining: state of the art and emerging trends.
ACM Trans. Internet Technol. (TOIT) 16(2), 10 (2016)
4. Lippi, M., Torroni, P.: MARGOT: a web server for argumentation mining. Expert
Syst. Appl. 65, 292–303 (2016)
5. Aker, A., et al.: What works and what does not: classifier and feature analysis for
argument mining. In: Proceedings of the 4th Workshop on Argument Mining, pp.
91–96 (2017)
6. Hua, X., Nikolov, M., Badugu, N., Wang, L.: Argument mining for understanding
peer reviews. arXiv preprint. arXiv:1903.10104 (2019)
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: Proceedings of Advances in Neural Information
Processing Systems, pp. 1097–1105 (2012)
8. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint.
arXiv:1408.5882 (2014)
9. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text
classification. In: Proceedings of Advances in Neural Information Processing Sys-
tems, pp. 649–657 (2015)
10. Dos Santos, C., Gatti, M.: Deep convolutional neural networks for sentiment anal-
ysis of short texts. In: Proceedings of COLING 2014, the 25th International Con-
ference on Computational Linguistics: Technical Papers, pp. 69–78 (2014)
11. Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language
models. In: Proceedings of 30th AAAI Conference on Artificial Intelligence (2016)
12. Laha, A., Raykar, V.: An empirical evaluation of various deep learning architectures
for bi-sequence classification tasks. arXiv preprint. arXiv:1607.04853 (2016)
13. Stab, C., Gurevych, I.: Parsing argumentation structures in persuasive essays.
Comput. Linguist. 43(3), 619–659 (2017)
14. Daxenberger, J., Eger, S., Habernal, I., Stab, C., Gurevych, I.: What is the essence
of a claim? Cross-domain claim identification. arXiv preprint. arXiv:1704.07203
(2017)
15. Lugini, L., Litman, D.: Argument component classification for classroom discus-
sions. In: Proceedings of the 5th Workshop on Argument Mining, pp. 57–67 (2018)
16. Habernal, I., Eckle-Kohler, J., Gurevych, I.: Argumentation mining on the web
from information seeking perspective. In: Proceedings of ArgNLP (2014)
17. Abdel-Hamid, O., Mohamed, A.R., Jiang, H., Penn, G.: Applying convolutional
neural networks concepts to hybrid NN-HMM model for speech recognition. In:
Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 4277–4280 (2012)
18. Al-Khatib, K., Wachsmuth, H., Hagen, M., Köhler, J., Stein, B.: Cross-domain
mining of argumentative text through distant supervision. In: Proceedings of the
2016 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, pp. 1395–1404 (2016)
19. Wachsmuth, H., et al.: Building an argument search engine for the web. In: Pro-
ceedings of the 4th Workshop on Argument Mining, pp. 49–59 (2017)
20. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing
Text with the Natural Language Toolkit. O’Reilly Media Inc., Newton (2009)
21. Li, M., Gao, Y., Wen, H., Du, Y., Liu, H., Wang, H.: Joint RNN model for argument
component boundary detection. In: Proceeding of the 2017 IEEE International
Conference on Systems, Man, and Cybernetics (SMC), pp. 57–62 (2017)
22. Ajjour, Y., Chen, W.F., Kiesel, J., Wachsmuth, H., Stein, B.: Unit segmentation of
argumentative texts. In: Proceedings of the 4th Workshop on Argument Mining,
pp. 118–128 (2017)
23. Stab, C., Miller, T., Gurevych, I.: Cross-topic argument mining from heterogeneous
sources using attention-based neural networks. arXiv preprint. arXiv:1802.05758
(2018)
24. Guggilla, C., Miller, T., Gurevych, I.: CNN-and LSTM-based claim classification
in online user comments. In: Proceedings of COLING 2016, the 26th International
Conference on Computational Linguistics: Technical Papers, pp. 2740–2751 (2016)
25. Sparck Jones, K.: A statistical interpretation of term specificity and its application
in retrieval. J. Doc. 28(1), 11–21 (1972)
ConvNet and Dempster-Shafer Theory
for Object Recognition
1 Introduction
Dempster-Shafer (DS) theory of belief functions [3,24] has been widely used for
reasoning and making decisions with uncertainty [29]. DS theory is based on
representing independent pieces of evidence by completely monotone capacities
and aggregating them using Dempster’s rule. In the past decades, DS theory has
been applied to pattern recognition and supervised classification in three main
directions. The first one is classifier fusion, in which classifier outputs are con-
verted into mass functions and fused by Dempster’s rule (e.g., [2,19]). Another
direction is evidential calibration: the decisions of classifiers are transformed into
This research was carried out in the framework of the Labex MS2T, which was funded
by the French Government, through the program “Investments for the future” managed
by the National Agency for Research (Reference ANR-11-IDEX-0004-02). It was also
supported by a scholarship from the China Scholarship Council.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 368–381, 2019.
https://doi.org/10.1007/978-3-030-35514-2_27
mass functions (e.g., [20,28]). The last approach is to design evidential classifiers
(e.g., [6]), which represent the evidence of each feature as elementary mass func-
tions and combine them by Dempster’s rule. The combined mass functions are
then used for decision making [5]. Compared with conventional classifiers, evi-
dential classifiers can provide more informative outputs, which can be exploited
for uncertainty quantification and novelty detection. Several principles have been
proposed to design evidential classifiers, mainly including the evidential k-nearest
neighbor rule [4,9], and evidential neural network classifiers [6]. In practice, the
performance of evidential classifiers heavily depends on two factors: the training
set size and the reliability of object representation. With the development of the
“Big Data” age, the number of examples in benchmark datasets for supervised
algorithms has increased from 10^2 to 10^5 [14] and even 10^9 [21]. However, little
has been done to combine recent techniques for object representation with DS
theory.
Thanks to the explosive development of deep learning [15] and its applications
[14,25], several approaches for object representation have been developed, such
as restricted Boltzmann machines [1], deep autoencoders [26,27], deep belief net-
works [22,23], and convolutional neural networks (ConvNets) [12,17]. ConvNet,
which is maybe the most promising model and the main focus of this paper,
mainly consists of convolutional layers, pooling layers, and fully connected lay-
ers. It has been proved that ConvNets have the ability to extract local features
and compute global features, such as from edges to corners and contours to
object parts. In general, robustness and automation are two desirable properties
of ConvNets for object representation. Robustness means strong tolerance to
translation and distortion in deep representation, while automation implies that
object representation is data-driven with no human assistance.
Motivated by recent advances in DS theory and deep learning, we propose to
combine ConvNet and DS theory for object recognition allowing for ambiguous
pattern rejection. In this approach, a ConvNet with nonlinear convolutional
layers and a global pooling layer is used to extract high-order features from
input data. Then, the features are imported into a belief function classifier, in
which they are converted into Dempster-Shafer mass functions and aggregated by
Dempster’s rule. Finally, evidence-theoretic rules are used for pattern recognition
and rejection based on the aggregated mass functions. The performances of this
classifier on the CIFAR-10, CIFAR-100, and MNIST datasets are demonstrated
and discussed.
The organization of the rest of this paper is as follows. Background knowledge
on DS theory and ConvNet is recalled in Sect. 2. The new combination between
DS theory and ConvNet is then established in Sect. 3, and numerical experiments
are reported in Sect. 4. Finally, we conclude the paper in Sect. 5.
2 Background
In this section, we first recall some necessary definitions regarding the DS theory
and belief function classifier (Sect. 2.1). We then provide a description of the
Evidence Theory. The main concepts regarding DS theory are briefly presented
in this section, and some basic notations are introduced. Detailed information can
be found in Shafer’s original work [24] and some up-to-date studies [8].
Given a finite set Ω = {ω1, ..., ωk}, called the frame of discernment, a mass function is a function m from 2^Ω to [0, 1] verifying m(∅) = 0 and

Σ_{A⊆Ω} m(A) = 1.   (1)
For any A ⊆ Ω, given a certain piece of evidence, m(A) can be regarded as the
belief that one is willing to commit to A. Set A is called a focal element of m
when m(A) > 0.
For all A ⊆ Ω, a credibility function bel and a plausibility function pl, associated with m, are defined as

bel(A) = Σ_{B⊆A} m(B)   (2)

pl(A) = Σ_{B∩A≠∅} m(B).   (3)
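Equations (1)-(3) can be illustrated with mass functions represented as dictionaries mapping focal elements (frozensets) to masses; the frame and the numbers below are made up for illustration.

```python
def bel(m: dict, a: frozenset) -> float:
    # Credibility (Eq. 2): total mass committed to subsets of A.
    return sum(v for b, v in m.items() if b <= a)

def pl(m: dict, a: frozenset) -> float:
    # Plausibility (Eq. 3): total mass of focal elements
    # intersecting A (i.e., not contradicting A).
    return sum(v for b, v in m.items() if b & a)

omega = frozenset({"cat", "dog", "frog"})
m = {frozenset({"cat"}): 0.5, frozenset({"cat", "dog"}): 0.3, omega: 0.2}
b = bel(m, frozenset({"cat", "dog"}))  # 0.5 + 0.3 = 0.8
p = pl(m, frozenset({"dog"}))          # 0.3 + 0.2 = 0.5
```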
for all A ≠ ∅ and (m1 ⊕ m2)(∅) = 0. Mass functions m1 and m2 can be combined if and only if the denominator on the right-hand side of (4) is strictly positive. The operator ⊕ is commutative and associative.
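Dempster's rule of combination (Eq. 4) can be sketched in the same dictionary representation; the assertion guards against total conflict, where the normalizing denominator 1 − K vanishes and the two mass functions are not combinable.

```python
def dempster(m1: dict, m2: dict) -> dict:
    # Conjunctive combination: multiply masses of every pair of focal
    # elements; mass on empty intersections is the conflict K, and the
    # rest is renormalized by 1 - K.
    combined, conflict = {}, 0.0
    for a, v1 in m1.items():
        for b, v2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    assert conflict < 1.0, "m1 and m2 are not combinable"
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

omega = frozenset({"a", "b"})
m1 = {frozenset({"a"}): 0.6, omega: 0.4}
m2 = {frozenset({"b"}): 0.5, omega: 0.5}
m12 = dempster(m1, m2)  # conflict K = 0.3, so masses are divided by 0.7
```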
[Fig.: (a) Architecture of a belief function classifier; (b) connection between layers L2 and L3]
μ^i_{M+1} = μ^{i−1}_{M+1} m^i({Ω}),  i = 2, ..., n.   (7c)

The classifier output m = (m({ω1}), ..., m({ωM}), m(Ω))^T is finally obtained as m = μ^n.
Otherwise, the pattern is assigned to class ωj with j = arg maxk=1,··· ,M m({ωk }).
For the maximum plausibility and maximum pignistic probability rules, rejection
is possible if and only if 0 ≤ λ0 ≤ 1 − 1/M , whereas a rejection action for the
maximum credibility rule only requires 0 ≤ λ0 ≤ 1.
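A decision rule with rejection can be sketched as follows. Reading the rejection condition as “reject when the largest singleton mass falls below 1 − λ0” mirrors the probability-based rule later quoted for the NIN baseline; this reading is our assumption, not the paper's exact evidence-theoretic formulation.

```python
def classify_with_rejection(m_singletons: dict, lambda0: float):
    # Pick the class with the largest singleton mass m({omega_j});
    # reject the pattern (return None) when even that mass is below
    # 1 - lambda0 (assumed threshold semantics).
    best = max(m_singletons, key=m_singletons.get)
    if m_singletons[best] < 1.0 - lambda0:
        return None
    return best

d1 = classify_with_rejection({"cat": 0.55, "dog": 0.35}, lambda0=0.5)
d2 = classify_with_rejection({"cat": 0.45, "dog": 0.40}, lambda0=0.5)
# d1 accepts "cat" (0.55 >= 0.5); d2 rejects (0.45 < 0.5).
```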
f^1_{i,j,k} = ReLU((w^1_k)^T · x + b^1_k),  k = 1, ..., C   (8a)
...
f^m_{i,j,k} = ReLU((w^m_k)^T · f^{m−1}_{i,j} + b^m_k),  k = 1, ..., C.   (8b)
3 ConvNet-BF Classifier
In this section, we present a method to combine a belief function classifier and a
ConvNet for object recognition allowing for ambiguous pattern rejection. The
and the weights are normalized by introducing new parameters ζ^i_k (ζ^i_k ∈ R) as

w^i_k = (ζ^i_k)^2 / Σ_{l=1}^{P} (ζ^i_l)^2.   (11)
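Eq. (11) can be checked numerically: squaring keeps each weight non-negative and the normalization makes them sum to one, so the ζ parameters can be optimized without constraints.

```python
def normalize_weights(zeta: list) -> list:
    # Eq. (11): w_k = zeta_k^2 / sum_l zeta_l^2. The result is a
    # non-negative weight vector summing to one, whatever the signs
    # of the unconstrained parameters zeta.
    squares = [z * z for z in zeta]
    total = sum(squares)
    return [s / total for s in squares]

w = normalize_weights([1.0, -2.0, 1.0])  # [1/6, 4/6, 1/6]
```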
3.2 Learning
The proposed learning strategy to train a ConvNet-BF classifier consists of two parts: (a) an end-to-end training method to train the ConvNet and the belief function classifier simultaneously; (b) a data-driven method to select λ0.
∂Eν(x)/∂p^i_k = (∂Eν(x)/∂s^i) · (∂s^i/∂p^i_k) = (∂Eν(x)/∂s^i) · 2(η^i)^2 s^i w^i_k (x_k − p^i_k),  k = 1, ..., P,   (13)

∂Eν(x)/∂x_k = (∂Eν(x)/∂s^i) · (∂s^i/∂x_k) = −(∂Eν(x)/∂s^i) · 2(η^i)^2 s^i w^i_k (x_k − p^i_k),  k = 1, ..., P,   (16)
∂Eν(x)/∂w^m_{i,j,k} = (∂Eν(x)/∂f^m_{i,j,k}) · (∂f^m_{i,j,k}/∂w^m_{i,j,k}) = w^m_{i,j,k} · ∂Eν(x)/∂f^m_{i,j,k},  k = 1, ..., P,   (17)

and

∂Eν(x)/∂b^m_k = (∂Eν(x)/∂f^m_{i,j,k}) · (∂f^m_{i,j,k}/∂b^m_k) = ∂Eν(x)/∂f^m_{i,j,k},  k = 1, ..., P   (18)
with

∂Eν(x)/∂f^m_{i,j,k} = (∂Eν(x)/∂x_k) · (∂x_k/∂f^m_{i,j,k}) = (1/(W · H)) · ∂Eν(x)/∂x_k,  k = 1, ..., P.   (19)

Here, w^m_{i,j,k} is a component of the weight matrix w^m_k, while f^m_{i,j,k} is a component of the vector f^m_{i,j} in Eq. (8).
4 Numerical Experiments
In this section, we evaluate ConvNet-BF classifiers on three benchmark datasets:
CIFAR-10 [13], CIFAR-100 [13], and MNIST [16]. To compare with traditional
ConvNets, the architectures and training strategies of the ConvNet parts in
ConvNet-BF classifiers are the same as those used in the study of Lin et al.,
called NIN [18]. Feature vectors from the ConvNet parts are imported into a
belief function classifier in our method, while they are directly injected into
softmax layers in NINs.
In order to make a fair comparison, a probability-based rejection rule is
adopted for NINs as maxj=1,··· ,M pj < 1 − λ0 , where pj is the output probability
of NINs.
4.1 CIFAR-10
The CIFAR-10 dataset [13] is made up of 60,000 RGB images of size 32 × 32
partitioned in 10 classes. There are 50,000 training images, and we randomly
selected 10,000 images as validation data for the ConvNet-BF classifier. We
then randomly used 10,000 images of the training set to determine λ0 .
The test set error rates without rejection of the ConvNet-BF and NIN classifiers are 9.46% and 9.21%, respectively. The difference is small but statistically significant according to McNemar's test (p-value: 0.012). “Without rejection” means that we only consider max_{j=1,...,M} p_j and max_{j=1,...,M} m({ωj}); if the selected class is not the correct one, we count an error. It turns out in our experiment that using a belief function classifier instead of a softmax layer only slightly impacts the classifier's performance.
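McNemar's test compares two paired classifiers through their discordant errors, i.e., test patterns misclassified by one model but not the other. A minimal exact (binomial) version can be sketched as follows; the discordant counts are made up, not taken from the paper.

```python
from math import comb

def mcnemar_exact(n01: int, n10: int) -> float:
    # Exact two-sided McNemar test: under the null hypothesis the
    # n01 + n10 discordant pairs split Binomial(n, 1/2); the p-value
    # doubles the smaller tail (capped at 1).
    n = n01 + n10
    k = min(n01, n10)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical counts: model A wrong / B right on 12 patterns, the
# reverse on 30 patterns.
p_value = mcnemar_exact(n01=12, n10=30)
```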
The test set error rates with rejection of the two models are presented in Fig. 4a; a rejection decision is not counted as an incorrect classification. When the rejection rate increases, the test set error decreases, which shows that the belief function classifier rejects part of the incorrect classifications. However, the error decreases only slightly once the rejection rate exceeds 7.5%, which shows that the belief function classifier rejects more and more correctly classified patterns as the rejection rate grows. Thus, a satisfactory λ0 should be chosen so that the ConvNet-BF classifier achieves a desirable accuracy with a low correct-rejection rate. Additionally, compared with the NIN, the ConvNet-BF classifier rejects significantly more incorrectly classified patterns; for example, the p-value of McNemar's test for the difference in error rates between the two classifiers at a 5.0% rejection rate is close to 0. We can conclude that a belief function classifier with an evidence-theoretic rejection rule is more suitable for decisions allowing pattern rejection than a softmax layer with the probability-based rejection rule.
Table 1 presents the confusion matrix of the ConvNet-BF classifier with the maximum credibility rule, at a rejection rate of 5.0%. The ConvNet-BF classifier tends to select rejection when there are two or more similar patterns, such as dog and cat, which can lead to incorrect classification. From the viewpoint of evidence theory, the ConvNet part provides conflicting evidence when two or more similar patterns exist; maximally conflicting evidence corresponds to m({ωi}) = m({ωj}) = 0.5 [7]. Additionally, the mass m(Ω) makes it possible to check whether the model is well trained, since m(Ω) = 1 when the ConvNet part cannot provide any useful evidence.
4.2 CIFAR-100
The CIFAR-100 dataset [13] has the same size and format as the CIFAR-10 dataset, but it contains 100 classes; the number of images per class is thus only 600. For CIFAR-100, we also randomly selected 10,000 images of the training set to determine λ0. The ConvNet-BF and NIN classifiers achieved, respectively,
cally significant difference (p-value: 0.014). Similarly to CIFAR-10, it turns out
that the belief function classifier has a similar error rate as a network with a
softmax layer. Figure 4b shows the test set error rates with rejection for the two
models. Compared with the rejection performance in CIFAR-10, the ConvNet-
BF classifier rejects more incorrect classification results. We can conclude that
the evidence-theoretic classifier still performs well when the classification task is
difficult and the training set is not adequate. Similarly, Table 2 shows that the
Fig. 4. Rejection-error curves: CIFAR-10 (a), CIFAR-100 (b), and MNIST (c)
Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck
Airplane - 0.03 0.03 0.01 0.02 0.05 0.04 0.01 0.04 0.05
Automobile 0 - 0.04 0.04 0.08 0.08 0.04 0.06 0.03 0.07
Bird 0.02 0.04 - 0.05 0.04 0.07 0.03 0.08 0 0.04
Cat 0.02 0.03 0.13 - 0.06 0.44 0.11 0.04 0.05 0.06
Deer 0.01 0.04 0.07 0.12 - 0.03 0.12 0.34 0.04 0.08
Dog 0.02 0.03 0.05 0.49 0.11 - 0.06 0.09 0.01 0.04
Frog 0.02 0.04 0.08 0.06 0.12 0.06 - 0.06 0.06 0.05
Horse 0.01 0.02 0.04 0.06 0.31 0.10 0.04 - 0.04 0.04
Ship 0.04 0.05 0.02 0.04 0.12 0.05 0.04 0.18 - 0.02
Truck 0.02 0 0.06 0.09 0.03 0.06 0.07 0.06 0.04 -
Rejection 0.20 0.13 0.14 1.05 0.84 1.07 0.14 1.14 0.18 0.11
ConvNet-BF classifier tends to select the rejection action when two classes are similar, in which case m({ωi}) ≈ m({ωj}). In contrast, the classifier tends to produce m(Ω) ≈ 1 when the model is not well trained because of an inadequate training set.
4.3 MNIST
The MNIST database of handwritten digits consists of a training set of 60,000
examples and a test set of 10,000 examples. The training strategy for the
ConvNet-BF classifier was the same as the strategy in CIFAR-10 and CIFAR-
100. The test set error rates without rejection of the two models are close (0.88% and 0.82%), and the difference is only weakly significant (p-value: 0.077). Again, using a belief function classifier instead of a softmax layer introduces no negative effect on the network for MNIST. The test set error rates with rejection of the two models are shown
in Fig. 4c. The ConvNet-BF classifier rejects a small number of classification
results because the feature vectors provided by the ConvNet part include little
confusing information.
5 Conclusion
In this work, we proposed a novel classifier based on ConvNet and DS theory for
object recognition allowing for ambiguous pattern rejection, called “ConvNet-BF
classifier”. This new structure consists of a ConvNet with nonlinear convolutional
layers and a global pooling layer to extract high-dimensional features and a belief
function classifier to convert the features into Dempster-Shafer mass functions.
The mass functions can be used for classification or rejection based on evidence-
theoretic rules. Additionally, the novel classifier can be trained in an end-to-end
way.
The use of belief function classifiers in ConvNets had no negative effect on the
classification performances on the CIFAR-10, CIFAR-100, and MNIST datasets.
The combination of belief function classifiers and ConvNets can reduce errors
by rejecting part of the incorrect classifications. This provides a new direction
for improving the performance of deep learning for object recognition. The classifier
380 Z. Tong et al.
is prone to select the rejection action when there are conflicting features, which
easily yield incorrect classifications in traditional ConvNets. In addition, the
proposed method opens a way to explain the relationship between the features
extracted in the convolutional layers and the class membership of each pattern. The mass
m(Ω) assigned to the set of classes makes it possible to verify whether a
ConvNet is well trained or not.
References
1. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn.
2(1), 1–127 (2009)
2. Bi, Y.: The impact of diversity on the accuracy of evidential classifier ensembles.
Int. J. Approximate Reasoning 53(4), 584–607 (2012)
3. Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping.
In: Yager, R.R., Liu, L. (eds.) Classic Works of the Dempster-Shafer Theory of
Belief Functions. STUDFUZZ, vol. 219, pp. 57–72. Springer, Heidelberg (2008).
https://doi.org/10.1007/978-3-540-44792-4_3
4. Denœux, T.: A k-nearest neighbor classification rule based on Dempster-Shafer
theory. IEEE Trans. Syst. Man Cybern. 25(5), 804–813 (1995)
5. Denœux, T.: Analysis of evidence-theoretic decision rules for pattern classification.
Pattern Recogn. 30(7), 1095–1107 (1997)
6. Denœux, T.: A neural network classifier based on Dempster-Shafer theory. IEEE
Trans. Syst. Man Cybern. Part A Syst. Hum. 30(2), 131–150 (2000)
7. Denœux, T.: Logistic regression, neural networks and Dempster-Shafer theory: a
new perspective. Knowl.-Based Syst. 176, 54–67 (2019)
8. Denœux, T., Dubois, D., Prade, H.: Representations of uncertainty in artificial
intelligence: beyond probability and possibility. In: Marquis, P., Papini, O., Prade,
H. (eds.) A Guided Tour of Artificial Intelligence Research, Chap. 4. Springer
(2019)
9. Denœux, T., Kanjanatarakul, O., Sriboonchitta, S.: A new evidential K-nearest
neighbor rule based on contextual discounting with partially supervised learning.
Int. J. Approximate Reasoning 113, 287–302 (2019)
10. Gomez, A.N., Zhang, I., Swersky, K., Gal, Y., Hinton, G.E.: Targeted dropout. In:
CDNNRIA Workshop at the 32nd Conference on Neural Information Processing
Systems (NeurIPS 2018), Montréal (2018)
11. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
Improving neural networks by preventing co-adaptation of feature detectors. arXiv
preprint arXiv:1207.0580 (2012)
12. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings
of the 2014 Conference on Empirical Methods in Natural Language Processing,
Doha, pp. 1746–1751 (2014)
13. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images.
University of Toronto, Technical report (2009)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. Commun. ACM 60(6), 84–90 (2017)
15. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
16. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al.: Gradient-based learning
applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
17. Leng, B., Liu, Y., Yu, K., Zhang, X., Xiong, Z.: 3D object understanding with 3D
convolutional neural networks. Inf. Sci. 366, 188–201 (2016)
18. Lin, M., Chen, Q., Yan, S.: Network in network. In: International Conference on
Learning Representations (ICLR 2014), Banff, pp. 1–10 (2014)
19. Liu, Z., Pan, Q., Dezert, J., Han, J.W., He, Y.: Classifier fusion with contextual
reliability evaluation. IEEE Trans. Cybern. 48(5), 1605–1618 (2018)
20. Minary, P., Pichon, F., Mercier, D., Lefevre, E., Droit, B.: Face pixel detection
using evidential calibration and fusion. Int. J. Approximate Reasoning 91, 202–
215 (2017)
21. Sakaguchi, K., Post, M., Van Durme, B.: Efficient elicitation of annotations for
human evaluation of machine translation. In: Proceedings of the Ninth Workshop
on Statistical Machine Translation, Baltimore, pp. 1–11 (2014)
22. Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: Artificial Intelligence
and Statistics, Florida, pp. 448–455 (2009)
23. Salakhutdinov, R., Tenenbaum, J.B., Torralba, A.: Learning with hierarchical-deep
models. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1958–1971 (2012)
24. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press,
Princeton (1976)
25. Tong, Z., Gao, J., Zhang, H.: Recognition, location, measurement, and 3D recon-
struction of concealed cracks using convolutional neural networks. Constr. Build.
Mater. 146, 775–787 (2017)
26. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing
robust features with denoising autoencoders. In: Proceedings of the 25th Interna-
tional Conference on Machine Learning, pp. 1096–1103, New York (2008)
27. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denois-
ing autoencoders: learning useful representations in a deep network with a local
denoising criterion. J. Mach. Learn. Res. 11(Dec), 3371–3408 (2010)
28. Xu, P., Davoine, F., Zha, H., Denœux, T.: Evidential calibration of binary SVM
classifiers. Int. J. Approximate Reasoning 72, 55–70 (2016)
29. Yager, R.R., Liu, L.: Classic Works of the Dempster-Shafer Theory of Belief Functions, vol. 219. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-44792-4
On Learning Evidential Contextual Corrections
from Soft Labels Using a Measure
of Discrepancy Between Contour Functions
1 Introduction
In Dempster-Shafer theory [15, 17], the correction of a source of information (a sensor,
for example) is classically done using the discounting operation introduced by
Shafer [15], but also by so-called contextual correction mechanisms [10, 13], which take
into account more refined knowledge about the quality of a source.
These mechanisms, called contextual discounting, negating, and reinforcement [13],
can be derived from the notions of reliability (or relevance), which concerns the competence
of a source to answer the question of interest, and truthfulness [12, 13], which indicates
the source's ability to say what it knows (it may also be linked with the notion of
bias of a source). Contextual discounting is an extension of the discounting operation,
which corresponds to a partially reliable and totally truthful source. Contextual
negating is an extension of the negating operation [12, 13], which corresponds to
the case of a totally reliable but partially truthful source, the extreme case being the
negation of a source [5]. Finally, contextual reinforcement is an extension of the
reinforcement, a dual operation of the discounting [11, 13].
In this paper, we tackle the problem of learning the parameters of these correction
mechanisms from soft labels, i.e., partially labelled data. More specifically, in our
case, soft labels indicate the true class of each object in an imprecise manner, through a
contour function.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 382–389, 2019.
https://doi.org/10.1007/978-3-030-35514-2_28
Learning Contextual Corrections from Soft Labels 383
A method for learning these corrections from labelled data (hard labels), where
the truth is perfectly known for each element of the learning set, has already been introduced
in [13]. It consists in minimizing a measure of discrepancy between the corrected
contour functions and the ground truths over the elements of a learning set. In this paper,
we show that this same measure can be used to learn from soft labels; tests on
synthetic and real data illustrate its ability to (1) improve a classifier even if the
data is only partially labelled, and (2) obtain better performance than learning these
corrections from hard labels approximating the only available soft labels.
This paper is organized as follows. In Sect. 2, the basic concepts and notations used
in this paper are presented. Then, in Sect. 3, the three contextual corrections considered,
as well as the method for learning them from hard labels, are presented, and our
proposition to extend this method to soft labels is introduced. Tests of this method on
synthetic and real data are presented in Sect. 4. Finally, a discussion and future works are given in Sect. 5.
The focal elements of a MF m are the subsets A of Ω such that m(A) > 0.
A MF m is in one-to-one correspondence with a plausibility function Pl defined for
all A ⊆ Ω by
$$Pl(A) = \sum_{B \cap A \neq \emptyset} m(B). \qquad (1)$$
for all A ⊆ Ω, where mΩ represents the total ignorance, i.e. the MF defined by
mΩ (Ω) = 1.
Several justifications for this mechanism can be found in [10, 13, 16].
The contour function of the MF m resulting from the discounting (3) is defined for
all ω ∈ Ω by (see for example [13, Prop. 11])
In this paper, we consider the case where the truth is no longer given precisely by the
values δi,k , but only in an imprecise manner by a contour function δ̃i s.t.
$$\tilde{\delta}_i : \Omega \to [0,1], \quad \omega_k \mapsto \tilde{\delta}_i(\omega_k) = \tilde{\delta}_{i,k}. \qquad (9)$$
The contour function δ̃i gives information about the true class in Ω of the instance i.
Knowing then the truth only partially, we propose to learn the correction parameters
using the following discrepancy measure Ẽpl , directly extending (8):
$$\tilde{E}_{pl}(\beta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \bigl(pl_i(\omega_k) - \tilde{\delta}_{i,k}\bigr)^2. \qquad (10)$$
The discrepancy measure Ẽpl also yields, for each correction (CD, CR, and CN), a linear
least-squares optimization problem. For example, for CD, Ẽpl can be written as
$$\tilde{E}_{pl}(\beta) = \lVert Q\beta - \tilde{d}\rVert^2 \qquad (11)$$
with
$$Q = \begin{bmatrix} \operatorname{diag}(pl_1 - 1) \\ \vdots \\ \operatorname{diag}(pl_n - 1) \end{bmatrix}, \qquad \tilde{d} = \begin{bmatrix} \tilde{\delta}_1 - 1 \\ \vdots \\ \tilde{\delta}_n - 1 \end{bmatrix} \qquad (12)$$
where diag(v) is a square diagonal matrix whose diagonal is composed of the elements
of the vector v, and where, for all i ∈ {1, . . . , n}, δ̃i is the column vector composed of
the values of the contour function δ̃i , i.e. δ̃i = (δ̃i,1 , . . . , δ̃i,K )T .
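The least-squares problem (11)–(12) can be solved directly with standard linear-algebra routines. Below is a minimal numerical sketch for CD; the function names are ours, and clipping β to [0, 1] is a crude stand-in for a properly box-constrained solve, not the paper's procedure. The corrected contour implied by the structure of (11)–(12) is pl′(ωk) = 1 − βk (1 − pl(ωk)).

```python
import numpy as np

def learn_cd_from_soft_labels(pl, delta_tilde):
    """Learn one coefficient beta_k per class by minimizing the discrepancy (10)
    in its least-squares form (11)-(12).

    pl          : (n, K) array, contour functions pl_i(omega_k) output by the classifier
    delta_tilde : (n, K) array, soft labels (contour functions of the true classes)
    """
    n, K = pl.shape
    # Stack the n diagonal blocks diag(pl_i - 1) into Q and the targets into d_tilde (12)
    Q = np.vstack([np.diag(pl[i] - 1.0) for i in range(n)])   # shape (n*K, K)
    d = (delta_tilde - 1.0).reshape(n * K)                    # shape (n*K,)
    beta, *_ = np.linalg.lstsq(Q, d, rcond=None)
    return np.clip(beta, 0.0, 1.0)  # coefficients are constrained to [0, 1]

def corrected_contour(pl, beta):
    # Corrected contour implied by (11)-(12): pl'(omega_k) = 1 - beta_k * (1 - pl(omega_k))
    return 1.0 - beta * (1.0 - pl)
```

With β = 1 the correction leaves the classifier untouched, and with β = 0 it returns the vacuous contour function (all ones), matching the discounting interpretation.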
In the following, this learning proposition is tested on generated and real data.
We first describe how soft labels can be generated from hard labels for the tests
presented afterwards on synthetic and real data.
It is not easy to find partially labelled data in the literature. Thus, as in [1, 8, 9, 14], we
have built our partially labelled data sets (soft labels) from perfect truths (hard labels)
using the procedure described in Algorithm 1 (where Beta, B, and U denote respectively
the Beta, Bernoulli, and uniform distributions).
386 S. Mutmainah et al.
Algorithm 1 yields soft labels that are all the more imprecise as the
most plausible class is incorrect.
We have then considered several real data sets from the UCI repository [6], composed
of numerical attributes, since the EkNN classifier is used. These data sets are described in
Table 1.
Table 1. Characteristics of the UCI datasets used (number of instances without missing data,
number of classes, number of numerical attributes used)
For each dataset, a 10-repeated 10-fold cross validation has been undertaken as
follows:
– the group containing one tenth of the data is considered as the test set (the instance
labels being made imprecise using Algorithm 1),
– the other 9 groups form the learning set, which is randomly divided into two groups
of equal size:
• one group to learn the EkNN classifier (learnt from hard truths),
• one group to learn the parameters of the correction mechanisms from soft labels
(the labels of the dataset are made imprecise using Algorithm 1).
For learning the parameters of contextual corrections, two strategies are compared.
1. In the first strategy, we use the optimization of Eq. (8) from the hard truths closest
to the soft truths (the most plausible class is chosen). Corrections learnt with this strategy
are denoted by CD, CR and CN.
2. In the second strategy, Eq. (10) is directly optimized from the soft labels (cf. Sect. 3.3).
The corrections obtained with this second strategy are denoted by CDsl, CRsl and
CNsl.
The performances of the systems (the classifier alone and the corrections - CD, CR
or CN - of this classifier according to the two strategies described above) are measured
using Ẽpl (10), where δ̃ represents the partially known truth. This measure corresponds
to the sum over the test instances of the differences, in the least squares sense, between
the truths being sought and the system outputs.
The performances Ẽpl (10) obtained from the UCI and generated data for the classifier
and its corrections are summarized in Table 2 for each type of correction. Standard
deviations are indicated in brackets.
From the results presented in Table 2, we can remark that, for CD, the second strategy
(CDsl), consisting in learning directly from the soft labels, yields lower differences
Ẽpl from the truth on the test set than the first strategy (CD), where
the correction parameters are learnt from approximate hard labels. We can also remark
that this strategy yields lower differences Ẽpl than the classifier alone, illustrating, in
these experiments, the usefulness of soft labels even when hard labels are not available,
which can be interesting in some applications.
The same conclusions can be drawn for CN.
Table 2. Performances Ẽpl obtained for the classifier alone and the classifier corrected with CD,
CR and CN using both strategies. Standard deviations are indicated in brackets.
For CR, the second strategy is also better than the first one, but we can note that,
unlike for the other corrections, the first strategy brings no improvement over
the classifier alone (the second strategy also having performances close to those of the
classifier alone).
Acknowledgement. The authors would like to thank the anonymous reviewers for their helpful
and constructive comments, which have helped them to improve the quality of the paper and to
consider new paths for future research.
Mrs. Mutmainah's research is supported by the overseas 5000 Doctors program of the Indonesian
Religious Affairs Ministry (MORA French Scholarship).
References
1. Côme, E., Oukhellou, L., Denœux, T., Aknin, P.: Learning from partially supervised data
using mixture models and belief functions. Pattern Recogn. 42(3), 334–348 (2009)
2. Denœux, T.: A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE
Trans. Syst. Man Cybern. 25(5), 804–813 (1995)
3. Denœux, T.: Conjunctive and disjunctive combination of belief functions induced by nondis-
tinct bodies of evidence. Artif. Intell. 172, 234–264 (2008)
4. Denœux, T.: Maximum likelihood estimation from uncertain data in the belief function
framework. IEEE Trans. Knowl. Data Eng. 25(1), 119–130 (2013)
5. Dubois, D., Prade, H.: A set-theoretic view of belief functions: logical operations and approx-
imations by fuzzy sets. Int. J. Gen. Syst. 12(3), 193–226 (1986)
6. Dua, D., Graff, C.: UCI Machine Learning Repository. School of Information and Computer
Science, University of California, Irvine (2019). http://archive.ics.uci.edu/ml
7. Elouedi, Z., Mellouli, K., Smets, P.: The evaluation of sensors’ reliability and their tuning
for multisensor data fusion within the transferable belief model. In: Benferhat, S., Besnard,
P. (eds.) ECSQARU 2001. LNCS (LNAI), vol. 2143, pp. 350–361. Springer, Heidelberg
(2001). https://doi.org/10.1007/3-540-44652-4_31
8. Kanjanatarakul, O., Kuson, S., Denoeux, T.: An evidential K-nearest neighbor classifier
based on contextual discounting and likelihood maximization. In: Destercke, S., Denoeux,
T., Cuzzolin, F., Martin, A. (eds.) BELIEF 2018. LNCS (LNAI), vol. 11069, pp. 155–162.
Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99383-6_20
9. Kanjanatarakul, O., Kuson, S., Denœux, T.: A new evidential k-nearest neighbor rule based
on contextual discounting with partially supervised learning. Int. J. Approx. Reason. 113,
287–302 (2019)
10. Mercier, D., Quost, B., Denœux, T.: Refined modeling of sensor reliability in the belief func-
tion framework using contextual discounting. Inf. Fusion 9(2), 246–258 (2008)
11. Mercier, D., Lefèvre, E., Delmotte, F.: Belief functions contextual discounting and canonical
decompositions. Int. J. Approx. Reason. 53(2), 146–158 (2012)
12. Pichon, F., Dubois, D., Denoeux, T.: Relevance and truthfulness in information correction
and fusion. Int. J. Approx. Reason. 53(2), 159–175 (2012)
13. Pichon, F., Mercier, D., Lefèvre, E., Delmotte, F.: Proposition and learning of some belief
function contextual correction mechanisms. Int. J. Approx. Reason. 72, 4–42 (2016)
14. Quost, B., Denoeux, T., Li, S.: Parametric classification with soft labels using the eviden-
tial EM algorithm: linear discriminant analysis versus logistic regression. Adv. Data Anal.
Classif. 11(4), 659–690 (2017)
15. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton
(1976)
16. Smets, P.: Belief functions: the disjunctive rule of combination and the generalized Bayesian
theorem. Int. J. Approx. Reason. 9(1), 1–35 (1993)
17. Smets, P., Kennes, R.: The transferable belief model. Artif. Intell. 66(2), 191–234 (1994)
18. Zaffalon, M., Corani, G., Mauá, D.D.: Evaluating credal classifiers by utility-discounted
predictive accuracy. Int. J. Approx. Reason. 53(8), 1282–1301 (2012)
Efficient Möbius Transformations
and Their Applications to D-S Theory
1 Introduction
Dempster-Shafer Theory (DST) [11] is an elegant formalism that generalizes
Bayesian probability theory. It is more expressive in that it allows
a source to represent its belief in the state of a variable not only by assigning
credit directly to a possible state (strong evidence) but also by assigning credit to
any subset (weaker evidence) of the set Ω of all possible states. This assignment
of credit is called a mass function and provides meta-information to quantify
This work was carried out and co-funded in the framework of the Labex MS2T and
the Hauts-de-France region of France. It was supported by the French Government,
through the program “Investments for the future” managed by the National Agency
for Research (Reference ANR-11-IDEX-0004-02).
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 390–403, 2019.
https://doi.org/10.1007/978-3-030-35514-2_29
Efficient Möbius Transformations and Their Applications to D-S Theory 391
the level of uncertainty about one's beliefs, considering the way one established
them, which is critical for decision making.
Nevertheless, this information comes at a cost: considering 2|Ω| potential
values instead of only |Ω| can lead to computationally and spatially expensive
algorithms. These can become difficult to use for more than a dozen possible states
(e.g. 20 states in Ω generate more than a million subsets), although we may need
to consider large frames of discernment (e.g. for classification or identification).
Moreover, the fact that these algorithms are not tractable beyond a few dozen
states means that their performance degrades greatly before that, which further limits
their use in real-time applications. To tackle this issue, a lot of work has
been done to reduce the complexity of the transformations used to combine belief
sources with Dempster's rule [6]. We distinguish between two approaches, which
we call powerset-based and evidence-based.
The powerset-based approach concerns all algorithms based on the structure
of the powerset 2Ω of the frame of discernment Ω. Their complexity
depends on |Ω|. Early works [1,7,12,13] proposed optimizations restricting
the structure of evidence to only singletons and their negations, which greatly
restrains the expressiveness of DST. Later, a family of optimal algorithms
working in the general case, namely those based on the Fast Möbius Transform
(FMT) [9], was discovered. Their complexity is O(|Ω|.2|Ω| ) in time and O(2|Ω| )
in space. The FMT has become the de facto standard for the computation of every
transformation in DST. Consequently, efforts were made to reduce the size of Ω
in order to benefit from the optimal algorithms of the FMT. More specifically, [14] refers
to the process of conditioning by the combined core (the intersection of the unions of
all focal sets of each belief source) and lossless coarsening (the merging of elements
of Ω that always appear together in focal sets). Monte Carlo methods
[14] have also been proposed, but they depend on a number of trials that must be large
and grows with |Ω|, in addition to not being exact.
The evidence-based approach concerns all algorithms that aim to restrict the
computations to the only subsets that contain information (evidence), called
focal sets, which are usually far less numerous than 2|Ω| . This approach, also referred
to as the obvious one, implicitly originates from the seminal work of Shafer [11]
and is often more efficient than the powerset-based one, since it only depends
on the information contained in the sources, in a quadratic way. In doing so, it allows
for the exploitation of the full potential of DST by enabling us to choose any
frame of discernment, without concern about its size. Moreover, the evidence-based
approach benefits directly from the use of approximation methods, some
of which are very efficient [10]. Therefore, this approach seems superior to the
FMT in most use cases, above all when |Ω| is large, where an algorithm with
exponential complexity is simply intractable.
It is also easy to find evidence-based algorithms computing all
DST transformations, except for the conjunctive and disjunctive decompositions,
for which we recently proposed a method [4].
However, since these algorithms rely only on the information contained
in sources, they do not exploit the structure of the powerset to reduce the
392 M. Chaveroche et al.
¹ The following definitions hold for lower semifinite partially ordered sets as well, i.e.
partially ordered sets such that the number of elements of P lower than another
element of P (in the sense of ≤) is finite. For the sake of simplicity, however, we will
only talk of finite partially ordered sets.
Fig. 1. Illustration representing the arrows contained in the sequence H computing the
zeta transformation of G⊆ = {(X, Y ) ∈ 2Ω × 2Ω /X ⊆ Y }, where Ω = {a, b, c}. For the
sake of clarity, identity arrows are not displayed. This representation is derived from
the one used in [9].
The sequences of graphs H and H̄ are the foundation of the FMT algorithms.
Their execution is O(n.2n ) in time and O(2n ) in space.
the condition is already satisfied by the one of Sect. 2.1. So, if this condition is
satisfied, we will say that H computes the Möbius transformation of G≤ .
Application to the Boolean Lattice 2Ω (FMT). Resuming the application
of Sect. 2.1, for all X ∈ 2Ω , if ωi ∉ X, then there is an arrow (X, Y ) in Hi where
Y = X ∪ {ωi } and X ≠ Y ; but then, for any set Y′ such that (Y, Y′ ) ∈ Hi , we
have Y′ = Y ∪ {ωi } = Y . Conversely, if ωi ∈ X, then the arrow (X, X ∪ {ωi })
is in Hi , but its head and tail are equal. Thus, H also computes the Möbius
transformation of G⊆ .
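The passes just described can be made concrete with a compact implementation of the FMT on 2^Ω, with subsets encoded as n-bit masks (bit i standing for ωi). This is a sketch of the standard algorithm rather than the authors' code; pass i applies the arrows X → X ∪ {ωi}, giving the O(n·2^n) time and O(2^n) space stated above.

```python
def zeta_subsets(f, n):
    """Zeta transform on the Boolean lattice 2^Omega with |Omega| = n:
    returns g with g[Y] = sum of f[X] over all X subseteq Y."""
    g = list(f)
    for i in range(n):                  # one pass per element omega_i
        for Y in range(1 << n):
            if Y >> i & 1:              # arrow (Y \ {omega_i}) -> Y
                g[Y] += g[Y ^ (1 << i)]
    return g

def mobius_subsets(g, n):
    """Inverse (Mobius) transform: recovers f from g = zeta_subsets(f, n)."""
    f = list(g)
    for i in range(n):
        for Y in range(1 << n):
            if Y >> i & 1:
                f[Y] -= f[Y ^ (1 << i)]
    return f
```

With f a mass function, `zeta_subsets` yields the sums over subsets (the implicability function); iterating over supersets instead would give the commonality function.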
Irreducible Elements. We note ∨I(P ) the set of join-irreducible elements of
P , i.e. the elements i such that i ≠ ⋀P and, for all x, y ∈ P , if x < i
and y < i, then x ∨ y < i. Dually, we note ∧I(P ) the set of meet-irreducible
elements of P , i.e. the elements i such that i ≠ ⋁P and, for all
x, y ∈ P , if x > i and y > i, then x ∧ y > i. For example, in the Boolean lattice
2Ω , the join-irreducible elements are the singletons {ω}, where ω ∈ Ω.
If P is a finite lattice, then every element of P is the join of join-irreducible
elements and the meet of meet-irreducible elements.
Support of a Function in P . The support supp(f ) of a function f : P → R
is defined as supp(f ) = {x ∈ P / f (x) ≠ 0}.
For example, in DST, the set of focal elements of a mass function m is supp(m).
The set of focal points F̊ of a mass function m from [4] for the conjunctive weight
function is ∧ supp(m). For the disjunctive one, it is ∨ supp(m).
It has been proven in [4] that the image of 2Ω through the conjunctive weight
function can be computed without redundancies by only considering the focal
points ∧supp(m) in the definition of the multiplicative Möbius transform of
the commonality function. The image of any set in 2Ω \ ∧supp(m) through the
conjunctive weight function is 1. The same can be stated for the disjunctive
weight function regarding the implicability function and ∨supp(m). In the same
way, the image of any set in 2Ω \ ∧supp(m) through the commonality function
is only a duplicate of the image of a set in ∧supp(m), and can be recovered by
searching for its smallest superset in ∧supp(m). In fact, as generalized in an
upcoming article [5], for any function f : P → R, ∧supp(f ) is sufficient to define
its zeta and Möbius transforms based on the partial order ≥, and ∨supp(f ) is
sufficient to define its zeta and Möbius transforms based on the partial order ≤.
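On the Boolean lattice, the focal points ∧supp(m) are simply the intersection-closure of the focal sets. A small sketch, with sets encoded as bitmasks (the fixed-point loop and the function name are ours); for many mass functions this closure stays far smaller than 2^|Ω|:

```python
def meet_closure(focal):
    """Smallest intersection-closed family containing `focal`
    (sets encoded as bitmasks): the focal points of a mass function
    whose focal sets are `focal`, on the Boolean lattice 2^Omega."""
    closed = set(focal)
    frontier = set(focal)
    while frontier:                 # fixed point: keep adding pairwise meets
        new = set()
        for a in frontier:
            for b in closed:
                m = a & b           # meet = set intersection on 2^Omega
                if m not in closed:
                    new.add(m)
        closed |= new
        frontier = new
    return closed
```

For instance, the focal sets {a, b} and {b, c} (masks 0b011 and 0b110) generate one extra focal point, their intersection {b}.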
However, considering the case where P is a finite lattice, naive algorithms
that only consider ∧supp(f ) or ∨supp(f ) have upper-bound complexities in
O(|∧supp(f )|2 ) or O(|∨supp(f )|2 ), which may be worse than the optimal
complexity O(|∨I(P )|.|P |) of a procedure that considers the whole lattice P . In this
paper, we propose algorithms with complexities always less than O(|∨I(P )|.|P |)
computing the image of a meet-closed (e.g. ∧supp(f )) or join-closed (e.g.
∨supp(f )) subset of P through the zeta or Möbius transform, provided that
P is a finite distributive lattice.
Lemma 1 (Safe join). Let us consider a finite distributive lattice L. For all
i ∈ ∨I(L) and for all x, y ∈ L such that i ≰ x and i ≰ y, we have i ≰ x ∨ y.
Proof. First, it can be easily shown that the meet of any two elements of ι(S)
is either ⋀S or in ι(S). Then, suppose that we generate LS with the joins of
elements of ι(S), to which we add the element ⋀S. Then, since P is distributive,
we have that for all x, y ∈ LS , their meet x ∧ y is either ⋀S or equal to the join
of every meet of pairs (iS,x , iS,y ) ∈ ι(S)2 , where iS,x ≤ x and iS,y ≤ y. Thus,
x ∧ y ∈ LS , which implies that LS is a sublattice of P . In addition, notice that
for each nonzero element s ∈ S and for all i ∈ ∨I(P ) such that s ≥ i, we also
have by construction s ≥ iS ≥ i, where iS = ⋀{s′ ∈ S / s′ ≥ i}. Therefore, we
have s = ⋁{i ∈ ∨I(P ) / s ≥ i} = ⋁{i ∈ ι(S) / s ≥ i}, i.e. s ∈ LS . Besides,
if ⋀P ∈ S, then it is equal to ⋀S, which is also in LS by construction. So,
S ⊆ LS . It follows that the meet or join of every nonempty subset of S is in
LS , i.e. MS ⊆ LS and JS ⊆ LS , where MS is the smallest meet-closed subset
of P containing S and JS is the smallest join-closed subset of P containing S.
Furthermore, ι(S) ⊆ MS , which means that we cannot build a smaller sublattice
of P containing S. Therefore, LS is the smallest sublattice of P containing S.
Finally, for any i ∈ ∨I(P ) such that ∃s ∈ S, s ≥ i, we note iS = ⋀{s ∈
S / s ≥ i}. For all x, y ∈ LS , if iS > x and iS > y, then by construction of
ι(S), we have i ≰ x and i ≰ y (otherwise, iS would be less than x or y), which
implies by Lemma 1 that i ≰ x ∨ y. Since i ≤ iS , we necessarily have iS > x ∨ y.
Therefore, iS is a join-irreducible element of LS .
Proposition 2 (Lattice support). The smallest sublattice of P containing
both ∧supp(f ) and ∨supp(f ), noted Lsupp(f ), can be defined as:
$${}^{L}\operatorname{supp}(f) = \Bigl\{\bigvee X \,/\, X \subseteq \iota(\operatorname{supp}(f)),\ X \neq \emptyset\Bigr\} \cup \Bigl\{\bigwedge \operatorname{supp}(f)\Bigr\}.$$
Fig. 2. Illustration representing the arrows contained in the sequence H when computing
the zeta transformation of G⊆ = {(x, y) ∈ L2 / x ⊆ y}, where L =
{∅, {a}, {d}, {a, d}, {c, d, f }, {a, c, d, f }, Ω} with Ω = {a, b, c, d, e, f } and ∨I(L) =
{{a}, {d}, {c, d, f }, Ω}. For the sake of clarity, identity arrows are not displayed.
Proof. Since the join-irreducible elements are ordered such that ∀ik , il ∈ ∨I(L),
k < l ⇒ ik ≱ il , it is trivial to see that for any il ∈ ∨I(L) and ik ∈ ∨I(L)l−1 ,
we have ik ≱ il . Then, using Lemma 1 by recurrence, it is easy to get that
il ≰ ⋁ ∨I(L)l−1 .
$$H_k = \{(x, y) \in L^2 \,/\, y = x \text{ or } y = x \vee i_k\}, \qquad \bar{H}_k = \{(x, y) \in L^2 \,/\, x = y \text{ or } x = y \vee i_k\}.$$
leading to the same issue. Indeed, to build a unique path between two elements
x, y of L such that x ≤ y, we start from x. Then, at step 1, we get to the join x ∨ in
if in ≤ y (we stay at x otherwise, i.e. identity arrow), then we get to x ∨ in ∨ in−1
if in−1 ≤ y, and so on until we get to y. However, if we have in ≤ x ∨ in−1 , with
in ≰ x, then there are at least two paths from x to y: one passing by the join
with in at step 1, and one passing by the identity arrow instead.
More generally, this kind of issue may only appear if there is a k where
ik ≤ x ∨ ⋁ ∨I(L)k−1 with ik ≰ x, where ∨I(L)k−1 = {ik−1 , ik−2 , . . . , i1 }. But,
since L is a finite distributive lattice, and since its join-irreducible elements are
ordered such that ∀ij , il ∈ ∨I(L), j < l ⇒ ij ≱ il , we have by Corollary 1
that ik ≰ ⋁ ∨I(L)k−1 . So, if ik ≰ x, then by Lemma 1, we also have ik ≰
x ∨ ⋁ ∨I(L)k−1 . Thereby, there is a unique path from x to y, meaning that the
condition of Sect. 2.1 is satisfied: H computes the zeta transformation of G≤ .
Also, ∀x ∈ L, if ik ≰ x, then there is an arrow (x, y) in Hk where y = x ∨ ik
and x ≠ y; but then, for any element y′ such that (y, y′ ) ∈ Hk , we have y′ =
y ∨ ik = y. Conversely, if ik ≤ x, then the arrow (x, x ∨ ik ) is in Hk , but its head
and tail are equal. Thus, the condition of Sect. 2.2 is satisfied: H also computes
the Möbius transformation of G≤ .
Finally, to obtain H̄, we only need to reverse the paths of H, i.e. reverse the
arrows in each Hk and reverse the sequence of join-irreducible elements.
The procedure described in Theorem 1 to compute the zeta and Möbius
transforms of a function on P is always less than O(|∨I(P )|.|P |). Its upper-bound
complexity for the distributive lattice L = Lsupp(f ) is O(|∨I(L)|.|L|),
which is actually the optimal one for a lattice.
Yet, we can reduce this complexity even further if we have ∧supp(f ) or
∨supp(f ). This is the motivation behind the procedure described in the following
Theorem 2. As a matter of fact, [3] proposed a meta-procedure producing
an algorithm that computes the zeta and Möbius transforms in an arbitrary
intersection-closed family F of sets of 2Ω with a circuit of size O(|Ω|.|F |). However,
this meta-procedure is O(|Ω|.2|Ω| ). Here, Theorem 2 provides a procedure
that directly computes the zeta and Möbius transforms with the optimal complexity
O(|Ω|.|F |), while being much simpler. Besides, our method is far more
general, since it has the potential (depending on data structure) to reach this
complexity in any meet-closed subset of a finite distributive lattice.
Theorem 2 (Efficient Möbius Transformation in a join-closed or
meet-closed subset of P). Let us consider a meet-closed subset M of P (such
as ∧supp(f )). Also, let the join-irreducible elements ι(M ) be ordered such that
∀ik , il ∈ ι(M ), k < l ⇒ ik ≱ il .
The sequence H M of graphs HkM computes the zeta and Möbius transformations
of G M≥ = {(x, y) ∈ M 2 / x ≥ y} if:
$$H^M_k = \Bigl\{(x, y) \in M^2 \,/\, x = y\ \text{ or }\ \Bigl(x = \bigwedge\{s \in M / s \geq y \vee i_k\}\ \text{ and }\ y \vee \bigvee \iota(M)_k \geq x\Bigr)\Bigr\},$$
² This unit cost can be obtained when P = 2Ω by using a dynamic binary tree as the
data structure for the representation of M . With it, finding the proxy element only takes
the reading of a binary string, considered as one operation. Further details will soon
be available in an extended version of this work [5].
The idea is that we use the same procedure as in Theorem 1, which builds unique
paths simply by generating all elements of a finite distributive lattice L based on
the joins of its join-irreducible elements step by step, as if we had M ⊆ L, except
that we remove all elements that are not in M . In doing so, the only difference
is that the join y ∨ ik of an element y of M with a join-irreducible ik ∈ ι(M )
of this hypothetical lattice L is not necessarily in M . However, thanks to the
meet-closure of M and to the synchronizing condition y ∨ ⋁ι(M )k ≥ p, we can
“jump the gap” between two elements y and p of M separated by elements of
L\M and maintain the unicity of the path between any two elements x and y
of M . Indeed, for every join-irreducible element ik ∈ ι(M ), if x ≥ y ∨ ik , then
since M is meet-closed, there is an element p of M , which we call proxy, such that
p = ⋀{s ∈ M / s ≥ y ∨ ik }. Yet, we have to make sure (1) that p can only be
obtained from y with exactly one particular ik if p ≠ y, and (2) that the sequence
of these particular join-irreducible elements forming the arrows of the path from
x to y are in the correct order. This is the purpose of the synchronizing condition
y ∨ ⋁ι(M )k ≥ p.
For (1), we will show that for a same proxy p, it holds that ∃!k ∈ [1, |ι(M)|]
such that p ≠ y, y ∨ ι(M)k ≥ p and y ≱ ik. Recall that we ordered the elements
of ι(M) such that ∀ij, il ∈ ι(M), j < l ⇒ ij ≱ il. Let us denote by k the greatest
index in [1, |ι(M)|] such that p ≥ ik and y ≱ ik. It is easy to see that the
synchronizing condition is satisfied for ik. Then, for all j ∈ [1, k−1], Corollary
1 and Lemma 1 give us that y ∨ ι(M)j ≱ ik, meaning that y ∨ ι(M)j ≱ p.
For all j ∈ [k+1, |ι(M)|], either y ≥ ij (i.e. y ∨ ij = y) or p ≱ ij. Either
way, it is impossible to reach p from y ∨ ij. Therefore, there exists a unique path
from y to p that takes the arrow (p, y) from H^M_k.
Concerning (2), for all (x, y) ∈ G^M_≥ with x ≠ y, let us denote the proxy element
p1 = ⋀{s ∈ M / s ≥ y ∨ ik1}, where k1 is the greatest index in [1, |ι(M)|] such
that p1 ≥ ik1 and y ≱ ik1. We have (p1, y) ∈ H^M_k1. Let us suppose that there exists
another proxy element p2 such that p2 ≠ p1, x ≥ p2 and p2 = ⋀{s ∈ M / s ≥
p1 ∨ ik2}, where k2 is the greatest index in [1, |ι(M)|] such that p2 ≥ ik2 and
p1 ≱ ik2. We have (p2, p1) ∈ H^M_k2. Since p2 > p1 and p1 ≥ ik1, we have that
p2 ≥ ik1, i.e. k2 ≠ k1. So, two cases are possible: either k1 > k2 or k1 < k2.
If k1 > k2, then there is a path ((p2, p1), (p1, p1), . . . , (p1, p1), (p1, y)) from p2
to y. Moreover, we know that at step k1, we get p1 from y and that we have
p2 ≥ ik1 and y ≱ ik1, meaning that there could only exist an arrow (p2, y) in
H^M_k3 if k3 > k1 > k2. Suppose this k3 exists. Then, since k3 > k1 > k2, we
have that p2 ≥ ik3 and y ≱ ik3, but also p1 ≱ ik3, since we would have k1 = k3
otherwise. This implies, by maximality of k2, that k3 ≤ k2, which is impossible.
Therefore, there is no k3 such that (p2, y) ∈ H^M_k3, i.e. there is a unique path
from p2 to y. Otherwise, if k1 < k2, then the latter path between p2 and y does
not exist. But, since p1 ≱ ik2 and p1 ≥ y, we have y ≱ ik2, meaning that there
exists an arrow (p2, y) ∈ H^M_k2, which forms a unique path from p2 to y. The
recurrence of this reasoning enables us to conclude that there is a unique path
from x to y.
Thus, the condition of Sect. 2.1 is satisfied: H^M computes the zeta transformation
of G^M_≥. Also, for the same reasons as with Theorem 1, we have that H^M
computes the Möbius transformation of G^M_≥. The proof for H^J and G^J_≤ is
analogous if P is a Boolean lattice.
5 Conclusion
In this paper, we proposed the Efficient Möbius Transformations (EMT), which
are general procedures to compute the zeta and Möbius transforms of any func-
tion defined on any finite distributive lattice with optimal complexity. They are
based on our reformulation of the Möbius inversion theorem with focal points
only, featured in an upcoming detailed article [5] currently in preparation. The
EMT optimally exploit the information contained in both the support of this
function and the structure of distributive lattices. Doing so, the EMT always
perform better than the optimal complexity for an algorithm considering the
whole lattice, such as the FMT for all DST transformations, given the support
of this function. In [5], we will see that our approach is still more efficient when
this support is not given. This forthcoming article will also feature examples of
application in DST, algorithms and implementation details.
References
1. Barnett, J.A.: Computational methods for a mathematical theory of evidence. In:
Proceedings of IJCAI, vol. 81, pp. 868–875 (1981)
2. Björklund, A., Husfeldt, T., Kaski, P., Koivisto, M.: Trimmed Moebius inversion
and graphs of bounded degree. Theory Comput. Syst. 47(3), 637–654 (2010)
3. Björklund, A., Husfeldt, T., Kaski, P., Koivisto, M., Nederlof, J., Parviainen, P.:
Fast zeta transforms for lattices with few irreducibles. ACM TALG 12(1), 4 (2016)
4. Chaveroche, M., Davoine, F., Cherfaoui, V.: Calcul exact de faible complexité
des décompositions conjonctive et disjonctive pour la fusion d’information. In:
Proceedings of GRETSI (2019)
5. Chaveroche, M., Davoine, F., Cherfaoui, V.: Efficient algorithms for Möbius trans-
formations and their applications to Dempster-Shafer Theory. Manuscript available
on request (2019)
6. Dempster, A.: A generalization of Bayesian inference. J. Roy. Stat. Soc. Ser. B
(Methodol.) 30, 205–232 (1968)
7. Gordon, J., Shortliffe, E.H.: A method for managing evidential reasoning in a
hierarchical hypothesis space. Artif. Intell. 26(3), 323–357 (1985)
8. Kaski, P., Kohonen, J., Westerbäck, T.: Fast Möbius inversion in semimodular
lattices and U-labelable posets. arXiv preprint arXiv:1603.03889 (2016)
9. Kennes, R.: Computational aspects of the Mobius transformation of graphs. IEEE
Trans. Syst. Man Cybern. 22(2), 201–223 (1992)
10. Sarabi-Jamab, A., Araabi, B.N.: Information-based evaluation of approximation
methods in Dempster-Shafer Theory. IJUFKS 24(04), 503–535 (2016)
11. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press,
Princeton (1976)
12. Shafer, G., Logan, R.: Implementing Dempster’s rule for hierarchical evidence.
Artif. Intell. 33(3), 271–298 (1987)
13. Shenoy, P.P., Shafer, G.: Propagating belief functions with local computations.
IEEE Expert 1(3), 43–52 (1986)
14. Wilson, N.: Algorithms for Dempster-Shafer Theory. In: Kohlas, J., Moral, S. (eds.)
Handbook of Defeasible Reasoning and Uncertainty Management Systems: Algo-
rithms for Uncertainty and Defeasible Reasoning, vol. 5, pp. 421–475. Springer,
Netherlands (2000). https://doi.org/10.1007/978-94-017-1737-3_10
Dealing with Continuous Variables
in Graphical Models
Christophe Gonzales
Since their introduction in the 1980s, Bayesian networks (BN) have become one
of the most popular models for handling "precise" uncertainties [24]. However,
by their very definition, BNs can only cope with discrete random variables.
Unfortunately, in real-world applications, it is often the case that some
variables are of a continuous nature. Dealing with such variables is challenging
both for learning and inference tasks [9]. The goal of this tutorial is to investigate
techniques used to cope with such variables and, more importantly, to highlight
their pros and cons.
3 Conclusion
Dealing with continuous random variables in probabilistic graphical models is
challenging. One either has to resort to discretization, which raises many issues
and may yield results far from the expected ones, or to exploit models specifically
designed to cope with continuous variables. But choosing the best model
is not easy, in the sense that one has to trade off between the flexibility of the
model and the complexity of its learning and inference. Clearly, there is still
room for improvement in such models, maybe by exploiting other features of
probabilities, such as copulas [5].
References
1. Bergsma, W.: Testing conditional independence for continuous random variables.
Technical report, 2004–049, EURANDOM (2004)
2. Cobb, B., Shenoy, P.: Inference in hybrid Bayesian networks with mixtures of
truncated exponentials. Int. J. Approximate Reasoning 41(3), 257–286 (2006)
3. Cobb, B., Shenoy, P., Rumı́, R.: Approximating probability density functions in
hybrid Bayesian networks with mixtures of truncated exponentials. Stat. Comput.
16(3), 293–308 (2006)
4. Dechter, R.: Bucket elimination: a unifying framework for reasoning. Artif. Intell.
113, 41–85 (1999)
5. Elidan, G.: Copula Bayesian networks. In: Proceedings of NIPS 2010, pp. 559–567
(2010)
6. Elidan, G., Nachman, I., Friedman, N.: “ideal parent” structure learning for con-
tinuous variable Bayesian networks. J. Mach. Learn. Res. 8, 1799–1833 (2007)
7. Friedman, N., Goldszmidt, M.: Discretizing continuous attributes while learning
Bayesian networks. In: Proceedings of ICML 1996, pp. 157–165 (1996)
8. Geiger, D., Heckerman, D.: Learning Gaussian networks. In: Proceedings of UAI
1994, pp. 235–243 (1994)
9. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Tech-
niques. MIT Press, Cambridge (2009)
10. Kozlov, A., Koller, D.: Nonuniform dynamic discretization in hybrid networks. In:
Proceedings of UAI 1997, pp. 314–325 (1997)
11. Kuipers, J., Moffa, G., Heckerman, D.: Addendum on the scoring of Gaussian
directed acyclic graphical models. Ann. Stat. 42(4), 1689–1691 (2014)
12. Langseth, H., Nielsen, T., Rumı́, R., Salmerón, A.: Inference in hybrid Bayesian
networks with mixtures of truncated basis functions. In: Proceedings of PGM 2012,
pp. 171–178 (2012)
13. Langseth, H., Nielsen, T., Rumı́, R., Salmerón, A.: Mixtures of truncated basis
functions. Int. J. Approximate Reasoning 53(2), 212–227 (2012)
14. Lauritzen, S.: Propagation of probabilities, means and variances in mixed graphical
association models. J. Am. Stat. Assoc. 87, 1098–1108 (1992)
15. Lauritzen, S., Jensen, F.: Stable local computation with mixed Gaussian distribu-
tions. Stat. Comput. 11(2), 191–203 (2001)
16. Lauritzen, S., Wermuth, N.: Graphical models for associations between variables,
some of which are qualitative and some quantitative. Ann. Stat. 17(1), 31–57
(1989)
17. Lerner, U., Segal, E., Koller, D.: Exact inference in networks with discrete children
of continuous parents. In: Proceedings of UAI 2001, pp. 319–328 (2001)
18. Mabrouk, A., Gonzales, C., Jabet-Chevalier, K., Chojnaki, E.: Multivariate cluster-
based discretization for Bayesian network structure learning. In: Beierle, C., Dekht-
yar, A. (eds.) SUM 2015. LNCS (LNAI), vol. 9310, pp. 155–169. Springer, Cham
(2015). https://doi.org/10.1007/978-3-319-23540-0_11
19. Madsen, A., Jensen, F.: LAZY propagation: a junction tree inference algorithm
based on lazy inference. Artif. Intell. 113(1–2), 203–245 (1999)
20. Margaritis, D., Thrun, S.: A Bayesian multiresolution independence test for con-
tinuous variables. In: Proceedings of UAI 2001, pp. 346–353 (2001)
21. Monti, S., Cooper, G.: A multivariate discretization method for learning Bayesian
networks from mixed data. In: Proceedings of UAI 1998, pp. 404–413 (1998)
22. Moral, S., Rumi, R., Salmerón, A.: Mixtures of truncated exponentials in hybrid
Bayesian networks. In: Benferhat, S., Besnard, P. (eds.) ECSQARU 2001. LNCS
(LNAI), vol. 2143, pp. 156–167. Springer, Heidelberg (2001). https://doi.org/10.
1007/3-540-44652-4_15
23. Moral, S., Rumı́, R., Salmerón, A.: Estimating mixtures of truncated exponentials
from data. In: Proceedings of PGM 2002, pp. 135–143 (2002)
24. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kauffmann, Burlington (1988)
25. Poland, W., Shachter, R.: Three approaches to probability model selection. In: de
Mantaras, R.L., Poole, D. (eds.) Proceedings of UAI 1994, pp. 478–483 (1994)
26. Romero, R., Rumı́, R., Salmerón, A.: Structural learning of Bayesian networks
with mixtures of truncated exponentials. In: Proceedings of PGM 2004, pp. 177–
184 (2004)
27. Rumı́, R., Salmerón, A.: Approximate probability propagation with mixtures of
truncated exponentials. Int. J. Approximate Reasoning 45(2), 191–210 (2007)
28. Salmerón, A., Rumı́, R., Langseth, H., Madsen, A.L., Nielsen, T.D.: MPE infer-
ence in conditional linear Gaussian networks. In: Destercke, S., Denoeux, T. (eds.)
ECSQARU 2015. LNCS (LNAI), vol. 9161, pp. 407–416. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-20807-7_37
29. Shafer, G.: Probabilistic Expert Systems. Society for Industrial and Applied Math-
ematics (1996)
30. Shenoy, P.: A re-definition of mixtures of polynomials for inference in hybrid
Bayesian networks. In: Liu, W. (ed.) ECSQARU 2011. LNCS (LNAI), vol. 6717, pp.
98–109. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22152-1_9
31. Shenoy, P.: Two issues in using mixtures of polynomials for inference in hybrid
Bayesian networks. Int. J. Approximate Reasoning 53(5), 847–866 (2012)
32. Shenoy, P., Shafer, G.: Axioms for probability and belief function propagation. In:
Proceedings of UAI 1990, pp. 169–198 (1990)
33. Shenoy, P., West, J.: Inference in hybrid Bayesian networks using mixtures of
polynomials. Int. J. Approximate Reasoning 52(5), 641–657 (2011)
Towards Scalable and Robust
Sum-Product Networks
1 Introduction
Sum-Product Networks (SPNs) [15] (conceptually similar to Arithmetic
Circuits [4]) are a class of deep probabilistic graphical models where exact
marginal inference is always tractable. More precisely, any marginal query can
be computed in time polynomial in the network size. Still, SPNs can capture
high tree-width models [15] and are capable of representing complex and highly
multidimensional distributions [5]. This promising combination of efficiency and
representational power has motivated several applications of SPNs to a variety
of machine learning tasks [1,3,11,16–18].
Like any other standard probabilistic graphical model, SPNs learned from
data are prone to overfitting when evaluated at poorly represented regions of the
feature space, leading to overconfident and often unreliable conclusions. However,
due to the probabilistic semantics of SPNs, we can mitigate that issue through a
principled analysis of the reliability of each output. A notable example is Credal
SPNs (CSPNs) [9], an extension of SPNs to imprecise probabilities where we can
compute a measure of the robustness of each prediction. Such robustness values
are useful tools for decision-making, as they are highly correlated with accuracy,
and thus tell us when to trust the CSPN’s prediction: if the robustness of a
prediction is low, we can suspend judgement or even resort to another machine
where xE is any partial or complete evidence. The SPN and its root node are
used interchangeably to mean the same object. We assume that every indicator
variable appears in at most one leaf node. Every arc from a sum node i to a
child j is associated with a non-negative weight wi,j such that Σj wi,j = 1 (this
constraint does not affect the generality of the model [14]).
Given an SPN S and a node i, we denote by S^i the SPN obtained by rooting
the network at i, that is, by discarding every non-descendant of i (other than i
itself). We call S^i the sub-network rooted at i, which is an SPN by itself (albeit
over a possibly different set of variables). If ω are the weights of an SPN S and
i is a node, we denote by ωi the weights in the sub-network S^i rooted at i, and
by wi the vector of weights wi,j associated with the arcs from i to its children j.
The scope of an SPN is the set of variables that appear in it. For an SPN
which is a leaf associated with an indicator variable, the scope is the respective
random variable. For an SPN which is not a leaf, the scope is the union of the
scopes of its children. We assume that scopes of children of a sum node are
identical (completeness) and scopes of children of a product node are disjoint
(decomposability) [12].
The value of an SPN S at a given instantiation xE, written S(xE), is
defined recursively in terms of its root node r. If r is a leaf node associated
with indicator variable λr,xr, then S(xE) = λr,xr(xE). Else, if r is a product
node, then S(xE) = ∏j S^j(xE), where j ranges over the children of r. Finally,
if r is a sum node, then S(xE) = Σj wr,j S^j(xE), where again j ranges over
the children of r. Given these properties, it is easy to check that S induces
a probability distribution for XV such that S(xE) = P(xE) for any xE and
E ⊆ V. One can also compute expectations of functions over a variable Xi as
E(f | XE = xE) = Σxi f(xi) P(xi | XE = xE).
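The recursive definition above can be sketched as follows; the node classes and the small two-variable network are our own illustration, not the paper's notation.

```python
# Minimal sketch of the recursive SPN value S(x_E).
class Leaf:
    def __init__(self, var, value):
        self.var, self.value = var, value          # indicator lambda_{var,value}
    def eval(self, evidence):
        # An indicator is 1 when its variable is unobserved (marginalised out)
        # or matches the evidence, and 0 otherwise.
        return 1.0 if evidence.get(self.var, self.value) == self.value else 0.0

class Product:
    def __init__(self, children):
        self.children = children                   # disjoint scopes (decomposability)
    def eval(self, evidence):
        r = 1.0
        for c in self.children:
            r *= c.eval(evidence)
        return r

class Sum:
    def __init__(self, weighted_children):         # [(w_j, child_j)], weights sum to 1
        self.weighted_children = weighted_children # identical scopes (completeness)
    def eval(self, evidence):
        return sum(w * c.eval(evidence) for w, c in self.weighted_children)

# S over binary X1, X2: a complete sum of two decomposable products.
S = Sum([(0.25, Product([Leaf('X1', 1), Leaf('X2', 1)])),
         (0.75, Product([Leaf('X1', 0), Leaf('X2', 1)]))])
assert S.eval({'X1': 1, 'X2': 1}) == 0.25   # P(X1=1, X2=1)
assert S.eval({'X2': 1}) == 1.0             # marginal query: X1 summed out
```

The second assertion shows the tractable-marginal property: leaving X1 out of the evidence makes both of its indicators evaluate to 1, which sums the variable out in a single bottom-up pass.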
A Credal SPN (CSPN) is defined similarly, except for containing sets of
weight vectors in each sum node instead of a single weight vector. More precisely,
a CSPN C is defined by a set of SPNs C = {Sω : ω ∈ C} over the same graph
structure of S, where C is the Cartesian product of finitely-generated simplexes
Ci , one for each sum node i, such that the weights wi of a sum node i are con-
strained by Ci . While an SPN represents one joint distribution over its variables,
a CSPN represents a set of joint distributions. Therefore, one can use CSPNs to
obtain lower and upper bounds minω Eω (f |XE = xE ) and maxω Eω (f |XE = xE )
on the expected value of some function f of a variable, conditional on evidence
XE = xE . Recall that each choice of the weights ω of a CSPN {Sω : ω ∈ C}
412 A. H. C. Correia and C. P. de Campos
as to obtain the exact value of the minimisation one can run a binary search for
μ until minω Eω ((f − μ)|xE ) = 0 (to the desired precision). The following result
will be used here. Corollary 1 is a small variation of the result in [9].
Theorem 1 (Theorem 1 in [9]). Consider a CSPN C = {Sω : ω ∈ C}.
Computing minω Sω (xE ) and maxω Sω (xE ) takes O(|C| · K) time, where |C| is
the number of nodes and arcs in C, and K is an upper bound on the cost of
solving a linear program of the form minwi Σj ci,j wi,j subject to wi ∈ Ci.
Proof. When local simplexes Ci have constraints li,j ≤ wi,j ≤ ui,j , then the
local optimisations S^i(xE) = minwi Σj wi,j S^j(xE) are equivalent to fractional
knapsack problems [7], which can be solved in constant time for nodes with
bounded number of children. Thus, the overall running time is O(|S|).
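A minimal sketch of this greedy fractional-knapsack solution for the local optimisation at a sum node, assuming interval constraints lj ≤ wj ≤ uj with Σ lj ≤ 1 ≤ Σ uj; the function name and example numbers are ours.

```python
def min_sum_node(child_values, lower, upper):
    """Lower envelope min_w sum_j w_j * s_j subject to l_j <= w_j <= u_j and
    sum_j w_j = 1.  Greedy solution: start from the lower bounds and spend the
    remaining mass on the cheapest children first."""
    w = list(lower)
    remaining = 1.0 - sum(lower)
    # Children with the smallest values receive as much of the remaining
    # mass as their interval allows, in ascending order of value.
    for j in sorted(range(len(child_values)), key=lambda j: child_values[j]):
        add = min(upper[j] - lower[j], remaining)
        w[j] += add
        remaining -= add
    return sum(wj * sj for wj, sj in zip(w, child_values))

# Two children with values 0.2 and 0.8, each weight free in [0.25, 0.75]:
# the minimum puts as much weight as possible on the cheaper child.
assert min_sum_node([0.2, 0.8], [0.25, 0.25], [0.75, 0.75]) == 0.75 * 0.2 + 0.25 * 0.8
```

Maximisation is symmetric (spend the remaining mass on the most expensive children first), which is what the upper bound maxω Sω(xE) requires.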
The name class-selective was inspired by selective SPNs [13], where only one
child of each sum node is active at a time. In a class-selective SPN, such a property
is restricted to the root node: for a given class value, only one of the sub-networks
remains active. That is made clear in Fig. 1 as only one of the indicator nodes
Cn is non-zero and all but one of the children of the root node evaluate to zero.
...
where r is the root node index with children S^xc for each value xc. Note that each
of these internal optimisations can be obtained by independent executions which
take altogether time O(|S|) by Corollary 1 (as each execution runs over non-
overlapping sub-CSPNs corresponding to different class labels xc ). Moreover,
note that in a non-credal class-selective SPN, finding the class label of maximum
probability (and its probability) takes time O(|S|) in the worst case. That is more
efficient than general SPNs, where |S| · |Xc | nodes may need to be visited.
Let us turn our attention to the CSPN robustness estimation in a classification
problem. Given an input instance XE = xE for which we want to predict the
class variable value, we say that the classification using a CSPN C is robust if
the class value x∗c = arg maxxc P(xc | xE) predicted by an SPN Sω that belongs to
C = {Sω : ω ∈ C} is also the prediction of any other Sω ∈ C (hence it is unique
for C), which happens if and only if
minω Eω(Ix∗c − Ixc | XE = xE) > 0 for every xc ≠ x∗c, (3)
that is, regardless of the choice of weights ω ∈ C, we would have Pω(x∗c | xE) >
Pω(xc | xE) for all other labels xc.
General CSPNs may require 2 · |S| · (|Xc| − 1) node evaluations in the worst
case to identify whether a predicted class label x∗c is robust for instance XE = xE,
while a class-selective CSPN obtains such a result in |S| node evaluations. This
is because CSPNs will run over their nodes (|Xc| − 1) times in order to reach a
conclusion about Expression (3), while the class-selective CSPN can compute
minω Sω(x∗c, xE) and maxω Sω(xc, xE) (the max is done for each xc ≠ x∗c) only once
(taking overall |S| node evaluations, since they run over non-overlapping sub-
networks for different class values).
Finally, given an input instance XE = xE and an SPN S learned from data,
we can compute a robustness measure as follows. We define a collection of CSPNs
CS,ε parametrised by 0 ≤ ε < 1, such that each wi of a node i in CS,ε is allowed
to vary within an ε-contaminated credal set of the original weight vector wi of
the same node i in S. A robustness measure for the issued prediction CS,ε(xE)
is then defined by the largest ε such that CS,ε is robust for xE. Finding such
an ε can be done using a simple binary search, as shown in Algorithm 1.
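Algorithm 1 itself is not reproduced in this excerpt, but the binary search it describes can be sketched as below; `is_robust` is a placeholder for the CSPN check of Expression (3) on the ε-contaminated model.

```python
def robustness(is_robust, passes=8):
    """Binary search for the largest eps in [0, 1) for which the prediction stays
    robust (sketch of Algorithm 1).  `is_robust(eps)` stands for the CSPN check
    of Expression (3); 8 passes give precision 1/2**8 < 0.004, matching the
    experimental setup reported with Table 2."""
    lo, hi = 0.0, 1.0
    for _ in range(passes):
        mid = (lo + hi) / 2
        if is_robust(mid):
            lo = mid            # still robust: the answer is at least mid
        else:
            hi = mid            # robustness broken: the answer is below mid
    return lo

# With a (hypothetical) model that stays robust up to contamination 0.3:
eps = robustness(lambda e: e <= 0.3)
assert abs(eps - 0.3) < 1 / 2**8
```

Each `is_robust` call costs |S| node evaluations for a class-selective CSPN, so the whole measure costs a small constant number of passes through the network.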
4 Memoisation
When evaluating an SPN on a number of different instances, some of the com-
putations are likely to reoccur, especially in nodes with scopes of only a few
variables. One simple and yet effective solution to reduce computing time is to
cache the results at each node. Thus, when evaluating the network recursively,
we do not have to visit any of the children of a node if a previously evaluated data
point had the same instantiation over the scope of that node. To be more precise,
consider a node S with scope S, and two instances x, x′ such that xS = x′S. It is
clear that S(x) = S(x′), so once we have cached the value of one, there is no need
to reevaluate node S, or any of its children, on the other. For a CSPN C, the
same approach holds, but we need different caches for maximisation and min-
imisation, as well as for different SPNs Sω that belong to C (after all, a change
of ω may imply a change of the result). Notice that the computational overhead
of memoisation is amortised constant, as it amounts to accessing a hash table.
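A minimal sketch of this caching scheme, assuming each node knows its scope: the cache key is the node together with the evidence restricted to that scope, so two instances that agree on the scope reuse the same value. The toy `Node` class (our own, not the paper's) counts evaluations so the cache hit is visible.

```python
class Node:
    """Toy node with a scope; `evaluate` counts calls to expose the cache effect."""
    calls = 0
    def __init__(self, scope, fn):
        self.scope, self.fn = scope, fn
    def evaluate(self, evidence):
        Node.calls += 1
        return self.fn(evidence)

def eval_memo(node, evidence, cache):
    # Key: the node plus the instantiation restricted to the node's scope.
    key = (id(node),
           tuple(sorted((v, val) for v, val in evidence.items() if v in node.scope)))
    if key not in cache:
        cache[key] = node.evaluate(evidence)   # recursive evaluation as usual
    return cache[key]

n = Node({'X1'}, lambda e: float(e.get('X1', 0)))
cache = {}
eval_memo(n, {'X1': 1, 'X2': 0}, cache)
eval_memo(n, {'X1': 1, 'X2': 1}, cache)   # same restriction to scope {X1}: cache hit
assert Node.calls == 1
```

The lookup is a single hash-table access, which is the amortised-constant overhead mentioned above.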
Mei et al. proposed computing maximum a posteriori (MAP) by storing the
values of nodes at a given evidence and searching for the most likely query in a
reduced SPN, where nodes associated with the evidence are pruned [10]. Memoi-
sation can be seen as eliminating such nodes implicitly (as their values are stored
and they are not revisited) but goes further by using values calculated at other
instances to save computational power. In fact, the application of memoisation
to MAP inference is a promising avenue for future research, since many methods
(e.g. hill climbing) evaluate the SPN at small variations of the same input that
are likely to share partial instantiations in many of the nodes in the network.
5 Experiments
We investigated the performance of memoisation and class-selective (C)SPNs
through a series of experiments over a range of 25 UCI datasets [8]. All experi-
ments were run in a single modern core with our implementation of Credal SPNs,
which runs LearnSPN [6] for structure learning. Source code is available on our
pages and/or upon request.
Table 1 presents the UCI data sets on which we ran experiments described
by their number of independent instances N , number of variables |X| (including
a class variable Xc ) and number of class labels |Xc |. All data sets are categor-
ical (or have been made categorical using discretisation by median value). We
also show the 0–1 classification accuracy obtained by both General and Class-
selective SPNs as well as the XGBoost library that provides a parallel tree gra-
dient boosting method [2], considered a state-of-the-art technique for supervised
classification tasks. Results are obtained by stratified 5-fold cross-validation, and
as one can inspect, class-selective SPNs largely outperformed general SPNs while
being comparable to XGBoost in terms of classification accuracy.
We leave further comparisons between general and class-selective networks
to the appendix, where Table 3 depicts the two types of network in terms of their
architecture and processing times on classification tasks. There one can see that
class-selective SPNs have a higher number of parameters due to a larger number
of sum nodes. However, in some cases general SPNs are deeper, which means
class-selective SPNs tend to grow sideways, especially when the number of classes
is high. Nonetheless, the larger number of parameters in class-selective networks
Table 1. Percent accuracy of XGBoost, General SPNs and Class-selective SPNs across
several UCI datasets. All experiments consisted in stratified 5-fold cross validation.
does not translate into higher latency as both architectures have similar learning
and inference times. We attribute that to the independence of the subnetwork
of each class which facilitates inference. Notice that the two architectures are
equally efficient only in the classification task (only aspect compared in Table 3)
and not on robustness computations. We mathematically proved the latter to be
more efficient in class-selective networks when using Algorithm 1.
In Table 2, we have the average inference time per instance for 25 UCI
datasets. When using memoisation, the inference time dropped by at least 50% in
all datasets we evaluated, showing that memoisation is a valuable tool to render
(C)SPNs more efficient. We can also infer from the experiments that the relative
reduction in computing time tends to increase with the number of instances
N . For datasets with more than 2000 instances, which are still relatively small,
adding memoisation already cut the inference time by more than 90%. That is a
promising result for large scale applications, as memoisation grows more effective
with number of data points. We can better observe the effect of memoisation by
plotting a histogram of the inference times as in Fig. 2, where we have 6 of the
UCI datasets of Table 2. We can see that memoisation concentrates the distribution
at lower time values, showing that most instances take advantage of the
cached results in a number of nodes in the network.
Table 2. Average inference time and number of nodes visited per inference for a CSPN
with (+M) and without (−M) memoisation across 25 UCI datasets. The respective
ratios (%) are the saved time or node visits, that is, 1 − (+M)/(−M). Robustness was
computed with precision of 0.004, which requires 8 passes through the network as per
Algorithm 1.
If we consider the number of nodes visited instead of time, the results are
even more telling. In Table 2, on average, less than 15% of node evaluations were
necessary during inference with memoisation. That is a more drastic reduction
than what we observed when comparing processing times, which means there is
still plenty of room for improvement in computational time with better data
structures or faster programming languages.
It is worth noting that memoisation is only effective over discrete variables.
When a subset of the variables is continuous, the memoisation will only be
effective in nodes whose scope contains only discrete variables, which is likely to
reduce the computational gains of memoisation. An alternative is to construct
the cache with ranges of values for the continuous variables. To be sure, that
is a form of discretisation that might worsen the performance of the model,
but it occurs only at inference time and can be easily switched off when high
precision is needed. A thorough study of the effect of memoisation on models
with continuous variables is left for future work.
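The range-based cache key suggested above could look as follows; the bin width, names, and toy node are our own assumptions, not part of the paper.

```python
from types import SimpleNamespace

def binned_key(node, evidence, width=0.1):
    """Cache key that discretises continuous values into ranges of `width` at
    inference time only -- the trade-off discussed above.  Discrete values are
    kept as-is, so only the continuous part of the scope is coarsened."""
    def bucket(v):
        return int(v // width) if isinstance(v, float) else v
    return (id(node), tuple(sorted((x, bucket(val)) for x, val in evidence.items()
                                   if x in node.scope)))

n = SimpleNamespace(scope={'X'})
assert binned_key(n, {'X': 0.51}) == binned_key(n, {'X': 0.57})  # same 0.1-wide range
assert binned_key(n, {'X': 0.51}) != binned_key(n, {'X': 0.75})
```

Switching the cache back to exact values when high precision is needed amounts to dropping the `bucket` step.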
Fig. 2. Empirical distribution of CSPN inference times with and without memoisation.
Fig. 3. Accuracy of predictions with robustness (a) above and (b) below different
thresholds for 12 UCI datasets. Some curves end abruptly because we only computed
the accuracy when 50 or more data points were available for a given threshold.
Fig. 4. Performance of the ensemble model against different robustness thresholds for
12 UCI datasets. (a) Accuracy of the ensemble model; (b) accuracy of the ensemble
model over accuracy of the CSPN and (c) over accuracy of the XGBoost.
the threshold and on f for the remaining ones. To be precise, we can define an
ensemble EC,f as
EC,f(xE, t) = { x∗c if ε ≥ t; f(xE) otherwise },
where x∗c = arg maxxc S(xc, xE) is the class predicted by a class-selective SPN S
learned from data (from which the ε-contaminated CSPN C is built), and ε is the
corresponding robustness value, ε = Robustness(S, xE, x∗c) (see Algorithm 1).
We implemented such an ensemble model by combining the ε-contaminated
CSPN with an XGBoost model. We computed the accuracy of the ensemble for
different thresholds t over a range of UCI data sets, as reported in Fig. 4. In plot
(a), we see the accuracy varies considerably with the threshold t, which suggests
there is an optimum value for t. In the other two plots, we compare the ensemble
against the CSPN (b); and the XGBoost model (c). We computed the ratio of
the accuracy of the ensemble over the accuracy of the competing model, so that
any point above one indicates an improvement in performance. For many datasets, the
ensemble delivered better results and in some cases was superior to both original
models for an appropriate robustness threshold. In spite of that, we have not
investigated how to find good thresholds, which we leave for future work. Yet,
we point out that the complexity of queries using the class-selective CSPN in
the ensemble will be the same as that of class-selective SPNs (the robustness
comes “for free”), since the binary search for the computation of the threshold
will not be necessary (we can run it for the pre-decided t only).
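The ensemble E_{C,f} defined above can be sketched as below; all three callables are placeholders for the trained class-selective SPN, its robustness routine, and the fallback classifier f (e.g. XGBoost).

```python
def ensemble(spn_predict, robustness, fallback, x_e, t):
    """E_{C,f}(x_E, t): trust the class-selective SPN prediction when the
    robustness eps of that prediction reaches the threshold t, otherwise
    defer to the fallback classifier f."""
    x_star = spn_predict(x_e)      # x*_c = argmax_xc S(xc, x_E)
    eps = robustness(x_e)          # eps = Robustness(S, x_E, x*_c)
    return x_star if eps >= t else fallback(x_e)

# Hypothetical instance: the SPN predicts class 0 with robustness 0.02, below
# the threshold t = 0.1, so the ensemble returns the fallback's prediction.
assert ensemble(lambda x: 0, lambda x: 0.02, lambda x: 1, {'X1': 1}, 0.1) == 1
```

With t fixed in advance, the binary search of Algorithm 1 reduces to a single robustness check at eps = t, so the ensemble query costs the same as a plain class-selective SPN classification.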
6 Conclusion
SPNs and their credal version are highly expressive deep architectures with
tractable inference, which makes them a promising option for large-scale appli-
cations. However, they still lag behind some other machine learning models in
terms of architectural optimisations and fast implementations. We address that
through the introduction of memoisation, a simple yet effective technique that
caches previously computed results in a node. In our experiments, memoisation
reduced the number of node evaluations by more than 85% and the inference
time by at least 50% (often much more). We believe this is a valuable new tool
to help bring (C)SPNs to large-scale applications where low latency is essential.
We also discussed a new architecture, class-selective (C)SPNs, that combine
efficient robustness computation with high accuracy on classification tasks, out-
performing general (C)SPNs. Even though they excel in discriminative tasks,
class-selective SPNs are still generative models fully endowed with the seman-
tics of graphical models. We demonstrated how their probabilistic semantics can
be brought to bear through their extension to Credal SPNs. Namely, we explored
how robustness values relate to the accuracy of the model and how one can use
them to develop ensemble models guided through principled decision-making.
We finally point out some interesting directions for future work. As demon-
strated here, class-selective (C)SPNs have proven to be powerful models in clas-
sification tasks, but they arbitrarily place the class variable in a privileged posi-
tion in the network. Future research might investigate how well class-selective
(C)SPNs fit the joint distribution and how they would fare in predicting other
variables. Memoisation also opens up new promising research avenues, notably
in how it performs on other inference tasks such as MAP and how it can be
extended to accommodate continuous variables.
A
Table 3. Comparison between General (Gen) and Class-Selective (CS) SPNs in learning
and average inference times (s), number of nodes, height and number of parameters.
In Table 3 we compare general and class-selective SPNs in terms of their architecture
and processing times.
421
Learning Models over Relational Data: A
Brief Tutorial
Abstract. This tutorial overviews the state of the art in learning mod-
els over relational databases and makes the case for a first-principles
approach that exploits recent developments in database research.
The input to learning classification and regression models is a training
dataset defined by feature extraction queries over relational databases.
The mainstream approach to learning over relational data is to materi-
alize the training dataset, export it out of the database, and then learn
over it using a statistical package. This approach can be expensive as it
requires the materialization of the training dataset. An alternative app-
roach is to cast the machine learning problem as a database problem by
transforming the data-intensive component of the learning task into a
batch of aggregates over the feature extraction query and by computing
this batch directly over the input database.
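As a toy sketch of this alternative approach (the schema, table and column names below are invented for illustration), the data-intensive part of learning a linear regression reduces to the sufficient statistics XᵀX and Xᵀy, i.e. a batch of SUM aggregates computed directly over the feature extraction query, with no materialised training dataset leaving the database:

```python
import sqlite3

# Hypothetical two-relation schema; the feature extraction query is
# their natural join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(store_id INT, x1 REAL, y REAL);
    CREATE TABLE stores(store_id INT, x2 REAL);
    INSERT INTO sales VALUES (1, 1.0, 3.0), (1, 2.0, 5.0), (2, 3.0, 7.0);
    INSERT INTO stores VALUES (1, 0.5), (2, 1.5);
""")

# The entries of X^T X and X^T y for y ~ (x1, x2) are a batch of SUM
# aggregates pushed over the feature extraction query.
sxx, sxz, szz, sxy, szy, n = conn.execute("""
    SELECT SUM(x1*x1), SUM(x1*x2), SUM(x2*x2), SUM(x1*y), SUM(x2*y), COUNT(*)
    FROM sales NATURAL JOIN stores
""").fetchone()
```

The normal equations can then be solved outside the database on these few numbers, whose size is independent of the number of training rows.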
The tutorial highlights a variety of techniques developed by the
database theory and systems communities to improve the performance of
the learning task. They rely on structural properties of the relational data
and of the feature extraction query, including algebraic (semi-ring), com-
binatorial (hypertree width), statistical (sampling), or geometric (dis-
tance) structure. They also rely on factorized computation, code special-
ization, query compilation, and parallelization.
This project has received funding from the European Union’s Horizon 2020 research
and innovation programme under grant agreement No. 682588.
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 423–432, 2019.
https://doi.org/10.1007/978-3-030-35514-2_32
424 M. Schleich et al.
exports it as one table commonly in CSV or binary format. The ML library then
imports the training dataset in its own format and learns the desired model.
For the first step, it is common to use open source database management
systems, such as PostgreSQL or SparkSQL [57], or query processing libraries,
such as Python Pandas [33] and R dplyr [56]. Common examples for ML libraries
include scikit-learn [44], R [46], TensorFlow [1], and MLlib [34].
One advantage is the delegation of concerns: Database systems are used to
deal with data, whereas statistical packages are for learning models. Using this
approach, one can learn virtually any model over any database.
The key disadvantage is the non-trivial time spent on materializing, exporting, and importing the training dataset, which is commonly orders of magnitude larger than the input database. Even though ML libraries are much less scalable than data systems, this approach thus expects them to work on much larger inputs. Furthermore, these solutions inherit the limitations of both underlying systems: e.g., the maximum data frame size in R and the maximum number of columns in PostgreSQL are much smaller than typical database sizes and numbers of model features, respectively.
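The blow-up mentioned above is easy to reproduce: a many-to-many join materialises far more tuples than its inputs contain, as in this toy sketch (relation contents invented for illustration):

```python
import itertools

# Two small relations sharing a single join key; the materialised join
# (the training dataset) is far larger than either input relation.
r = [("k", i) for i in range(100)]   # 100 tuples
s = [("k", j) for j in range(100)]   # 100 tuples

# Naive join by cross product + filter, as a materialisation would do.
join = [(a, b) for (ka, a), (kb, b) in itertools.product(r, s) if ka == kb]
# 100 + 100 input tuples materialise into 10,000 training rows.
```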
A second approach is based on a loose integration of the two systems, with code
of the statistical package migrated inside the database system space. In this
approach, each machine learning task is implemented as a distinct user-defined
aggregate function (UDAF) inside the database system. For instance, there are
distinct UDAFs for learning logistic regression models, linear regression models, k-means, Principal Component Analysis, and so on. Each of these UDAFs is registered in the underlying database system, and there is a keyword in the query language supported by the database system to invoke it. The benefit is the
direct interface between the two systems, with one single process running for
both the construction of the training dataset and learning. The database system
computes one table, which is the training dataset, and the learning task works
directly on it. A prime example of this approach is MADlib [23], which extends
PostgreSQL with a comprehensive library of machine learning UDAFs. The key
advantage of this approach over the previous one is better runtime performance,
since it does not need to export and import the (usually large) training dataset.
Nevertheless, one has to explicitly write a UDAF for each new model and opti-
mization method, essentially redoing the large implementation effort behind well-
established statistical libraries. Approaches discussed in the next sections also
suffer from this limitation, yet some contribute novel learning algorithms that
can be asymptotically faster than existing off-the-shelf ones.
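The UDAF idea can be sketched with SQLite's Python bindings as a stand-in for PostgreSQL (the aggregate name and class below are illustrative): a learning task, here the slope of a univariate linear regression, is registered as an aggregate and invoked directly from SQL:

```python
import sqlite3

class SimpleLinReg:
    """UDAF accumulating sufficient statistics for y = a*x + b."""
    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def step(self, x, y):            # called once per input row
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def finalize(self):              # called once, returns the slope a
        denom = self.n * self.sxx - self.sx ** 2
        return (self.n * self.sxy - self.sx * self.sy) / denom

conn = sqlite3.connect(":memory:")
conn.create_aggregate("linreg_slope", 2, SimpleLinReg)
conn.execute("CREATE TABLE t(x REAL, y REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(0, 1), (1, 3), (2, 5)])
slope = conn.execute("SELECT linreg_slope(x, y) FROM t").fetchone()[0]
```

A single process computes the training dataset and learns over it, as in the text; the price is one hand-written UDAF per model.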
A variation of the second approach provides a unified programming archi-
tecture, one framework for many machine learning tasks instead of one distinct
UDAF per task, with possible code reuse across UDAFs. A prime example of this approach is Bismarck [16], a system that supports incremental (stochastic) gradient descent for convex programming. Its drawback is that its code may be less
efficient than the specialized UDAFs. Code reuse across various models and optimization problems may however speed up the development of new functionalities such as new models and optimization algorithms.

[Figure: the mainstream approach — a feature extraction query over the DB materializes its output as the training dataset, which is exported to an ML tool that learns the model parameters θ.]
The aforementioned approaches do not exploit the structure of the data residing
in the database. The next and final approach features a tight integration of the
data and learning systems. The UDAF for the machine learning task is pushed
into the feature extraction query and one single evaluation plan is created to
compute both of them. This approach enables database optimizations such as
pushing parts of the UDAFs past the joins of the feature extraction query. Prime examples are Orion [29], which supports generalized linear models, Hamlet [30], which supports logistic regression and naïve Bayes, Morpheus [11], which supports linear and logistic regression, k-means clustering, and Gaussian non-negative matrix
factorization, F [40,41,51], which supports ridge linear regression, AC/DC [3],
which supports polynomial regression and factorization machines [47–49], and
LMFAO [50], which supports a larger class of models including the previously
mentioned ones and decision trees [10], Chow-Liu trees [12], mutual information,
and data cubes [19,22].
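The optimisation of pushing aggregates past joins can be made concrete on a toy example (relation contents and function names are ours): SUM(x·y) over a two-relation join factorises into per-key partial sums, so the join itself never needs to be materialised:

```python
from collections import defaultdict

def sum_product_over_join(r, s):
    """Compute SUM(r.x * s.y) over the natural join of r(k, x) and
    s(k, y) without materialising the join: partial SUMs are pushed
    past the join and combined per join key."""
    sum_x, sum_y = defaultdict(float), defaultdict(float)
    for k, x in r:
        sum_x[k] += x
    for k, y in s:
        sum_y[k] += y
    # For each key, the join pairs every x with every y, so the key's
    # contribution factorises as (sum of x) * (sum of y).
    return sum(sum_x[k] * sum_y[k] for k in sum_x.keys() & sum_y.keys())
```

The factorised computation is linear in the input relations, whereas materialising the join first can be quadratic.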
3 Structure-Aware Learning
The tightly-integrated systems F [51], AC/DC [3], and LMFAO [50] are data
structure-aware in that they exploit the structure and sparsity of the database
to lower the complexity and drastically improve the runtime performance of the
This tutorial is a call to arms for more sustained and principled work on the
theory and systems of structure-aware approaches to data analytics. What are the
theoretical limits of structure-aware learning? What are the classes of machine
learning models that can benefit from structure-aware learning over relational
data? What other types of structure can benefit learning over relational data?
References
1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI,
pp. 265–283 (2016)
2. Abo Khamis, M., et al.: On functional aggregate queries with additive inequalities.
In: PODS, pp. 414–431 (2019)
3. Abo Khamis, M., Ngo, H.Q., Nguyen, X., Olteanu, D., Schleich, M.: AC/DC: In-
database learning thunderstruck. In: DEEM, pp. 8:1–8:10 (2018)
4. Abo Khamis, M., Ngo, H.Q., Nguyen, X., Olteanu, D., Schleich, M.: In-database
learning with sparse tensors. In: PODS, pp. 325–340 (2018)
5. Abo Khamis, M., Ngo, H.Q., Olteanu, D., Suciu, D.: Boolean tensor decomposition
for conjunctive queries with negation. In: ICDT, pp. 21:1–21:19 (2019)
6. Abo Khamis, M., Ngo, H.Q., Rudra, A.: FAQ: questions asked frequently. In:
PODS, pp. 13–28 (2016)
7. Abo Khamis, M., Ngo, H.Q., Suciu, D.: What do Shannon-type inequalities, sub-
modular width, and disjunctive datalog have to do with one another? In: PODS,
pp. 429–444 (2017)
8. Atserias, A., Grohe, M., Marx, D.: Size bounds and query plans for relational joins.
In: FOCS, pp. 739–748 (2008)
9. Bakibayev, N., Kociský, T., Olteanu, D., Závodný, J.: Aggregation and ordering
in factorised databases. PVLDB 6(14), 1990–2001 (2013)
10. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression
Trees. Wadsworth and Brooks, Monterey (1984)
11. Chen, L., Kumar, A., Naughton, J.F., Patel, J.M.: Towards linear algebra over
normalized data. PVLDB 10(11), 1214–1225 (2017)
12. Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theor. 14(3), 462–467 (1968)
13. Curtin, R.R., Edel, M., Lozhnikov, M., Mentekidis, Y., Ghaisas, S., Zhang, S.:
mlpack 3: a fast, flexible machine learning library. J. Open Source Soft. 3, 726
(2018)
14. Curtin, R.R., Moseley, B., Ngo, H.Q., Nguyen, X., Olteanu, D., Schleich, M.: Rk-
means: fast coreset construction for clustering relational data (2019)
15. Elghandour, I., Kara, A., Olteanu, D., Vansummeren, S.: Incremental techniques
for large-scale dynamic query processing. In: CIKM, pp. 2297–2298 (2018). Tutorial
16. Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for in-
RDBMS analytics. In: SIGMOD, pp. 325–336 (2012)
17. van Geffen, B.: QR decomposition of normalised relational data. MSc thesis, University of Oxford (2018)
18. Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. The Johns Hopkins
University Press, Baltimore (2013)
19. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: A relational aggre-
gation operator generalizing group-by, cross-tab, and sub-total. In: ICDE, pp. 152–
159 (1996)
20. Grohe, M., Marx, D.: Constraint solving via fractional edge covers. In: SODA, pp.
289–298 (2006)
21. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: SIGMOD, pp.
287–298 (1999)
22. Harinarayan, V., Rajaraman, A., Ullman, J.D.: Implementing data cubes effi-
ciently. In: SIGMOD, pp. 205–216 (1996)
23. Hellerstein, J.M., et al.: The MADlib analytics library or MAD skills, the SQL.
PVLDB 5(12), 1700–1711 (2012)
24. Inelus, G.R.: Quadratically Regularised Principal Component Analysis over multi-
relational databases, MSc thesis, University of Oxford (2019)
25. Joachims, T.: Training linear SVMS in linear time. In: SIGKDD, pp. 217–226
(2006)
26. Kaggle: The State of Data Science and Machine Learning (2017). https://www.
kaggle.com/surveys/2017
27. Kara, A., Ngo, H.Q., Nikolic, M., Olteanu, D., Zhang, H.: Counting triangles under
updates in worst-case optimal time. In: ICDT, pp. 4:1–4:18 (2019)
28. Koch, C., Ahmad, Y., Kennedy, O., Nikolic, M., Nötzli, A., Lupei, D., Shaikhha,
A.: DBToaster: higher-order delta processing for dynamic, frequently fresh views.
VLDB J. 23(2), 253–278 (2014)
29. Kumar, A., Naughton, J.F., Patel, J.M.: Learning generalized linear models over
normalized data. In: SIGMOD, pp. 1969–1984 (2015)
30. Kumar, A., Naughton, J.F., Patel, J.M., Zhu, X.: To join or not to join?: thinking
twice about joins before feature selection. In: SIGMOD, pp. 19–34 (2016)
31. Li, F., Wu, B., Yi, K., Zhao, Z.: Wander join and XDB: online aggregation via
random walks. ACM Trans. Database Syst. 44(1), 2:1–2:41 (2019)
32. Marx, D.: Approximating fractional hypertree width. ACM Trans. Algorithms 6(2),
29:1–29:17 (2010)
33. McKinney, W.: pandas: a foundational python library for data analysis and statis-
tics. Python High Perform. Sci. Comput. 14 (2011)
34. Meng, X., et al.: MLlib: machine learning in Apache Spark. J. Mach. Learn. Res.
17(1), 1235–1241 (2016)
35. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cam-
bridge (2013)
36. Neumann, T.: Efficiently compiling efficient query plans for modern hardware.
PVLDB 4(9), 539–550 (2011)
37. Ngo, H.Q., Porat, E., Ré, C., Rudra, A.: Worst-case optimal join algorithms. In:
PODS, pp. 37–48 (2012)
38. Ngo, H.Q., Ré, C., Rudra, A.: Skew strikes back: New developments in the theory
of join algorithms. In: SIGMOD Rec., pp. 5–16 (2013)
39. Nikolic, M., Olteanu, D.: Incremental view maintenance with triple lock factoriza-
tion benefits. In: SIGMOD, pp. 365–380 (2018)
40. Olteanu, D., Schleich, M.: F: regression models over factorized views. PVLDB
9(10), 1573–1576 (2016)
41. Olteanu, D., Schleich, M.: Factorized databases. SIGMOD Rec. 45(2), 5–16 (2016)
42. Olteanu, D., Závodný, J.: Size bounds for factorised representations of query
results. TODS 40(1), 2 (2015)
43. Park, Y., Qing, J., Shen, X., Mozafari, B.: Blinkml: efficient maximum likelihood
estimation with probabilistic guarantees. In: SIGMOD, pp. 1135–1152 (2019)
44. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
45. Poon, H., Domingos, P.M.: Sum-product networks: a new deep architecture. In:
UAI, pp. 337–346 (2011)
46. R Core Team: R: A Language and Environment for Statistical Computing. R Foun-
dation for Stat. Comp. (2013). www.r-project.org
47. Rendle, S.: Factorization machines. In: Proceedings of the 2010 IEEE International
Conference on Data Mining. ICDM 2010, pp. 995–1000. IEEE Computer Society,
Washington, DC (2010)
48. Rendle, S.: Factorization machines with libFM. ACM Trans. Intell. Syst. Technol.
3(3), 57:1–57:22 (2012)
49. Rendle, S.: Scaling factorization machines to relational data. PVLDB 6(5), 337–348
(2013)
50. Schleich, M., Olteanu, D., Abo Khamis, M., Ngo, H.Q., Nguyen, X.: A layered
aggregate engine for analytics workloads. In: SIGMOD, pp. 1642–1659 (2019)
51. Schleich, M., Olteanu, D., Ciucanu, R.: Learning linear regression models over
factorized joins. In: SIGMOD, pp. 3–18 (2016)
52. Shaikhha, A., Klonatos, Y., Koch, C.: Building efficient query engines in a high-
level language. TODS 43(1), 4:1–4:45 (2018)
53. Shaikhha, A., Klonatos, Y., Parreaux, L., Brown, L., Dashti, M., Koch, C.: How
to architect a query compiler. In: SIGMOD, pp. 1907–1922 (2016)
54. Udell, M., Horn, C., Zadeh, R., Boyd, S.: Generalized low rank models. Found.
Trends Mach. Learn. 9(1), 1–118 (2016)
55. Veldhuizen, T.L.: Triejoin: a simple, worst-case optimal join algorithm. In: ICDT,
pp. 96–106 (2014)
56. Wickham, H., Francois, R., Henry, L., Müller, K., et al.: dplyr: a grammar of data
manipulation. R package version 0.4 3 (2015)
57. Zaharia, M., Chowdhury, M., et al.: Resilient distributed datasets: a fault-tolerant
abstraction for in-memory cluster computing. In: NSDI, p. 2 (2012)
Subspace Clustering and Some Soft
Variants
Marie-Jeanne Lesot
1 Introduction
In the unsupervised learning framework, the only available input is a set of data,
here considered to be numerically described by feature vectors. The aim is then
to extract information from the data, e.g. in the form of linguistic summaries
(see e.g. [17]), frequent value co-occurrences, as expressed by association rules, or
as clusters. The latter are subgroups of data that are both compact and distinct,
which means that any data point is more similar to points assigned to the same
group than to points assigned to other groups. These clusters provide insight into the data structure and a summary of the data set.
Subspace clustering [3,28] is a refined form of the clustering task, where the
clusters are assumed to live in different subspaces of the feature space: on the
one hand, this assumption can help identify more relevant data subgroups, relaxing the need to use a single, global similarity relation; on the other hand, it leads to refining the identified data summaries, so as to characterise each cluster
through its associated subspace. These two points of view have led to the two
main families of subspace clustering approaches, which have slightly different aims
and definitions.
This paper first discusses in more detail the definition of the subspace clus-
tering task in Sect. 2 and presents in turn the two main paradigms, in Sects. 3
and 4 respectively. It then focuses on soft computing approaches that have been
proposed to perform subspace clustering, in particular fuzzy ones: fuzzy logic
tools have proved to be useful to all types of machine learning tasks, such as
classification, extraction of association rules or clustering. Section 5 describes
their applications to the case of subspace clustering. Section 6 concludes the
paper.
global feature space before characterising them. Now subspace clustering aims at
extracting more appropriate clusters that can be identified in lower dimensional
spaces only. Reciprocally, first performing feature selection and then clustering
the data would impose a subspace common to all clusters. Subspace cluster-
ing addresses both subgroup and feature identification simultaneously, so as to
obtain better subgroups, defined locally.
Subspace clustering is especially useful for high dimensional data, due to the
curse of dimensionality that makes all distances between pairs of points have
very close values: it can be the case that there exists no dense data subgroup in
the whole feature space and that clusters can only be identified when considering
subspaces with lower dimensionality.
initially defined as the whole feature space. The refinement step consists in pro-
jecting the data to subspaces, making it possible to identify the cluster structure
of the data even when there is no dense clusters in the full feature space.
Proclus [1] belongs to this framework of projected clustering; it identifies three components: (i) clusters, (ii) associated dimensions that define axis-parallel subspaces, as well as (iii) outliers, i.e. points that are assigned to none of the clusters. The candidate projection subspaces are defined by the dimensions along which the cluster members have the lowest dispersion. Orclus [2] is a variant of Proclus that can identify subspaces that are not parallel to the initial axes.
to cluster r, w_r = (w_{r1}, ..., w_{rd}) for r = 1..c the dimension weights for cluster r, and η and q two hyperparameters, FSC considers the cost function

J_{FSC} = \sum_{i=1}^{n} \sum_{r=1}^{c} \sum_{p=1}^{d} u_{ri} w_{rp}^{q} (x_{ip} - c_{rp})^2 + \eta \sum_{r=1}^{c} \sum_{p=1}^{d} w_{rp}^{q}    (1)

under the constraints u_{ri} ∈ {0, 1}, \sum_{r=1}^{c} u_{ri} = 1 for all i, and \sum_{p=1}^{d} w_{rp} = 1 for all r. In this cost, the first term is identical to the k-means cost function when replacing the Euclidean distance by a weighted one; the second term is required so that the update equations are well defined [10]. The two terms are balanced by the η hyperparameter. The first two constraints are identical to the k-means ones; the third one forbids the trivial solution where all weights w_{rp} = 0. The q hyperparameter, defining the exponent of the weights w_{rp}, is similar to the fuzzifier m used in the fuzzy c-means algorithm to avoid converging to binary weights w_{rp} ∈ {0, 1} [20].
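The FSC cost of Eq. (1) can be evaluated directly from its terms; the sketch below (function and variable names are ours, not from the FSC papers) computes it for given hard assignments, weights and centres:

```python
def fsc_cost(x, u, w, c, eta, q):
    """J_FSC of Eq. (1): x[i][p] is the data, u[r][i] in {0, 1} the hard
    assignments, w[r][p] the per-cluster dimension weights, c[r][p] the
    centres; eta and q are the two hyperparameters."""
    n, d, k = len(x), len(x[0]), len(c)
    # First term: weighted quantisation error of the assigned clusters.
    fit = sum(u[r][i] * (w[r][p] ** q) * (x[i][p] - c[r][p]) ** 2
              for i in range(n) for r in range(k) for p in range(d))
    # Second term: regulariser making the weight updates well defined.
    reg = eta * sum(w[r][p] ** q for r in range(k) for p in range(d))
    return fit + reg
```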
The Entropy Weighted k-means algorithm, EWKM [16], is an extension that aims at controlling the sparsity of the dimension weights w_{rp}, so that they tend to equal 0 instead of being small but non-zero: to that aim, it replaces the second term in J_{FSC} with an entropy regularisation term, balanced with a γ parameter:

J_{EWKM} = \sum_{i=1}^{n} \sum_{r=1}^{c} \sum_{p=1}^{d} u_{ri} w_{rp} (x_{ip} - c_{rp})^2 + \gamma \sum_{r=1}^{c} \sum_{p=1}^{d} w_{rp} \log w_{rp}    (2)

under the same constraints. Making γ tend to 0 allows controlling the sparsity level of the w_{rp} weights.
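Minimising J_EWKM for fixed assignments and centres gives, for each cluster, dimension weights proportional to exp(−D_p/γ), where D_p is the cluster's dispersion along dimension p — a softmax over negative dispersions. The sketch below assumes this standard closed-form update (names are illustrative):

```python
import math

def ewkm_weights(dispersions, gamma):
    """Closed-form weight update for one cluster in EWKM:
    w_p proportional to exp(-D_p / gamma), normalised to sum to 1.
    Small gamma concentrates the weight on the least dispersed
    dimensions; large gamma spreads it towards uniformity."""
    e = [math.exp(-d / gamma) for d in dispersions]
    z = sum(e)
    return [v / z for v in e]
```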
5 Soft Variants
This section aims at detailing partitioning approaches, introduced in the previous
section, in the case where the point assignment to the cluster is not binary but
soft, i.e., using the above notations, uri ∈ [0, 1] instead of uri ∈ {0, 1}: they
constitute subspace extensions of the fuzzy c-means algorithm, called fcm, and
its variants (see e.g. [14,21] for overviews).
First, the Gustafson-Kessel algorithm [13] can be viewed as the fuzzy counterpart of the GMM discussed in the previous section: both use the Mahalanobis distance and consider weighted assignments to the clusters. They differ in the interpretation of these weights and in the cost function they consider: GMM optimises the log-likelihood of the data, in a probabilistic modelling framework, whereas Gustafson-Kessel considers a quantisation error, in an fcm manner.
Using the same notations as in the previous section, with the additional hyperparameters m, called the fuzzifier, and (α_r)_{r=1..c} ∈ R, the Attribute Weighted Fuzzy c-means algorithm, AWFCM [19], is based on the cost function

J_{AWFCM} = \sum_{i=1}^{n} \sum_{r=1}^{c} \sum_{p=1}^{d} u_{ri}^{m} w_{rp}^{q} (x_{ip} - c_{rp})^2    (3)

with u_{ri} ∈ [0, 1] and under the constraints \sum_{r=1}^{c} u_{ri} = 1 for all i, \sum_{i=1}^{n} u_{ri} > 0 for all r, and \sum_{p=1}^{d} w_{rp} = α_r for all r. The cost function is thus identical to the fcm one, replacing the Euclidean distance with its weighted variant. The first two constraints are also identical to the fcm ones; the third one forbids the trivial solution w_{rp} = 0. The (α_r) hyperparameters can also be used to weight the relative importance of the c clusters in the final partition, but they are usually all set to 1 [19].
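To make the fcm-style alternate optimisation concrete, the membership update under the weighted distance of Eq. (3) can be sketched as follows — a standard fcm-type update with the Euclidean distance replaced by the weighted one, with names of our choosing, not the authors' code:

```python
def memberships(x, c, w, m, q):
    """Fuzzy membership update u[r][i] = 1 / sum_s (D_ri / D_si)^(1/(m-1)),
    where D_ri is the weighted squared distance of point i to centre r.
    Assumes no point coincides exactly with a centre (D_ri > 0)."""
    def dist(i, r):
        return sum((w[r][p] ** q) * (x[i][p] - c[r][p]) ** 2
                   for p in range(len(x[0])))
    u = []
    for r in range(len(c)):
        row = []
        for i in range(len(x)):
            d_ri = dist(i, r)
            row.append(1.0 / sum((d_ri / dist(i, s)) ** (1.0 / (m - 1))
                                 for s in range(len(c))))
        u.append(row)
    return u
```

The memberships of each point over the c clusters sum to 1, as the first constraint requires.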
Many variants of AWFCM have been proposed, for instance to introduce sparsity in the subspace description: AWFCM indeed produces solutions where none of the w_{rp} parameters equals zero, even if they are very small. This is similar to a well-known effect of the fcm, where the optimisation actually leads to u_{ri} ∈ ]0, 1[, except for data points that are equal to a cluster centre: the membership degrees can be very small, but they cannot equal zero [20]. Borgelt [4] thus proposes to apply the sparsity-inducing constraints introduced for the membership degrees u_{ri} [20], considering

J_{BOR} = \sum_{i=1}^{n} \sum_{r=1}^{c} \sum_{p=1}^{d} g(u_{ri}) g(w_{rp}) (x_{ip} - c_{rp})^2    (4)
J_{WLFC} = \sum_{i=1}^{n} \sum_{r=1}^{c} \sum_{p=1}^{d} u_{ri}^{2} w_{rp}^{q} (x_{ip} - c_{rp})^2 + \gamma \sum_{i,j=1}^{n} \sum_{r=1}^{c} (u_{ri} - u_{rj})^2 s_{ij}    (5)
under the same constraints as AWFCM. In this cost, s_{ij} is a well-chosen global similarity measure [11] that imposes that neighbouring points in the whole feature space still have somewhat similar membership degrees. The γ hyperparameter balances the two effects and prevents discontinuities in the solution among point neighbourhoods.
The Proximal Fuzzy Subspace C-Means, PFSCM [12], considers the cost function defined as

J_{PFSCM} = \sum_{i=1}^{n} \sum_{r=1}^{c} \sum_{p=1}^{d} u_{ri}^{m} w_{rp}^{2} (x_{ip} - c_{rp})^2 + \gamma \sum_{r=1}^{c} \left| \sum_{p=1}^{d} w_{rp} - 1 \right|    (6)
under the first two constraints of AWFCM: the second term can be interpreted
as an inline version of the third constraint that is thus moved within the cost
function. As it is not differentiable, PFSCM proposes an original optimisation
scheme that does not rely on standard alternate optimisation but on proximal
descent (see e.g. [25]). This algorithm appears to identify the number of relevant dimensions for each cluster better than AWFCM, which tends to underestimate it. Moreover, the proposal to apply proximal optimisation techniques to the clustering task opens the way for defining a wide range of regularisation terms: it allows for more advanced penalty terms that are not required to be differentiable.
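As an illustration of the proximal machinery in general (not PFSCM's specific operator), the proximal operator of the absolute value — the classic non-differentiable penalty — has a simple closed form known as soft-thresholding:

```python
def prox_l1(v, t):
    """Proximal operator of t*|.|: argmin_x 0.5*(x - v)**2 + t*|x|.
    Well defined even though |.| is not differentiable at 0; values
    within [-t, t] are snapped to exactly 0, inducing sparsity."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0
```

Proximal descent alternates gradient steps on the smooth part of the cost with such prox steps on the non-differentiable penalty, which is why penalties like the second term of Eq. (6) become tractable.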
6 Conclusion
This paper proposed a brief overview of the subspace clustering task and the
main categories of methods proposed to address it. They differ in the under-
standing of the general aim and offer a large variety of approaches that provide
different types of outputs and knowledge extracted from the data.
Still, they have several properties in common. First, most of them rely on a non-constant distance measure: the comparison of two data points does not rely on a global measure, but on a local one that somehow takes the assignment to the
same cluster as a parameter to define this measure. As such, subspace clustering
constitutes a task that must extract from the data, in an unsupervised way, both
compact and distinct data subgroups, as well as the reasons why these subgroups
can be considered as compact. This makes it clear that subspace clustering is a highly demanding and difficult task, which aims at exploiting inputs with little information (indeed, the inputs reduce to the data positions in the feature space only) to extract very rich knowledge.
Moreover, it can be observed that many subspace clustering methods share
a constraint of sparsity: it requires subspaces to be as small as possible while still containing the clusters, without oversimplifying their complexity. A large
variety of criteria to define sparsity and integrate it into the task objective is
exploited across the existing approaches.
Among the directions for ongoing works in the subspace clustering domain,
a major one deals with the question of evaluation: as is especially the case for
any unsupervised learning task, there is no consensus about the quality criteria
to be used to assess the obtained results. The first category of methods, which exploit the subspace existence to learn an affinity matrix, usually focuses on evaluating the cluster quality: they resort to general clustering criteria, such as the clustering error measured as accuracy, the cluster purity, or Normalised Mutual Information. Thus they are often evaluated in a supervised manner,
References
1. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for
projected clustering. In: Proceedings of the International Conference on Manage-
ment of Data, SIGMOD, pp. 61–72. ACM (1999)
2. Aggarwal, C.C., Yu, P.S.: Finding generalized projected clusters in high dimen-
sional spaces. In: Proceedings of the International Conference on Management of
Data, SIGMOD, pp. 70–81. ACM (2000)
3. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clus-
tering of high dimensional data for data mining applications. In: Proceedings of
the ACM SIGMOD International Conference on Management of Data, SIGMOD,
pp. 94–105. ACM (1998)
4. Borgelt, C.: Fuzzy subspace clustering. In: Fink, A., Lausen, B., Seidel, W., Ultsch,
A. (eds.) Advances in Data Analysis, Data Handling and Business Intelligence.
Studies in Classification, Data Analysis, and Knowledge Organization, pp. 93–103.
Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-01044-6_8
5. Burdick, D., Calimlim, M., Gehrke, J.: MAFIA: a maximal frequent itemset algo-
rithm for transactional databases. In: Proceedings of the 17th International Con-
ference on Data Engineering, pp. 443–452 (2001)
6. Cheng, C.H., Fu, A.W., Zhang, Y.: Entropy-based subspace clustering for min-
ing numerical data. In: Proceedings of the 5th ACM International Conference on
Knowledge Discovery and Data Mining, pp. 84–93 (1999)
7. Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: Proceedings of the IEEE
International Conference on Computer Vision and Pattern Recognition, CVPR,
pp. 2790–2797 (2009)
8. Elhamifar, E., Vidal, R.: Sparse subspace clustering: algorithm, theory, and appli-
cations. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2765–2781 (2013)
Invited Keynotes
From Shallow to Deep Interactions Between Knowledge Representation, Reasoning and Machine Learning
Kay R. Amel
Abstract. Reasoning and learning are two basic concerns at the core of Artificial
Intelligence (AI). Over the last three decades, Knowledge Representation and
Reasoning (KRR) on the one hand and Machine Learning (ML) on the other have
been considerably developed and have specialised into a large number of dedicated
sub-fields. These technical developments and specialisations, while strengthening
the respective corpora of methods in KRR and in ML, have also contributed to an
almost complete separation of the lines of research in these two areas, leaving
many researchers on one side largely unaware of what was going on on the other side.
This state of affairs also rests in part on general, overly simplistic
dichotomies that suggest great differences between KRR and ML: KRR deals with
knowledge while ML handles data; KRR privileges symbolic, discrete approaches,
while numerical methods dominate ML. Even if such a rough picture points out
things that cannot be fully denied, it is also misleading: for instance, KRR can
deal with data as well (e.g., formal concept analysis), and ML approaches may
rely on symbolic knowledge (e.g., inductive logic programming). Indeed, the
frontier between the two fields is actually much blurrier than it appears, as
both share approaches such as Bayesian networks, case-based reasoning, and
analogical reasoning, as well as important concerns such as uncertainty
representation. In fact, one may well argue that the similarities between the
two fields are more numerous than one may think.
This talk proposes a tentative and original survey of meeting points between
KRR and ML. Some common concerns are first identified and discussed, such as
the types of representation used, the roles of knowledge and data, the lack or
the excess of information, and the need for explanations and causal understanding.

Kay R. Amel is the pen name of the working group “Apprentissage et Raisonnement” of
the GDR (“Groupement De Recherche”) “Aspects Formels et Algorithmiques de l’Intelligence
Artificielle”, CNRS, France (https://www.gdria.fr/presentation/). The contributors to this
paper include: Zied Bouraoui (CRIL, Lens, France, [email protected]), Antoine Cornuéjols
(AgroParisTech, Paris, France, [email protected]), Thierry Denoeux (Heudiasyc,
Compiègne, France, [email protected]), Sébastien Destercke (Heudiasyc, Compiègne, France,
[email protected]), Didier Dubois (IRIT, Toulouse, France, [email protected]), Romain
Guillaume (IRIT, Toulouse, France, [email protected]), Jérôme Mengin (IRIT, Toulouse,
France, [email protected]), Henri Prade (IRIT, Toulouse, France, [email protected]),
Steven Schockaert (School of Computer Science and Informatics, Cardiff, UK,
[email protected]), Mathieu Serrurier (IRIT, Toulouse, France, [email protected]),
Christel Vrain (LIFO, Orléans, France, [email protected]).

© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 447–448, 2019.
https://doi.org/10.1007/978-3-030-35514-2
Then some methodologies combining reasoning and learning are reviewed
(such as inductive logic programming, neuro-symbolic reasoning, formal concept
analysis, rule-based representations and machine learning, uncertainty assessment
in prediction, or case-based reasoning and analogical reasoning), before
discussing examples of synergies between KRR and ML (including topics such as
belief functions on regression, the EM algorithm versus revision, the semantic
description of vector representations, the combination of deep learning with
high-level inference, knowledge graph completion, declarative frameworks for
data mining, or preferences and recommendation).
The full paper will be the first step of a work in progress aiming at a better
mutual understanding of research in KRR and ML, and of how the two fields could
cooperate.
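As a small, concrete illustration of one of the shared mechanisms mentioned in this abstract, the sketch below implements the Boolean analogical proportion "a is to b as c is to d" that underlies analogy-based classification; the code and its function names are illustrative, not taken from the talk:

```python
def bool_ap(a, b, c, d):
    """Boolean analogical proportion a:b::c:d:
    a differs from b exactly as c differs from d."""
    return ((a and not b) == (c and not d)
            and (not a and b) == (not c and d))

def solve_ap(a, b, c):
    """Return the x (if any) such that a:b::c:x holds; None otherwise."""
    for x in (False, True):
        if bool_ap(a, b, c, x):
            return x
    return None

# Transfer a change of one Boolean attribute by analogy:
assert solve_ap(True, False, True) is False   # 1:0::1:? -> 0
assert solve_ap(False, False, True) is True   # 0:0::1:? -> 1
assert solve_ap(True, False, False) is None   # 1:0::0:? has no solution
```

In analogy-based classification, this equation is solved componentwise on the feature vectors of three known examples, and then on their class labels, to predict the label of a fourth example.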
Algebraic Approximations for Weighted Model
Counting
Wolfgang Gatterbauer
We discuss recently developed deterministic upper and lower bounds for the
probability of Boolean functions. The bounds result from treating multiple
occurrences of variables as independent and assigning them new individual
probabilities, an approach called dissociation. By performing several
dissociations, one can transform a Boolean formula whose probability is
difficult to compute into one whose probability is easy to compute.
Appropriately executed, these steps can give rise to a novel class of
inequalities from which upper and lower bounds can be derived efficiently. In
addition, the resulting bounds are oblivious, i.e., they
require only limited observations of the structure and parameters of the
problem. This technique can yield fast approximate schemes that generate upper
and lower bounds for various inference tasks.
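The effect of a single dissociation can be checked by brute force on a tiny monotone DNF. The sketch below (illustrative code, not from the talk) computes the exact probability of F = (x∧y) ∨ (x∧z) over independent variables and compares it with the dissociated formula F' = (x1∧y) ∨ (x2∧z), in which the two occurrences of x are replaced by independent copies keeping the same probability; for this disjunctive dissociation, F' yields an upper bound:

```python
from itertools import product

def prob(dnf, probs):
    """Exact probability that a monotone DNF (a list of clauses, each a
    list of variable names) is true under independent variables with the
    given marginals, computed by enumerating all possible worlds."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for v, bit in world.items():
            weight *= probs[v] if bit else 1.0 - probs[v]
        if any(all(world[v] for v in clause) for clause in dnf):
            total += weight
    return total

# F = (x and y) or (x and z): x occurs in both clauses.
exact = prob([["x", "y"], ["x", "z"]], {"x": 0.5, "y": 0.7, "z": 0.4})

# Dissociation: replace the occurrences of x by fresh copies x1, x2
# that each keep the original probability of x.
upper = prob([["x1", "y"], ["x2", "z"]],
             {"x1": 0.5, "x2": 0.5, "y": 0.7, "z": 0.4})

assert exact <= upper   # here exact = 0.41 and upper = 0.48
```

The bound is oblivious in the sense above: the copies x1 and x2 receive the probability of x without any further inspection of the formula.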
3 Talk Outline
We discuss Boolean formulas and their connection to weighted model counting.
We introduce dissociation-based bounds and draw the connection to approximate
knowledge compilation. We then illustrate the use of dissociation-based bounds
with three applications: (1) anytime approximations of monotone Boolean
formulas [7], (2) approximate lifted inference with relational databases [4, 5], and
(3) approximate weighted model counting [3]. If time remains, we will discuss
the similarities and differences to four other techniques that similarly fall into
Pearl’s classification of extensional approaches to uncertainty [9, Ch 1.1.4]: (i)
relaxation-based methods in logical optimization [8, Ch 13], (ii) relaxation &
compensation for approximate probabilistic inference in graphical models [2],
(iii) probabilistic soft logic that uses continuous relaxations in a smart way [1],
and (iv) quantization on algebraic decision diagrams [6]. The slides will be made
available at https://northeastern-datalab.github.io/afresearch/.
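For readers new to the central notion, weighted model counting can be made concrete with a brute-force sketch (illustrative code, not from the paper): each variable carries one weight per polarity, and the weighted model count of a CNF is the sum, over its satisfying assignments, of the products of the matching literal weights. With weights (1 − p, p), it equals the probability of the formula under independent variables:

```python
from itertools import product

def wmc(clauses, weights):
    """Weighted model count of a CNF by enumeration.
    clauses: list of clauses, each a list of signed variable ids
    (DIMACS style: 2 means x2, -2 means not x2);
    weights: maps each variable id to (weight_negative, weight_positive)."""
    names = sorted(weights)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        model = dict(zip(names, bits))
        # A clause is satisfied if at least one literal matches the model.
        if all(any(model[abs(l)] == (l > 0) for l in c) for c in clauses):
            weight = 1.0
            for v, b in model.items():
                weight *= weights[v][1] if b else weights[v][0]
            total += weight
    return total

# F = (x1 or x2) and (not x1 or x3)
cnf = [[1, 2], [-1, 3]]

# Unweighted: every literal weighs 1, so wmc returns the number of models.
count = wmc(cnf, {1: (1, 1), 2: (1, 1), 3: (1, 1)})   # 4 models

# Probabilistic weights (1 - p, p) give P(F) for independent variables.
p = wmc(cnf, {1: (0.5, 0.5), 2: (0.4, 0.6), 3: (0.3, 0.7)})  # 0.65
```

Exact WMC by enumeration is exponential in the number of variables; the dissociation-based bounds discussed above are one way to trade exactness for tractability.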
References
1. Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random fields
and probabilistic soft logic. J. Mach. Learn. Res. 18, 109:1–109:67 (2017)
2. Choi, A., Darwiche, A.: Relax then compensate: on max-product belief propagation
and more. In: NIPS, pp. 351–359 (2009)
3. Chou, L., Gatterbauer, W., Gogate, V.: Dissociation-based oblivious bounds for
weighted model counting. In: UAI (2018)
4. Gatterbauer, W., Suciu, D.: Approximate lifted inference with probabilistic
databases. PVLDB 8(5), 629–640 (2015)
5. Gatterbauer, W., Suciu, D.: Dissociation and propagation for approximate lifted
inference with standard relational database management systems. VLDB J. 26(1),
5–30 (2017)
6. Gogate, V., Domingos, P.: Approximation by quantization. In: UAI, pp. 247–255
(2011)
7. den Heuvel, M.V., Ivanov, P., Gatterbauer, W., Geerts, F., Theobald, M.: Anytime
approximation in probabilistic databases via scaled dissociations. In: SIGMOD, pp.
1295–1312 (2019)
8. Hooker, J.: Logic-Based Methods for Optimization: Combining Optimization and
Constraint Satisfaction. John Wiley & Sons (2000)
9. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Infer-
ence. Morgan Kaufmann (1988)
Author Index